diff --git a/docs/docs/Artifact-Extraction-Guide.md b/docs/docs/Artifact-Extraction-Guide.md
index 5672d781dc3b..b8702d2810fb 100644
--- a/docs/docs/Artifact-Extraction-Guide.md
+++ b/docs/docs/Artifact-Extraction-Guide.md
@@ -17,6 +17,9 @@ written to local shared storage, or to a remote S3 storage location. Refer to th
The choice of which artifacts to extract is highly configurable using the following properties.
+- `SUPPRESS_TRACKS`: When an action has this property set to `true`, no artifacts for that action
+ will be extracted and none of the other properties listed below will have any effect.
+
- `ARTIFACT_EXTRACTION_POLICY`: This property sets the high level policy controlling artifact extraction. It must have
one of the following values:
- `NONE`: No artifact extraction will be performed.
diff --git a/docs/docs/Derivative-Media-Guide.md b/docs/docs/Derivative-Media-Guide.md
index 0a2954221a36..9668d7911b20 100644
--- a/docs/docs/Derivative-Media-Guide.md
+++ b/docs/docs/Derivative-Media-Guide.md
@@ -67,150 +67,88 @@ To break down each stage of this pipeline:
- `KEYWORD TAGGING (WITH FF REGIONS) ACTION`: The KeywordTagging component will take the `TEXT` tracks from the
previous `TIKA TEXT` and `TESSERACT OCR` actions and perform keyword tagging. This will add the `TAGS`
- , `TRIGGER_WORDS`, and `TRIGGER_WORDS_OFFSET` properties to each track.
+ , `TRIGGER_WORDS`, and `TRIGGER_WORDS_OFFSET` properties to each track. This action has the
+ `IS_ANNOTATOR` property set to `TRUE`.
- `OCV GENERIC MARKUP DERIVATIVE MEDIA ONLY ACTION`: The Markup component will take the keyword-tagged `TEXT` tracks for
the derivative media and draw bounding boxes on the extracted images.
-## Task Merging
-The large blue rectangles in the diagram represent tasks that are merged together. The purpose of task merging is to
-consolidate how tracks are represented in the JSON output object by hiding redundant track information, and to make it
-appear that the behaviors of two or more actions are the result of a single algorithm.
+## Annotators
-For example, keyword tagging behavior is supplemental to the text detection behavior. It's more important that `TEXT`
-tracks are associated with the algorithm that performed text detection than the `KEYWORDTAGGING` algorithm. Note that in
-our pipeline only the `KEYWORD TAGGING` action has the `OUTPUT_MERGE_WITH_PREVIOUS_TASK` property set to `TRUE`. It has
-a similar effect in the source media flow and derivative media flow.
+When a pipeline does not use derivative media, an action with `IS_ANNOTATOR=true` always annotates
+the action immediately preceding it. When a pipeline uses derivative media, an action with
+`IS_ANNOTATOR=true` annotates the closest preceding action that was applicable to the media type.
+In the example above, the `KEYWORD TAGGING` action has `IS_ANNOTATOR=true`.
-In the source media flow the `TIKA TEXT` action is at the start of the merge chain while the `KEYWORD TAGGING` action is
-at the end of the merge chain. The tracks generated by the action at the end of the merge chain inherit the algorithm
-and track type from the tracks at the beginning of the merge chain. The effect is that in the JSON output object the
-tracks from the `TIKA TEXT` action will not be shown. Instead that action will be listed under `TRACKS MERGED`. The
-tracks from the `KEYWORD TAGGING` action will be shown with the `TIKATEXT` algorithm and `TEXT` track type:
+When determining which action `KEYWORD TAGGING` annotates in the source media flow, the
+`TESSERACT OCR` and `EAST` actions are considered, but are not selected because neither applies to
+the source media. The `TIKA TEXT` action is considered and then selected because it applies to the
+source media. Below is example output for the source media. The tracks contained in the `TEXT`
+section will include the properties added by the `TIKA TEXT` action and the properties added by the
+`KEYWORD TAGGING` action.
```json
-"output": {
- "TRACKS MERGED": [
- {
- "source": "+#TIKA IMAGE DETECTION ACTION#TIKA TEXT DETECTION SOURCE MEDIA ONLY ACTION",
- "algorithm": "TIKATEXT"
- }
- ],
- "MEDIA": [
- {
- "source": "+#TIKA IMAGE DETECTION ACTION",
- "algorithm": "TIKAIMAGE",
- "tracks": [ ... ]
- }
- ],
- "TEXT": [
- {
- "source": "+#TIKA IMAGE DETECTION ACTION#TIKA TEXT DETECTION SOURCE MEDIA ONLY ACTION#KEYWORD TAGGING (WITH FF REGION) ACTION",
- "algorithm": "TIKATEXT",
- "tracks": [ ... ]
- }
- ]
+{
+ "output": {
+ "MEDIA": [
+ {
+ "action": "TIKA IMAGE DETECTION ACTION",
+ "algorithm": "TIKAIMAGE",
+ "annotators": [],
+ "tracks": ["..."]
+ }
+ ],
+ "TEXT": [
+ {
+ "action": "TIKA TEXT DETECTION SOURCE MEDIA ONLY ACTION",
+ "algorithm": "TIKATEXT",
+ "annotators": ["KEYWORD TAGGING (WITH FF REGION) ACTION"],
+ "tracks": ["..."]
+ }
+ ]
+ }
}
```
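The selection rule described above can be sketched as follows. This is an illustrative sketch, not the actual WFM implementation; the function name, the `applies_to` encoding, and the pipeline representation are hypothetical:

```python
def select_annotated_action(pipeline, annotator_index, media_type):
    """Return the closest action before the annotator that applies to media_type."""
    # Walk backward from the annotator, skipping actions that do not
    # apply to this kind of media (e.g. DERIVATIVE MEDIA ONLY actions
    # are skipped when selecting for source media).
    for action in reversed(pipeline[:annotator_index]):
        if media_type in action["applies_to"]:
            return action["name"]
    return None

pipeline = [
    {"name": "TIKA IMAGE DETECTION ACTION", "applies_to": {"SOURCE", "DERIVATIVE"}},
    {"name": "TIKA TEXT DETECTION SOURCE MEDIA ONLY ACTION", "applies_to": {"SOURCE"}},
    {"name": "EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION", "applies_to": {"DERIVATIVE"}},
    {"name": "TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION",
     "applies_to": {"DERIVATIVE"}},
]

# KEYWORD TAGGING sits at index 4, after all of the actions above.
# In the source media flow it annotates TIKA TEXT; in the derivative
# media flow it annotates TESSERACT OCR.
source_target = select_annotated_action(pipeline, 4, "SOURCE")
derivative_target = select_annotated_action(pipeline, 4, "DERIVATIVE")
```

In this sketch, `TESSERACT OCR` and `EAST` are skipped when selecting for source media because neither applies to it, matching the behavior described above.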
-In the derivative media flow the `TESSERACT OCR` action is at the start of the merge chain while the `KEYWORD TAGGING`
-action is at the end of the merge chain. The effect is that in the JSON output object the tracks from
-the `TESSERACT OCR` action will not be shown. The tracks from the `KEYWORD TAGGING` action will be shown with
-the `TESSERACTOCR` algorithm and `TEXT` track type:
+When determining which action `KEYWORD TAGGING` annotates in the derivative media flow,
+`TESSERACT OCR` is selected because it is the closest action before `KEYWORD TAGGING` that applies
+to derivative media. Below is example output for the derivative media. The tracks contained in the
+`TEXT` section will include the properties added by the `TESSERACT OCR` action and the properties
+added by the `KEYWORD TAGGING` action.
```json
-"output": {
- "NO TRACKS": [
- {
- "source": "+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION#TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION#KEYWORD TAGGING (WITH FF REGION) ACTION#OCV GENERIC MARKUP DERIVATIVE MEDIA ONLY ACTION",
- "algorithm": "MARKUPCV"
- }
- ],
- "TRACKS MERGED": [
- {
- "source": "+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION#TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION",
- "algorithm": "TESSERACTOCR"
- }
- ],
- "TEXT": [
- {
- "source": "+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION#TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION#KEYWORD TAGGING (WITH FF REGION) ACTION",
- "algorithm": "TESSERACTOCR",
- "tracks": [ ... ]
- }
- ],
- "TEXT REGION": [
- {
- "source": "+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION",
- "algorithm": "EAST",
- "tracks": [ ... ]
- }
- ]
+{
+ "output": {
+ "NO TRACKS": [
+ {
+ "action": "OCV GENERIC MARKUP DERIVATIVE MEDIA ONLY ACTION",
+ "algorithm": "MARKUPCV",
+        "annotators": []
+ }
+ ],
+ "TEXT": [
+ {
+ "action": "TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION",
+ "algorithm": "TESSERACTOCR",
+ "annotators": ["KEYWORD TAGGING (WITH FF REGION) ACTION"],
+ "tracks": ["..."]
+ }
+ ],
+ "TEXT REGION": [
+ {
+ "action": "EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION",
+ "algorithm": "EAST",
+ "annotators": [],
+ "tracks": ["..."]
+ }
+ ]
+ }
}
```
+Note that a `MARKUP` action will never generate new tracks. It simply fills out the
+`media.markupResult` field in the JSON output object (not shown above).
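A consumer of the JSON output object can use the `annotators` field to determine which actions supplemented each group of tracks. A minimal sketch, assuming the output structure shown in the examples above (the helper name is hypothetical):

```python
import json

def summarize_annotators(output_object):
    """Map (track type, action) pairs to the list of actions that annotated them."""
    summary = {}
    for track_type, groups in output_object["output"].items():
        for group in groups:
            summary[(track_type, group["action"])] = group.get("annotators", [])
    return summary

# A trimmed version of the derivative media example above.
doc = json.loads("""{
  "output": {
    "TEXT": [
      {
        "action": "TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION",
        "algorithm": "TESSERACTOCR",
        "annotators": ["KEYWORD TAGGING (WITH FF REGION) ACTION"],
        "tracks": ["..."]
      }
    ]
  }
}""")

summary = summarize_annotators(doc)
```

Here the single `TEXT` group reports `KEYWORD TAGGING (WITH FF REGION) ACTION` as its annotator, so its tracks carry properties from both actions.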
-Note that a `MARKUP` action will never generate new tracks. It simply fills out the `media.markupResult` field in the
-JSON output object (not shown above).
-
-## Output Last Task Only
-
-If you want to omit all tracks from the JSON output object but the respective `TEXT` tracks for the source and
-derivative media, then in you can also set the `OUTPUT_LAST_TASK_ONLY` job property to `TRUE`. Note that the WFM only
-considers tasks that use `DETECTION` algorithms as the final task, so `MARKUP` is ignored. Setting this property will
-result in the following JSON for the source media:
-
-```json
-"output": {
- "TRACKS SUPPRESSED": [
- {
- "source": "+#TIKA IMAGE DETECTION ACTION",
- "algorithm": "TIKAIMAGE"
- },
- {
- "source": "+#TIKA IMAGE DETECTION ACTION#TIKA TEXT DETECTION SOURCE MEDIA ONLY ACTION",
- "algorithm": "TIKATEXT"
- }
- ],
- "TEXT": [
- {
- "source": "+#TIKA IMAGE DETECTION ACTION#TIKA TEXT DETECTION SOURCE MEDIA ONLY ACTION#KEYWORD TAGGING (WITH FF REGION) ACTION",
- "algorithm": "TIKATEXT",
- "tracks": [ ... ]
- }
- ]
-}
-```
-
-And the following JSON for the derivative media:
-
-```json
-"output": {
- "NO TRACKS": [
- {
- "source": "+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION#TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION#KEYWORD TAGGING (WITH FF REGION) ACTION#OCV GENERIC MARKUP DERIVATIVE MEDIA ONLY ACTION",
- "algorithm": "MARKUPCV"
- }
- ],
- "TRACKS SUPPRESSED": [
- {
- "source": "+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION",
- "algorithm": "EAST"
- },
- {
- "source": "+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION#TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION",
- "algorithm": "TESSERACTOCR"
- }
- ],
- "TEXT": [
- {
- "source": "+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION#TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION#KEYWORD TAGGING (WITH FF REGION) ACTION",
- "algorithm": "TESSERACTOCR",
- "tracks": [ ... ]
- }
- ]
-}
-```
# Developing Media Extraction Components
@@ -235,4 +173,4 @@ that components in the subsequent pipeline stages can handle the media type dete
# Default Pipelines
-OpenMPF comes with some default pipelines for detecting text in documents and other pipelines for detecting faces in documents. Refer to the TikaImageDetection [`descriptor.json`](https://github.com/openmpf/openmpf-components/blob/master/java/TikaImageDetection/plugin-files/descriptor/descriptor.json).
\ No newline at end of file
+OpenMPF comes with some default pipelines for detecting text in documents and other pipelines for detecting faces in documents. Refer to the TikaImageDetection [`descriptor.json`](https://github.com/openmpf/openmpf-components/blob/master/java/TikaImageDetection/plugin-files/descriptor/descriptor.json).
diff --git a/docs/docs/Media-Selectors-Guide.md b/docs/docs/Media-Selectors-Guide.md
index 1122580f8c74..bae9f03677d8 100644
--- a/docs/docs/Media-Selectors-Guide.md
+++ b/docs/docs/Media-Selectors-Guide.md
@@ -23,7 +23,7 @@ pipeline. The first stage performs language identification. The second performs
{
"mediaUri": "file:///opt/mpf/share/remote-media/test-json-path-translation.json",
"properties": {},
- "mediaSelectorsOutputAction": "ARGOS TRANSLATION (WITH FF REGION AND NO TASK MERGING) ACTION",
+ "mediaSelectorsOutputAction": "ARGOS TRANSLATION (WITH FF REGION AND NOT ANNOTATOR) ACTION",
"mediaSelectors": [
{
"type": "JSON_PATH",
@@ -406,7 +406,7 @@ pipeline. The first stage performs language identification. The second performs
{
"mediaUri": "file:///opt/mpf/share/remote-media/test-csv-translation.csv",
"properties": {},
- "mediaSelectorsOutputAction": "ARGOS TRANSLATION (WITH FF REGION AND NO TASK MERGING) ACTION",
+ "mediaSelectorsOutputAction": "ARGOS TRANSLATION (WITH FF REGION AND NOT ANNOTATOR) ACTION",
"mediaSelectors": [
{
"type": "CSV_COLS",
diff --git a/docs/docs/img/derivative-media-pipeline.png b/docs/docs/img/derivative-media-pipeline.png
index 2c6523e1ee4a..b17ba93bcc40 100644
Binary files a/docs/docs/img/derivative-media-pipeline.png and b/docs/docs/img/derivative-media-pipeline.png differ
diff --git a/docs/site/Artifact-Extraction-Guide/index.html b/docs/site/Artifact-Extraction-Guide/index.html
index 47080cb21f9e..e1a6c873c263 100644
--- a/docs/site/Artifact-Extraction-Guide/index.html
+++ b/docs/site/Artifact-Extraction-Guide/index.html
@@ -279,8 +279,14 @@
The choice of which artifacts to extract is highly configurable using the following properties.
ARTIFACT_EXTRACTION_POLICY: This property sets the high level policy controlling artifact extraction. It must have
-one of the following values:SUPPRESS_TRACKS: When an action has this property set to true, no artifacts for that action
+ will be extracted and none of the other properties listed below will have any effect.
ARTIFACT_EXTRACTION_POLICY: This property sets the high level policy controlling artifact extraction. It must have
+one of the following values:
NONE: No artifact extraction will be performed.VISUAL_TYPES_ONLY: Extract artifacts only for tracks associated with a "visual" data type. Visual data types
include IMAGE and VIDEO.KEYWORD TAGGING (WITH FF REGIONS) ACTION: The KeywordTagging component will take the TEXT tracks from the
previous TIKA TEXT and TESSERACT OCR actions and perform keyword tagging. This will add the TAGS
- , TRIGGER_WORDS, and TRIGGER_WORDS_OFFSET properties to each track.
+ , TRIGGER_WORDS, and TRIGGER_WORDS_OFFSET properties to each track. This action has the
+ IS_ANNOTATOR property set to TRUE.
OCV GENERIC MARKUP DERIVATIVE MEDIA ONLY ACTION: The Markup component will take the keyword-tagged TEXT tracks for
the derivative media and draw bounding boxes on the extracted images.The large blue rectangles in the diagram represent tasks that are merged together. The purpose of task merging is to -consolidate how tracks are represented in the JSON output object by hiding redundant track information, and to make it -appear that the behaviors of two or more actions are the result of a single algorithm.
-For example, keyword tagging behavior is supplemental to the text detection behavior. It's more important that TEXT
-tracks are associated with the algorithm that performed text detection than the KEYWORDTAGGING algorithm. Note that in
-our pipeline only the KEYWORD TAGGING action has the OUTPUT_MERGE_WITH_PREVIOUS_TASK property set to TRUE. It has
-a similar effect in the source media flow and derivative media flow.
In the source media flow the TIKA TEXT action is at the start of the merge chain while the KEYWORD TAGGING action is
-at the end of the merge chain. The tracks generated by the action at the end of the merge chain inherit the algorithm
-and track type from the tracks at the beginning of the merge chain. The effect is that in the JSON output object the
-tracks from the TIKA TEXT action will not be shown. Instead that action will be listed under TRACKS MERGED. The
-tracks from the KEYWORD TAGGING action will be shown with the TIKATEXT algorithm and TEXT track type:
"output": {
- "TRACKS MERGED": [
- {
- "source": "+#TIKA IMAGE DETECTION ACTION#TIKA TEXT DETECTION SOURCE MEDIA ONLY ACTION",
- "algorithm": "TIKATEXT"
- }
- ],
- "MEDIA": [
- {
- "source": "+#TIKA IMAGE DETECTION ACTION",
- "algorithm": "TIKAIMAGE",
- "tracks": [ ... ]
- }
- ],
- "TEXT": [
- {
- "source": "+#TIKA IMAGE DETECTION ACTION#TIKA TEXT DETECTION SOURCE MEDIA ONLY ACTION#KEYWORD TAGGING (WITH FF REGION) ACTION",
- "algorithm": "TIKATEXT",
- "tracks": [ ... ]
- }
- ]
-}
-
-In the derivative media flow the TESSERACT OCR action is at the start of the merge chain while the KEYWORD TAGGING
-action is at the end of the merge chain. The effect is that in the JSON output object the tracks from
-the TESSERACT OCR action will not be shown. The tracks from the KEYWORD TAGGING action will be shown with
-the TESSERACTOCR algorithm and TEXT track type:
"output": {
- "NO TRACKS": [
- {
- "source": "+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION#TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION#KEYWORD TAGGING (WITH FF REGION) ACTION#OCV GENERIC MARKUP DERIVATIVE MEDIA ONLY ACTION",
- "algorithm": "MARKUPCV"
- }
- ],
- "TRACKS MERGED": [
- {
- "source": "+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION#TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION",
- "algorithm": "TESSERACTOCR"
- }
- ],
- "TEXT": [
- {
- "source": "+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION#TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION#KEYWORD TAGGING (WITH FF REGION) ACTION",
- "algorithm": "TESSERACTOCR",
- "tracks": [ ... ]
- }
- ],
- "TEXT REGION": [
- {
- "source": "+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION",
- "algorithm": "EAST",
- "tracks": [ ... ]
- }
- ]
-}
-
-Note that a MARKUP action will never generate new tracks. It simply fills out the media.markupResult field in the
-JSON output object (not shown above).
If you want to omit all tracks from the JSON output object but the respective TEXT tracks for the source and
-derivative media, then in you can also set the OUTPUT_LAST_TASK_ONLY job property to TRUE. Note that the WFM only
-considers tasks that use DETECTION algorithms as the final task, so MARKUP is ignored. Setting this property will
-result in the following JSON for the source media:
"output": {
- "TRACKS SUPPRESSED": [
- {
- "source": "+#TIKA IMAGE DETECTION ACTION",
- "algorithm": "TIKAIMAGE"
- },
- {
- "source": "+#TIKA IMAGE DETECTION ACTION#TIKA TEXT DETECTION SOURCE MEDIA ONLY ACTION",
- "algorithm": "TIKATEXT"
- }
- ],
- "TEXT": [
- {
- "source": "+#TIKA IMAGE DETECTION ACTION#TIKA TEXT DETECTION SOURCE MEDIA ONLY ACTION#KEYWORD TAGGING (WITH FF REGION) ACTION",
- "algorithm": "TIKATEXT",
- "tracks": [ ... ]
- }
- ]
+Annotators
+When a pipeline does not use derivative media, an action with IS_ANNOTATOR=true always annotates
+the action immediately preceding it. When a pipeline uses derivative media, an action with
+IS_ANNOTATOR=true annotates the closest preceding action that was applicable to the media type.
+In the example above, the KEYWORD TAGGING action has IS_ANNOTATOR=true.
+When determining which action KEYWORD TAGGING annotates in the source media flow, the
+TESSERACT OCR and EAST actions are considered, but are not selected because neither applies to
+the source media. The TIKA TEXT action is considered and then selected because it applies to the
+source media. Below is example output for the source media. The tracks contained in the TEXT
+section will include the properties added by the TIKA TEXT action and the properties added by the
+KEYWORD TAGGING action.
+{
+ "output": {
+ "MEDIA": [
+ {
+ "action": "TIKA IMAGE DETECTION ACTION",
+ "algorithm": "TIKAIMAGE",
+ "annotators": [],
+ "tracks": ["..."]
+ }
+ ],
+ "TEXT": [
+ {
+ "action": "TIKA TEXT DETECTION SOURCE MEDIA ONLY ACTION",
+ "algorithm": "TIKATEXT",
+ "annotators": ["KEYWORD TAGGING (WITH FF REGION) ACTION"],
+ "tracks": ["..."]
+ }
+ ]
+ }
}
-And the following JSON for the derivative media:
-"output": {
- "NO TRACKS": [
- {
- "source": "+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION#TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION#KEYWORD TAGGING (WITH FF REGION) ACTION#OCV GENERIC MARKUP DERIVATIVE MEDIA ONLY ACTION",
- "algorithm": "MARKUPCV"
- }
- ],
- "TRACKS SUPPRESSED": [
- {
- "source": "+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION",
- "algorithm": "EAST"
- },
- {
- "source": "+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION#TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION",
- "algorithm": "TESSERACTOCR"
- }
- ],
- "TEXT": [
- {
- "source": "+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION#TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION#KEYWORD TAGGING (WITH FF REGION) ACTION",
- "algorithm": "TESSERACTOCR",
- "tracks": [ ... ]
- }
- ]
+When determining which action KEYWORD TAGGING annotates in the derivative media flow,
+TESSERACT OCR is selected because it is the closest action before KEYWORD TAGGING that applies
+to derivative media. Below is example output for the derivative media. The tracks contained in the
+TEXT section will include the properties added by the TESSERACT OCR action and the properties
+added by the KEYWORD TAGGING action.
+{
+ "output": {
+ "NO TRACKS": [
+ {
+ "action": "OCV GENERIC MARKUP DERIVATIVE MEDIA ONLY ACTION",
+ "algorithm": "MARKUPCV",
+        "annotators": []
+ }
+ ],
+ "TEXT": [
+ {
+ "action": "TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION",
+ "algorithm": "TESSERACTOCR",
+ "annotators": ["KEYWORD TAGGING (WITH FF REGION) ACTION"],
+ "tracks": ["..."]
+ }
+ ],
+ "TEXT REGION": [
+ {
+ "action": "EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION",
+ "algorithm": "EAST",
+ "annotators": [],
+ "tracks": ["..."]
+ }
+ ]
+ }
}
+Note that a MARKUP action will never generate new tracks. It simply fills out the
+media.markupResult field in the JSON output object (not shown above).
Developing Media Extraction Components
The WFM is not limited to working only with the TikaImageDetection component. Any component can be designed to generate
derivative media. The requirement is that it must generate MEDIA tracks, one piece of derivative media per track.
diff --git a/docs/site/Media-Selectors-Guide/index.html b/docs/site/Media-Selectors-Guide/index.html
index 9ce87afcec48..4216b9f08cc7 100644
--- a/docs/site/Media-Selectors-Guide/index.html
+++ b/docs/site/Media-Selectors-Guide/index.html
@@ -281,7 +281,7 @@
New Job Request Fields
{
"mediaUri": "file:///opt/mpf/share/remote-media/test-json-path-translation.json",
"properties": {},
- "mediaSelectorsOutputAction": "ARGOS TRANSLATION (WITH FF REGION AND NO TASK MERGING) ACTION",
+ "mediaSelectorsOutputAction": "ARGOS TRANSLATION (WITH FF REGION AND NOT ANNOTATOR) ACTION",
"mediaSelectors": [
{
"type": "JSON_PATH",
@@ -727,7 +727,7 @@ CSV_COLS Output File
{
"mediaUri": "file:///opt/mpf/share/remote-media/test-csv-translation.csv",
"properties": {},
- "mediaSelectorsOutputAction": "ARGOS TRANSLATION (WITH FF REGION AND NO TASK MERGING) ACTION",
+ "mediaSelectorsOutputAction": "ARGOS TRANSLATION (WITH FF REGION AND NOT ANNOTATOR) ACTION",
"mediaSelectors": [
{
"type": "CSV_COLS",
diff --git a/docs/site/img/derivative-media-pipeline.png b/docs/site/img/derivative-media-pipeline.png
index 2c6523e1ee4a..095a4ac1f174 100644
Binary files a/docs/site/img/derivative-media-pipeline.png and b/docs/site/img/derivative-media-pipeline.png differ
diff --git a/docs/site/index.html b/docs/site/index.html
index fb30cd281786..14d5f36cd09d 100644
--- a/docs/site/index.html
+++ b/docs/site/index.html
@@ -443,5 +443,5 @@ Overview
diff --git a/docs/site/search/search_index.json b/docs/site/search/search_index.json
index db82e3caff1e..81e7559b818f 100644
--- a/docs/site/search/search_index.json
+++ b/docs/site/search/search_index.json
@@ -417,7 +417,7 @@
},
{
"location": "/Derivative-Media-Guide/index.html",
- "text": "NOTICE:\n This software (or technical data) was produced for the U.S. Government under contract, and is subject to the\nRights in Data-General Clause 52.227-14, Alt. IV (DEC 2007). Copyright 2024 The MITRE Corporation. All Rights Reserved.\n\n\nIntroduction\n\n\nThis guide covers the derivative media feature, which allows users to create pipelines where a component in one of\nthe initial stages of the pipeline generates one or more derivative (aka child) media from the source (aka parent)\nmedia. A common scenario is to extract images from PDFs or other document formats. Once extracted, the Workflow Manager\n(WFM) can perform the subsequent pipeline stages on the source media (if necessary) as well as the derivative media.\nThis differs from typical pipeline execution, which only acts on one or more pieces of source media.\n\n\nComponent actions can be configured to only be performed on source media or derivative media. This is often necessary\nbecause the source media has a different media type than the derivative media, and therefore different actions are\nrequired to process each type of media. For example, PDFs are assigned the \nUNKNOWN\n media type (since the WFM is not\ndesigned to handle them in any special way), while the images extracted from a PDF are assigned the \nIMAGE\n media type.\nAn action for the TikaTextDetection component can process the \nUNKNOWN\n source media to generate \nTEXT\n tracks by\ndetecting the embedded raw character data in the PDF itself, while an action for the TesseractOCRTextDetection component\ncan process the \nIMAGE\n derivative media to generate \nTEXT\n tracks by detecting text in the image data.\n\n\nText Detection Example\n\n\nConsider the following diagram which depicts a pipeline to accomplish generating \nTEXT\n tracks for PDFs which contain\nembedded raw character data and embedded images with text:\n\n\n\n\nEach block represents a single action performed in that stage of the pipeline. 
(Technically, a pipeline consists of\ntasks executed in sequence, but in this case each task consists of only one action, so we just show the actions.)\nActions that have \nSOURCE MEDIA ONLY\n in their name have the \nSOURCE_MEDIA_ONLY\n property set to \nTRUE\n, which will\nresult in completely skipping that action for derivative media. The component associated with the action will not\nreceive sub-job messages and there will be no representation of the action being executed on derivative media in the\nJSON output object.\n\n\nSimilarly, actions that have \nDERIVATIVE MEDIA ONLY\n in their name have the \nDERIVATIVE_MEDIA_ONLY\n property set\nto \nTRUE\n, which will result in completely skipping that action for source media. Note that setting both properties\nto \nTRUE\n will result in skipping the action for both derivative and source media, which means it will never be\nexecuted. Not setting either property will result in executing the action on both source and derivative media, as you\nsee in the diagram with the \nKEYWORD TAGGING\n action.\n\n\nNote that the actions shown in the source media flow and derivative media flow are \nnot\n executed at the same time.\nThe flows are shown in different rows in the diagram to illustrate the logical separation, not to illustrate\nconcurrency. To be clear, each action in the pipeline is executed sequentially. If an action is missing from a flow it\njust means that no sub-job messages are generated for that kind of media during that stage of the pipeline. If an action\nis shown in both flows then sub-jobs will be performed on both the source and derivative media during that stage.\n\n\nTo break down each stage of this pipeline:\n\n\n\n\nTIKA IMAGE DETECTION ACTION\n: The TikaImageDetection component will extract images from PDFs (or other document\n formats) and place them in \n$MPF_HOME/share/tmp/derivative-media/\n. 
One \nMEDIA\n track will be generated for\n each image and it will have \nDERIVATIVE_MEDIA_TEMP_PATH\n and \nPAGE_NUM\n track properties.\n\n\nIf remote storage is enabled, the WFM will upload the objects to the object store after this action is performed.\n Refer to the \nObject Storage Guide\n for more information.\n\n\nThe WFM will perform media inspection on the images at this time.\n\n\nEach piece of derivative media will have a parent media id set to the media id value of the source media. It will\n appear as \nmedia.parentMediaId\n in the JSON output object. For source media the value will be -1.\n\n\nEach piece of derivative media will have a \nmedia.mediaMetadata\n property of \nIS_DERIVATIVE_MEDIA\n set to \nTRUE\n.\n The metadata will also contain the \nPAGE_NUM\n property.\n \n\n\nTIKA TEXT DETECTION SOURCE MEDIA ONLY ACTION\n: The TikaTextDetection component will generate \nTEXT\n tracks by\n detecting the embedded raw character data in the PDF.\n \n\n\nEAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION\n: The EastTextDetection component will generate \nTEXT REGION\n tracks\n for each text region in the extracted images.\n \n\n\nTESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION\n: The TesseractOCRTextDetection component\n will generate \nTEXT\n tracks by performing OCR on the text regions passed forward from the previous EAST action.\n \n\n\nKEYWORD TAGGING (WITH FF REGIONS) ACTION\n: The KeywordTagging component will take the \nTEXT\n tracks from the\n previous \nTIKA TEXT\n and \nTESSERACT OCR\n actions and perform keyword tagging. 
This will add the \nTAGS\n\n , \nTRIGGER_WORDS\n, and \nTRIGGER_WORDS_OFFSET\n properties to each track.\n \n\n\nOCV GENERIC MARKUP DERIVATIVE MEDIA ONLY ACTION\n: The Markup component will take the keyword-tagged \nTEXT\n tracks for\n the derivative media and draw bounding boxes on the extracted images.\n\n\n\n\nTask Merging\n\n\nThe large blue rectangles in the diagram represent tasks that are merged together. The purpose of task merging is to\nconsolidate how tracks are represented in the JSON output object by hiding redundant track information, and to make it\nappear that the behaviors of two or more actions are the result of a single algorithm.\n\n\nFor example, keyword tagging behavior is supplemental to the text detection behavior. It's more important that \nTEXT\n\ntracks are associated with the algorithm that performed text detection than the \nKEYWORDTAGGING\n algorithm. Note that in\nour pipeline only the \nKEYWORD TAGGING\n action has the \nOUTPUT_MERGE_WITH_PREVIOUS_TASK\n property set to \nTRUE\n. It has\na similar effect in the source media flow and derivative media flow.\n\n\nIn the source media flow the \nTIKA TEXT\n action is at the start of the merge chain while the \nKEYWORD TAGGING\n action is\nat the end of the merge chain. The tracks generated by the action at the end of the merge chain inherit the algorithm\nand track type from the tracks at the beginning of the merge chain. The effect is that in the JSON output object the\ntracks from the \nTIKA TEXT\n action will not be shown. Instead that action will be listed under \nTRACKS MERGED\n. 
The\ntracks from the \nKEYWORD TAGGING\n action will be shown with the \nTIKATEXT\n algorithm and \nTEXT\n track type:\n\n\n\"output\": {\n \"TRACKS MERGED\": [\n {\n \"source\": \"+#TIKA IMAGE DETECTION ACTION#TIKA TEXT DETECTION SOURCE MEDIA ONLY ACTION\",\n \"algorithm\": \"TIKATEXT\"\n }\n ],\n \"MEDIA\": [\n {\n \"source\": \"+#TIKA IMAGE DETECTION ACTION\",\n \"algorithm\": \"TIKAIMAGE\",\n \"tracks\": [ ... ]\n }\n ],\n \"TEXT\": [\n {\n \"source\": \"+#TIKA IMAGE DETECTION ACTION#TIKA TEXT DETECTION SOURCE MEDIA ONLY ACTION#KEYWORD TAGGING (WITH FF REGION) ACTION\",\n \"algorithm\": \"TIKATEXT\",\n \"tracks\": [ ... ]\n }\n ]\n}\n\n\n\nIn the derivative media flow the \nTESSERACT OCR\n action is at the start of the merge chain while the \nKEYWORD TAGGING\n\naction is at the end of the merge chain. The effect is that in the JSON output object the tracks from\nthe \nTESSERACT OCR\n action will not be shown. The tracks from the \nKEYWORD TAGGING\n action will be shown with\nthe \nTESSERACTOCR\n algorithm and \nTEXT\n track type:\n\n\n\"output\": {\n \"NO TRACKS\": [\n {\n \"source\": \"+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION#TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION#KEYWORD TAGGING (WITH FF REGION) ACTION#OCV GENERIC MARKUP DERIVATIVE MEDIA ONLY ACTION\",\n \"algorithm\": \"MARKUPCV\"\n }\n ],\n \"TRACKS MERGED\": [\n {\n \"source\": \"+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION#TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION\",\n \"algorithm\": \"TESSERACTOCR\"\n }\n ],\n \"TEXT\": [\n {\n \"source\": \"+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION#TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION#KEYWORD TAGGING (WITH FF REGION) ACTION\",\n \"algorithm\": \"TESSERACTOCR\",\n \"tracks\": [ ... ]\n }\n ],\n \"TEXT REGION\": [\n {\n \"source\": \"+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION\",\n \"algorithm\": \"EAST\",\n \"tracks\": [ ... 
]\n }\n ]\n}\n\n\n\nNote that a \nMARKUP\n action will never generate new tracks. It simply fills out the \nmedia.markupResult\n field in the\nJSON output object (not shown above).\n\n\nOutput Last Task Only\n\n\nIf you want to omit all tracks from the JSON output object but the respective \nTEXT\n tracks for the source and\nderivative media, then in you can also set the \nOUTPUT_LAST_TASK_ONLY\n job property to \nTRUE\n. Note that the WFM only\nconsiders tasks that use \nDETECTION\n algorithms as the final task, so \nMARKUP\n is ignored. Setting this property will\nresult in the following JSON for the source media:\n\n\n\"output\": {\n \"TRACKS SUPPRESSED\": [\n {\n \"source\": \"+#TIKA IMAGE DETECTION ACTION\",\n \"algorithm\": \"TIKAIMAGE\"\n },\n {\n \"source\": \"+#TIKA IMAGE DETECTION ACTION#TIKA TEXT DETECTION SOURCE MEDIA ONLY ACTION\",\n \"algorithm\": \"TIKATEXT\"\n }\n ],\n \"TEXT\": [\n {\n \"source\": \"+#TIKA IMAGE DETECTION ACTION#TIKA TEXT DETECTION SOURCE MEDIA ONLY ACTION#KEYWORD TAGGING (WITH FF REGION) ACTION\",\n \"algorithm\": \"TIKATEXT\", \n \"tracks\": [ ... 
]\n }\n ]\n}\n\n\n\nAnd the following JSON for the derivative media:\n\n\n\"output\": {\n \"NO TRACKS\": [\n {\n \"source\": \"+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION#TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION#KEYWORD TAGGING (WITH FF REGION) ACTION#OCV GENERIC MARKUP DERIVATIVE MEDIA ONLY ACTION\",\n \"algorithm\": \"MARKUPCV\"\n }\n ],\n \"TRACKS SUPPRESSED\": [\n {\n \"source\": \"+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION\",\n \"algorithm\": \"EAST\"\n },\n {\n \"source\": \"+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION#TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION\",\n \"algorithm\": \"TESSERACTOCR\"\n }\n ],\n \"TEXT\": [\n {\n \"source\": \"+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION#TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION#KEYWORD TAGGING (WITH FF REGION) ACTION\",\n \"algorithm\": \"TESSERACTOCR\",\n \"tracks\": [ ... ]\n }\n ]\n}\n\n\n\nDeveloping Media Extraction Components\n\n\nThe WFM is not limited to working only with the TikaImageDetection component. Any component can be designed to generate\nderivative media. The requirement is that it must generate \nMEDIA\n tracks, one piece of derivative media per track.\nMinimally, each track must have a \nDERIVATIVE_MEDIA_TEMP_PATH\n property set to the location of the media. By convention,\nthe media should be placed in a top-level directory of the form \n$MPF_HOME/share/tmp/derivative-media/\n. When\nthe job is done running, the media will be moved to persistent storage in \n$MPF_HOME/share/derivative-media/\n if\nremote storage is not enabled.\n\n\nSpecifically, TikaImageDetection uses paths of the\nform \n$MPF_HOME/share/tmp/derivative-media//tika-extracted//image.\n. 
The \n\n part ensures\nthat the results of two different actions executed within the same job on the same source media, or actions executed\nwithin the same job on different source media files, do not conflict with each other. A new \n\n is generated for\neach invocation of \nGetDetections()\n on the component.\n\n\nYour media extraction component can optionally include other track properties. These will get added to the derivative\nmedia metadata. For example, TikaImageDetection adds the \nPAGE_NUM\n property.\n\n\nNote that although this guide only talks about derivative images, your component can generate any kind of media. Be sure\nthat components in the subsequent pipeline stages can handle the media type detected by WFM media inspection.\n\n\nDefault Pipelines\n\n\nOpenMPF comes with some default pipelines for detecting text in documents and other pipelines for detecting faces in documents. Refer to the TikaImageDetection \ndescriptor.json\n.",
+ "text": "NOTICE:\n This software (or technical data) was produced for the U.S. Government under contract, and is subject to the\nRights in Data-General Clause 52.227-14, Alt. IV (DEC 2007). Copyright 2024 The MITRE Corporation. All Rights Reserved.\n\n\nIntroduction\n\n\nThis guide covers the derivative media feature, which allows users to create pipelines where a component in one of\nthe initial stages of the pipeline generates one or more derivative (aka child) media from the source (aka parent)\nmedia. A common scenario is to extract images from PDFs or other document formats. Once extracted, the Workflow Manager\n(WFM) can perform the subsequent pipeline stages on the source media (if necessary) as well as the derivative media.\nThis differs from typical pipeline execution, which only acts on one or more pieces of source media.\n\n\nComponent actions can be configured to only be performed on source media or derivative media. This is often necessary\nbecause the source media has a different media type than the derivative media, and therefore different actions are\nrequired to process each type of media. For example, PDFs are assigned the \nUNKNOWN\n media type (since the WFM is not\ndesigned to handle them in any special way), while the images extracted from a PDF are assigned the \nIMAGE\n media type.\nAn action for the TikaTextDetection component can process the \nUNKNOWN\n source media to generate \nTEXT\n tracks by\ndetecting the embedded raw character data in the PDF itself, while an action for the TesseractOCRTextDetection component\ncan process the \nIMAGE\n derivative media to generate \nTEXT\n tracks by detecting text in the image data.\n\n\nText Detection Example\n\n\nConsider the following diagram which depicts a pipeline to accomplish generating \nTEXT\n tracks for PDFs which contain\nembedded raw character data and embedded images with text:\n\n\n\n\nEach block represents a single action performed in that stage of the pipeline. 
(Technically, a pipeline consists of\ntasks executed in sequence, but in this case each task consists of only one action, so we just show the actions.)\nActions that have \nSOURCE MEDIA ONLY\n in their name have the \nSOURCE_MEDIA_ONLY\n property set to \nTRUE\n, which will\nresult in completely skipping that action for derivative media. The component associated with the action will not\nreceive sub-job messages and there will be no representation of the action being executed on derivative media in the\nJSON output object.\n\n\nSimilarly, actions that have \nDERIVATIVE MEDIA ONLY\n in their name have the \nDERIVATIVE_MEDIA_ONLY\n property set\nto \nTRUE\n, which will result in completely skipping that action for source media. Note that setting both properties\nto \nTRUE\n will result in skipping the action for both derivative and source media, which means it will never be\nexecuted. Not setting either property will result in executing the action on both source and derivative media, as you\nsee in the diagram with the \nKEYWORD TAGGING\n action.\n\n\nNote that the actions shown in the source media flow and derivative media flow are \nnot\n executed at the same time.\nThe flows are shown in different rows in the diagram to illustrate the logical separation, not to illustrate\nconcurrency. To be clear, each action in the pipeline is executed sequentially. If an action is missing from a flow it\njust means that no sub-job messages are generated for that kind of media during that stage of the pipeline. If an action\nis shown in both flows then sub-jobs will be performed on both the source and derivative media during that stage.\n\n\nTo break down each stage of this pipeline:\n\n\n\n\nTIKA IMAGE DETECTION ACTION\n: The TikaImageDetection component will extract images from PDFs (or other document\n formats) and place them in \n$MPF_HOME/share/tmp/derivative-media/\n. 
One \nMEDIA\n track will be generated for\n each image and it will have \nDERIVATIVE_MEDIA_TEMP_PATH\n and \nPAGE_NUM\n track properties.\n\n\nIf remote storage is enabled, the WFM will upload the objects to the object store after this action is performed.\n Refer to the \nObject Storage Guide\n for more information.\n\n\nThe WFM will perform media inspection on the images at this time.\n\n\nEach piece of derivative media will have a parent media id set to the media id value of the source media. It will\n appear as \nmedia.parentMediaId\n in the JSON output object. For source media the value will be -1.\n\n\nEach piece of derivative media will have a \nmedia.mediaMetadata\n property of \nIS_DERIVATIVE_MEDIA\n set to \nTRUE\n.\n The metadata will also contain the \nPAGE_NUM\n property.\n \n\n\nTIKA TEXT DETECTION SOURCE MEDIA ONLY ACTION\n: The TikaTextDetection component will generate \nTEXT\n tracks by\n detecting the embedded raw character data in the PDF.\n \n\n\nEAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION\n: The EastTextDetection component will generate \nTEXT REGION\n tracks\n for each text region in the extracted images.\n \n\n\nTESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION\n: The TesseractOCRTextDetection component\n will generate \nTEXT\n tracks by performing OCR on the text regions passed forward from the previous EAST action.\n \n\n\nKEYWORD TAGGING (WITH FF REGIONS) ACTION\n: The KeywordTagging component will take the \nTEXT\n tracks from the\n previous \nTIKA TEXT\n and \nTESSERACT OCR\n actions and perform keyword tagging. This will add the \nTAGS\n\n , \nTRIGGER_WORDS\n, and \nTRIGGER_WORDS_OFFSET\n properties to each track. 
The action has the\n \nIS_ANNOTATOR\n property set to \nTRUE\n.\n\n \n\n\nOCV GENERIC MARKUP DERIVATIVE MEDIA ONLY ACTION\n: The Markup component will take the keyword-tagged \nTEXT\n tracks for\n the derivative media and draw bounding boxes on the extracted images.\n\n\n\n\nAnnotators\n\n\nWhen a pipeline does not use derivative media, an action with \nIS_ANNOTATOR=true\n always annotates\nthe action immediately preceding it. When a pipeline uses derivative media, an action with\n\nIS_ANNOTATOR=true\n annotates the last action that was applicable to the media type. In the example\nabove, the \nKEYWORD TAGGING\n action has \nIS_ANNOTATOR=true\n.\n\n\nWhen determining which action \nKEYWORD TAGGING\n annotates in the source media flow, the\n\nTESSERACT OCR\n and \nEAST\n actions are considered, but are not selected because neither applies to\nthe source media. The \nTIKA TEXT\n action is considered and then selected because it applies to the\nsource media. Below is example output for the source media. The tracks contained in the \nTEXT\n\nsection will include the properties added by the \nTIKA TEXT\n action and the properties added by the\n\nKEYWORD TAGGING\n action.\n\n\n{\n \"output\": {\n \"MEDIA\": [\n {\n \"action\": \"TIKA IMAGE DETECTION ACTION\",\n \"algorithm\": \"TIKAIMAGE\",\n \"annotators\": [],\n \"tracks\": [\"...\"]\n }\n ],\n \"TEXT\": [\n {\n \"action\": \"TIKA TEXT DETECTION SOURCE MEDIA ONLY ACTION\",\n \"algorithm\": \"TIKATEXT\",\n \"annotators\": [\"KEYWORD TAGGING (WITH FF REGION) ACTION\"],\n \"tracks\": [\"...\"]\n }\n ]\n }\n}\n\n\n\nWhen determining which action \nKEYWORD TAGGING\n annotates in the derivative media flow,\n\nTESSERACT OCR\n is selected because it is the first action before \nKEYWORD TAGGING\n that applies to\nderivative media. Below is example output for the derivative media. 
The tracks contained in the\n\nTEXT\n section will include the properties added by the \nTESSERACT OCR\n action and the properties\nadded by the \nKEYWORD TAGGING\n action.\n\n\n{\n \"output\": {\n \"NO TRACKS\": [\n {\n \"action\": \"OCV GENERIC MARKUP DERIVATIVE MEDIA ONLY ACTION\",\n \"algorithm\": \"MARKUPCV\",\n \"annotators\": []\n }\n ],\n \"TEXT\": [\n {\n \"action\": \"TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION\",\n \"algorithm\": \"TESSERACTOCR\",\n \"annotators\": [\"KEYWORD TAGGING (WITH FF REGION) ACTION\"],\n \"tracks\": [\"...\"]\n }\n ],\n \"TEXT REGION\": [\n {\n \"action\": \"EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION\",\n \"algorithm\": \"EAST\",\n \"annotators\": [],\n \"tracks\": [\"...\"]\n }\n ]\n }\n}\n\n\n\nNote that a \nMARKUP\n action will never generate new tracks. It simply fills out the\n\nmedia.markupResult\n field in the JSON output object (not shown above).\n\n\nDeveloping Media Extraction Components\n\n\nThe WFM is not limited to working only with the TikaImageDetection component. Any component can be designed to generate\nderivative media. The requirement is that it must generate \nMEDIA\n tracks, one piece of derivative media per track.\nMinimally, each track must have a \nDERIVATIVE_MEDIA_TEMP_PATH\n property set to the location of the media. By convention,\nthe media should be placed in a top-level directory of the form \n$MPF_HOME/share/tmp/derivative-media/\n. When\nthe job is done running, the media will be moved to persistent storage in \n$MPF_HOME/share/derivative-media/\n if\nremote storage is not enabled.\n\n\nSpecifically, TikaImageDetection uses paths of the\nform \n$MPF_HOME/share/tmp/derivative-media//tika-extracted//image.\n. The \n\n part ensures\nthat the results of two different actions executed within the same job on the same source media, or actions executed\nwithin the same job on different source media files, do not conflict with each other. 
A new \n\n is generated for\neach invocation of \nGetDetections()\n on the component.\n\n\nYour media extraction component can optionally include other track properties. These will get added to the derivative\nmedia metadata. For example, TikaImageDetection adds the \nPAGE_NUM\n property.\n\n\nNote that although this guide only talks about derivative images, your component can generate any kind of media. Be sure\nthat components in the subsequent pipeline stages can handle the media type detected by WFM media inspection.\n\n\nDefault Pipelines\n\n\nOpenMPF comes with some default pipelines for detecting text in documents and other pipelines for detecting faces in documents. Refer to the TikaImageDetection \ndescriptor.json\n.",
"title": "Derivative Media Guide"
},
{
@@ -427,18 +427,13 @@
},
{
"location": "/Derivative-Media-Guide/index.html#text-detection-example",
- "text": "Consider the following diagram which depicts a pipeline to accomplish generating TEXT tracks for PDFs which contain\nembedded raw character data and embedded images with text: Each block represents a single action performed in that stage of the pipeline. (Technically, a pipeline consists of\ntasks executed in sequence, but in this case each task consists of only one action, so we just show the actions.)\nActions that have SOURCE MEDIA ONLY in their name have the SOURCE_MEDIA_ONLY property set to TRUE , which will\nresult in completely skipping that action for derivative media. The component associated with the action will not\nreceive sub-job messages and there will be no representation of the action being executed on derivative media in the\nJSON output object. Similarly, actions that have DERIVATIVE MEDIA ONLY in their name have the DERIVATIVE_MEDIA_ONLY property set\nto TRUE , which will result in completely skipping that action for source media. Note that setting both properties\nto TRUE will result in skipping the action for both derivative and source media, which means it will never be\nexecuted. Not setting either property will result in executing the action on both source and derivative media, as you\nsee in the diagram with the KEYWORD TAGGING action. Note that the actions shown in the source media flow and derivative media flow are not executed at the same time.\nThe flows are shown in different rows in the diagram to illustrate the logical separation, not to illustrate\nconcurrency. To be clear, each action in the pipeline is executed sequentially. If an action is missing from a flow it\njust means that no sub-job messages are generated for that kind of media during that stage of the pipeline. If an action\nis shown in both flows then sub-jobs will be performed on both the source and derivative media during that stage. 
To break down each stage of this pipeline: TIKA IMAGE DETECTION ACTION : The TikaImageDetection component will extract images from PDFs (or other document\n formats) and place them in $MPF_HOME/share/tmp/derivative-media/ . One MEDIA track will be generated for\n each image and it will have DERIVATIVE_MEDIA_TEMP_PATH and PAGE_NUM track properties. If remote storage is enabled, the WFM will upload the objects to the object store after this action is performed.\n Refer to the Object Storage Guide for more information. The WFM will perform media inspection on the images at this time. Each piece of derivative media will have a parent media id set to the media id value of the source media. It will\n appear as media.parentMediaId in the JSON output object. For source media the value will be -1. Each piece of derivative media will have a media.mediaMetadata property of IS_DERIVATIVE_MEDIA set to TRUE .\n The metadata will also contain the PAGE_NUM property.\n TIKA TEXT DETECTION SOURCE MEDIA ONLY ACTION : The TikaTextDetection component will generate TEXT tracks by\n detecting the embedded raw character data in the PDF.\n EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION : The EastTextDetection component will generate TEXT REGION tracks\n for each text region in the extracted images.\n TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION : The TesseractOCRTextDetection component\n will generate TEXT tracks by performing OCR on the text regions passed forward from the previous EAST action.\n KEYWORD TAGGING (WITH FF REGIONS) ACTION : The KeywordTagging component will take the TEXT tracks from the\n previous TIKA TEXT and TESSERACT OCR actions and perform keyword tagging. This will add the TAGS \n , TRIGGER_WORDS , and TRIGGER_WORDS_OFFSET properties to each track.\n OCV GENERIC MARKUP DERIVATIVE MEDIA ONLY ACTION : The Markup component will take the keyword-tagged TEXT tracks for\n the derivative media and draw bounding boxes on the extracted images.",
+ "text": "Consider the following diagram which depicts a pipeline to accomplish generating TEXT tracks for PDFs which contain\nembedded raw character data and embedded images with text: Each block represents a single action performed in that stage of the pipeline. (Technically, a pipeline consists of\ntasks executed in sequence, but in this case each task consists of only one action, so we just show the actions.)\nActions that have SOURCE MEDIA ONLY in their name have the SOURCE_MEDIA_ONLY property set to TRUE , which will\nresult in completely skipping that action for derivative media. The component associated with the action will not\nreceive sub-job messages and there will be no representation of the action being executed on derivative media in the\nJSON output object. Similarly, actions that have DERIVATIVE MEDIA ONLY in their name have the DERIVATIVE_MEDIA_ONLY property set\nto TRUE , which will result in completely skipping that action for source media. Note that setting both properties\nto TRUE will result in skipping the action for both derivative and source media, which means it will never be\nexecuted. Not setting either property will result in executing the action on both source and derivative media, as you\nsee in the diagram with the KEYWORD TAGGING action. Note that the actions shown in the source media flow and derivative media flow are not executed at the same time.\nThe flows are shown in different rows in the diagram to illustrate the logical separation, not to illustrate\nconcurrency. To be clear, each action in the pipeline is executed sequentially. If an action is missing from a flow it\njust means that no sub-job messages are generated for that kind of media during that stage of the pipeline. If an action\nis shown in both flows then sub-jobs will be performed on both the source and derivative media during that stage. 
To break down each stage of this pipeline: TIKA IMAGE DETECTION ACTION : The TikaImageDetection component will extract images from PDFs (or other document\n formats) and place them in $MPF_HOME/share/tmp/derivative-media/ . One MEDIA track will be generated for\n each image and it will have DERIVATIVE_MEDIA_TEMP_PATH and PAGE_NUM track properties. If remote storage is enabled, the WFM will upload the objects to the object store after this action is performed.\n Refer to the Object Storage Guide for more information. The WFM will perform media inspection on the images at this time. Each piece of derivative media will have a parent media id set to the media id value of the source media. It will\n appear as media.parentMediaId in the JSON output object. For source media the value will be -1. Each piece of derivative media will have a media.mediaMetadata property of IS_DERIVATIVE_MEDIA set to TRUE .\n The metadata will also contain the PAGE_NUM property.\n TIKA TEXT DETECTION SOURCE MEDIA ONLY ACTION : The TikaTextDetection component will generate TEXT tracks by\n detecting the embedded raw character data in the PDF.\n EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION : The EastTextDetection component will generate TEXT REGION tracks\n for each text region in the extracted images.\n TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION : The TesseractOCRTextDetection component\n will generate TEXT tracks by performing OCR on the text regions passed forward from the previous EAST action.\n KEYWORD TAGGING (WITH FF REGIONS) ACTION : The KeywordTagging component will take the TEXT tracks from the\n previous TIKA TEXT and TESSERACT OCR actions and perform keyword tagging. This will add the TAGS \n , TRIGGER_WORDS , and TRIGGER_WORDS_OFFSET properties to each track. 
The action has the\n IS_ANNOTATOR property set to TRUE .\n OCV GENERIC MARKUP DERIVATIVE MEDIA ONLY ACTION : The Markup component will take the keyword-tagged TEXT tracks for\n the derivative media and draw bounding boxes on the extracted images.",
"title": "Text Detection Example"
},
{
- "location": "/Derivative-Media-Guide/index.html#task-merging",
- "text": "The large blue rectangles in the diagram represent tasks that are merged together. The purpose of task merging is to\nconsolidate how tracks are represented in the JSON output object by hiding redundant track information, and to make it\nappear that the behaviors of two or more actions are the result of a single algorithm. For example, keyword tagging behavior is supplemental to the text detection behavior. It's more important that TEXT \ntracks are associated with the algorithm that performed text detection than the KEYWORDTAGGING algorithm. Note that in\nour pipeline only the KEYWORD TAGGING action has the OUTPUT_MERGE_WITH_PREVIOUS_TASK property set to TRUE . It has\na similar effect in the source media flow and derivative media flow. In the source media flow the TIKA TEXT action is at the start of the merge chain while the KEYWORD TAGGING action is\nat the end of the merge chain. The tracks generated by the action at the end of the merge chain inherit the algorithm\nand track type from the tracks at the beginning of the merge chain. The effect is that in the JSON output object the\ntracks from the TIKA TEXT action will not be shown. Instead that action will be listed under TRACKS MERGED . The\ntracks from the KEYWORD TAGGING action will be shown with the TIKATEXT algorithm and TEXT track type: \"output\": {\n \"TRACKS MERGED\": [\n {\n \"source\": \"+#TIKA IMAGE DETECTION ACTION#TIKA TEXT DETECTION SOURCE MEDIA ONLY ACTION\",\n \"algorithm\": \"TIKATEXT\"\n }\n ],\n \"MEDIA\": [\n {\n \"source\": \"+#TIKA IMAGE DETECTION ACTION\",\n \"algorithm\": \"TIKAIMAGE\",\n \"tracks\": [ ... ]\n }\n ],\n \"TEXT\": [\n {\n \"source\": \"+#TIKA IMAGE DETECTION ACTION#TIKA TEXT DETECTION SOURCE MEDIA ONLY ACTION#KEYWORD TAGGING (WITH FF REGION) ACTION\",\n \"algorithm\": \"TIKATEXT\",\n \"tracks\": [ ... 
]\n }\n ]\n} In the derivative media flow the TESSERACT OCR action is at the start of the merge chain while the KEYWORD TAGGING \naction is at the end of the merge chain. The effect is that in the JSON output object the tracks from\nthe TESSERACT OCR action will not be shown. The tracks from the KEYWORD TAGGING action will be shown with\nthe TESSERACTOCR algorithm and TEXT track type: \"output\": {\n \"NO TRACKS\": [\n {\n \"source\": \"+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION#TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION#KEYWORD TAGGING (WITH FF REGION) ACTION#OCV GENERIC MARKUP DERIVATIVE MEDIA ONLY ACTION\",\n \"algorithm\": \"MARKUPCV\"\n }\n ],\n \"TRACKS MERGED\": [\n {\n \"source\": \"+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION#TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION\",\n \"algorithm\": \"TESSERACTOCR\"\n }\n ],\n \"TEXT\": [\n {\n \"source\": \"+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION#TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION#KEYWORD TAGGING (WITH FF REGION) ACTION\",\n \"algorithm\": \"TESSERACTOCR\",\n \"tracks\": [ ... ]\n }\n ],\n \"TEXT REGION\": [\n {\n \"source\": \"+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION\",\n \"algorithm\": \"EAST\",\n \"tracks\": [ ... ]\n }\n ]\n} Note that a MARKUP action will never generate new tracks. It simply fills out the media.markupResult field in the\nJSON output object (not shown above).",
- "title": "Task Merging"
- },
- {
- "location": "/Derivative-Media-Guide/index.html#output-last-task-only",
- "text": "If you want to omit all tracks from the JSON output object but the respective TEXT tracks for the source and\nderivative media, then in you can also set the OUTPUT_LAST_TASK_ONLY job property to TRUE . Note that the WFM only\nconsiders tasks that use DETECTION algorithms as the final task, so MARKUP is ignored. Setting this property will\nresult in the following JSON for the source media: \"output\": {\n \"TRACKS SUPPRESSED\": [\n {\n \"source\": \"+#TIKA IMAGE DETECTION ACTION\",\n \"algorithm\": \"TIKAIMAGE\"\n },\n {\n \"source\": \"+#TIKA IMAGE DETECTION ACTION#TIKA TEXT DETECTION SOURCE MEDIA ONLY ACTION\",\n \"algorithm\": \"TIKATEXT\"\n }\n ],\n \"TEXT\": [\n {\n \"source\": \"+#TIKA IMAGE DETECTION ACTION#TIKA TEXT DETECTION SOURCE MEDIA ONLY ACTION#KEYWORD TAGGING (WITH FF REGION) ACTION\",\n \"algorithm\": \"TIKATEXT\", \n \"tracks\": [ ... ]\n }\n ]\n} And the following JSON for the derivative media: \"output\": {\n \"NO TRACKS\": [\n {\n \"source\": \"+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION#TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION#KEYWORD TAGGING (WITH FF REGION) ACTION#OCV GENERIC MARKUP DERIVATIVE MEDIA ONLY ACTION\",\n \"algorithm\": \"MARKUPCV\"\n }\n ],\n \"TRACKS SUPPRESSED\": [\n {\n \"source\": \"+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION\",\n \"algorithm\": \"EAST\"\n },\n {\n \"source\": \"+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION#TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION\",\n \"algorithm\": \"TESSERACTOCR\"\n }\n ],\n \"TEXT\": [\n {\n \"source\": \"+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION#TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION#KEYWORD TAGGING (WITH FF REGION) ACTION\",\n \"algorithm\": \"TESSERACTOCR\",\n \"tracks\": [ ... ]\n }\n ]\n}",
- "title": "Output Last Task Only"
+ "location": "/Derivative-Media-Guide/index.html#annotators",
+ "text": "When a pipeline does not use derivative media, an action with IS_ANNOTATOR=true always annotates\nthe action immediately preceding it. When a pipeline uses derivative media, an action with IS_ANNOTATOR=true annotates the last action that was applicable to the media type. In the example\nabove, the KEYWORD TAGGING action has IS_ANNOTATOR=true . When determining which action KEYWORD TAGGING annotates in the source media flow, the TESSERACT OCR and EAST actions are considered, but are not selected because neither applies to\nthe source media. The TIKA TEXT action is considered and then selected because it applies to the\nsource media. Below is example output for the source media. The tracks contained in the TEXT \nsection will include the properties added by the TIKA TEXT action and the properties added by the KEYWORD TAGGING action. {\n \"output\": {\n \"MEDIA\": [\n {\n \"action\": \"TIKA IMAGE DETECTION ACTION\",\n \"algorithm\": \"TIKAIMAGE\",\n \"annotators\": [],\n \"tracks\": [\"...\"]\n }\n ],\n \"TEXT\": [\n {\n \"action\": \"TIKA TEXT DETECTION SOURCE MEDIA ONLY ACTION\",\n \"algorithm\": \"TIKATEXT\",\n \"annotators\": [\"KEYWORD TAGGING (WITH FF REGION) ACTION\"],\n \"tracks\": [\"...\"]\n }\n ]\n }\n} When determining which action KEYWORD TAGGING annotates in the derivative media flow, TESSERACT OCR is selected because it is the first action before KEYWORD TAGGING that applies to\nderivative media. Below is example output for the derivative media. The tracks contained in the TEXT section will include the properties added by the TESSERACT OCR action and the properties\nadded by the KEYWORD TAGGING action. 
{\n \"output\": {\n \"NO TRACKS\": [\n {\n \"action\": \"OCV GENERIC MARKUP DERIVATIVE MEDIA ONLY ACTION\",\n \"algorithm\": \"MARKUPCV\",\n \"annotators\": []\n }\n ],\n \"TEXT\": [\n {\n \"action\": \"TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION\",\n \"algorithm\": \"TESSERACTOCR\",\n \"annotators\": [\"KEYWORD TAGGING (WITH FF REGION) ACTION\"],\n \"tracks\": [\"...\"]\n }\n ],\n \"TEXT REGION\": [\n {\n \"action\": \"EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION\",\n \"algorithm\": \"EAST\",\n \"annotators\": [],\n \"tracks\": [\"...\"]\n }\n ]\n }\n} Note that a MARKUP action will never generate new tracks. It simply fills out the media.markupResult field in the JSON output object (not shown above).",
+ "title": "Annotators"
},
{
"location": "/Derivative-Media-Guide/index.html#developing-media-extraction-components",
@@ -617,7 +612,7 @@
},
{
"location": "/Artifact-Extraction-Guide/index.html",
- "text": "NOTICE:\n This software (or technical data) was produced for the U.S. Government under contract, and is subject to the\nRights in Data-General Clause 52.227-14, Alt. IV (DEC 2007). Copyright 2024 The MITRE Corporation. All Rights Reserved.\n\n\nIntroduction\n\n\nArtifact extraction is an optional behavior of OpenMPF that allows the user to save artifacts from a job onto disk. An\nartifact is a frame region extracted from a piece of media. Extracting artifacts gives you a way to visualize representative\ndetections from the tracks found in a piece of media. For example, you might want to extract an artifact for the\nexemplar in all tracks found in a piece of media. The exemplar for the track is the detection in the track that has the\nhighest value for the detection property chosen with the \nQUALITY_SELECTION_PROPERTY\n. (Refer to the \nQuality Selection Guide\n for documentation on quality selection.) In another scenario, you might want to extract\nartifacts for the exemplar as well as a few other frames that come before and after the exemplar.\n\n\nThe Workflow Manager performs artifact extraction after all detection processing for a job is complete. Artifacts can be\nwritten to local shared storage, or to a remote S3 storage location. Refer to the \nObject Storage Guide\n for information on using object storage.\n\n\nArtifact Extraction Properties\n\n\nThe choice of which artifacts to extract is highly configurable using the following properties.\n\n\n\n\nARTIFACT_EXTRACTION_POLICY\n: This property sets the high level policy controlling artifact extraction. It must have\none of the following values:\n\n\nNONE\n: No artifact extraction will be performed.\n\n\nVISUAL_TYPES_ONLY\n: Extract artifacts only for tracks associated with a \"visual\" data type. 
Visual data types\n include \nIMAGE\n and \nVIDEO\n.\n\n\nALL_TYPES\n: Extract artifacts regardless of data type.\n\n\nALL_DETECTIONS\n: Extract artifacts for all detections in the track.\n\n\n\n\n\n\n\n\nThe default value is \nVISUAL_TYPES_ONLY\n, which turns off artifact extraction for data types such as \nMOTION\n,\n\nSPEECH\n, \nSCENE\n, and \nSOUND\n. [\nNOTE:\n Artifact extraction for anything other that \nIMAGE\n or \nVIDEO\n is not currently\nsupported and will result in an error for the job.]\n\n\nWith the \nVISUAL_TYPES_ONLY\n or \nALL_TYPES\n policy, artifacts will be extracted according to the\n\nARTIFACT_EXTRACTION_POLICY_*\n properties described below. With the \nNONE\n and \nALL_DETECTIONS\n policies, these\nproperties are ignored.\n\n\n\n\n\n\nARTIFACT_EXTRACTION_POLICY_CROPPING\n: When set to true, causes the extracted artifact to\nbe cropped to the width and height of the bounding box of the detection, instead of extracting the entire frame.\nDefault value is \ntrue\n.\n\n\n\n\n\n\nARTIFACT_EXTRACTION_POLICY_EXEMPLAR_FRAME_PLUS\n: This property may be set to an integer value N, which causes\nthe exemplar frame and N frames before and after the exemplar to be extracted. If N = 0, then only\nthe exemplar will be extracted. If N > 0, then the exemplar plus N frames before and after it will be extracted.\nIf N < 0, then this property is disabled. The default value is 0.\n\n\n\n\n\n\nARTIFACT_EXTRACTION_POLICY_FIRST_FRAME\n: When set to true, then detections in the first frame in each track will\nbe extracted. The default value is \nfalse\n.\n\n\n\n\n\n\nARTIFACT_EXTRACTION_POLICY_MIDDLE_FRAME\n: When set to true, then detections in the frame closest to the middle of\neach track will be extracted. The middle frame is the frame that is equally distant from the start and stop frames,\nbut that frame does not necessarily contain a detection in a given track, so we search for the detection in the track\nthat is closest to that middle frame. 
The default value is \nfalse\n.\n\n\n\n\n\n\nARTIFACT_EXTRACTION_POLICY_LAST_FRAME\n: When set to true, then detections in the last frame in each track will\nbe extracted. The default value is \nfalse\n.\n\n\n\n\n\n\nARTIFACT_EXTRACTION_POLICY_TOP_QUALITY_COUNT\n: When this property is set to an integer value N greater than 0\nthe detections in a track will be sorted by the detection property given by the \nQUALITY_SELECTION_PROPERTY\n job\nproperty, and then the N detections with the highest quality will be extracted, up to the number of available\ndetections. If N is less than or equal to 0, then this policy is disabled. The default value is 0. (Refer to the \nQuality Selection Guide\n for documentation on quality selection.)\n\n\n\n\n\n\nARTIFACT_EXTRACTION_POLICY_BEST_DETECTION_PROP_NAMES_LIST\n: This property may be set to a string comprised of a\nsemi-colon delimited list of strings. The strings in this list define the detection property names to be used for\nartifact extraction. If a detection in a track has a property that corresponds to any of the names in this list,\nan artifact will be created for it. For example, you might have a component that finds the detection in each track that has the largest size (width x height), and adds a property to that detection named \nBEST_SIZE\n. You could then set this property to the string \nBEST_SIZE\n and artifacts for those detections will be extracted along with all others requested. If the string is empty, then this property is disabled. The default value is\nthe empty string.\n\n\n\n\n\n\nCombining Properties\n\n\nThe above properties can be combined to satisfy a set of criteria. For example, if \nARTIFACT_EXTRACTION_POLICY_FIRST_FRAME\n is set to \ntrue\n and \nARTIFACT_EXTRACTION_POLICY_MIDDLE_FRAME\n is set to \ntrue\n, then the first and middle frames will be extracted. 
If \nARTIFACT_EXTRACTION_POLICY_CROPPING\n is also set to true then the detection crops for the first and middle frames will be extracted instead of the whole frames. If \nARTIFACT_EXTRACTION_POLICY_TOP_QUALITY_COUNT\n is also set to 5, then the above still applies, and the 5 detection crops in a track with the highest quality values will also be extracted. Note that the top 5 may already include the first and middle frames. An artifact will only ever be extracted once, even if it is chosen according to more than one of the artifact extraction policies.\n\n\nArtifact Cropping and Rotation\n\n\nIf the \nARTIFACT_EXTRACTION_POLICY_CROPPING\n job property is set to true, then the bounding box in the detection object is used to define the cropping. Here is an example showing an image where two detections were found. The two detections in the frame are illustrated with the bounding boxes added by markup. (Refer to the \nMarkup Guide\n for documentation on how markup is added to images and videos.) The cropped artifacts are also shown below. Notice that the detection on the left in the image is rotated, but the cropped artifact has had the rotation removed.\n\n\nImage with Markup\n\n\n\n\nLeft Detection\n\n\n\n\nRight Detection",
+ "text": "NOTICE:\n This software (or technical data) was produced for the U.S. Government under contract, and is subject to the\nRights in Data-General Clause 52.227-14, Alt. IV (DEC 2007). Copyright 2024 The MITRE Corporation. All Rights Reserved.\n\n\nIntroduction\n\n\nArtifact extraction is an optional behavior of OpenMPF that allows the user to save artifacts from a job onto disk. An\nartifact is a frame region extracted from a piece of media. Extracting artifacts gives you a way to visualize representative\ndetections from the tracks found in a piece of media. For example, you might want to extract an artifact for the\nexemplar in all tracks found in a piece of media. The exemplar for the track is the detection in the track that has the\nhighest value for the detection property chosen with the \nQUALITY_SELECTION_PROPERTY\n. (Refer to the \nQuality Selection Guide\n for documentation on quality selection.) In another scenario, you might want to extract\nartifacts for the exemplar as well as a few other frames that come before and after the exemplar.\n\n\nThe Workflow Manager performs artifact extraction after all detection processing for a job is complete. Artifacts can be\nwritten to local shared storage, or to a remote S3 storage location. Refer to the \nObject Storage Guide\n for information on using object storage.\n\n\nArtifact Extraction Properties\n\n\nThe choice of which artifacts to extract is highly configurable using the following properties.\n\n\n\n\n\n\nSUPPRESS_TRACKS\n: When an action has this property set to \ntrue\n, no artifacts for that action\n will be extracted and none of the other properties listed below will have any effect.\n\n\n\n\n\n\nARTIFACT_EXTRACTION_POLICY\n: This property sets the high level policy controlling artifact extraction. 
It must have\none of the following values:\n\n\n\n\nNONE\n: No artifact extraction will be performed.\n\n\nVISUAL_TYPES_ONLY\n: Extract artifacts only for tracks associated with a \"visual\" data type. Visual data types\n include \nIMAGE\n and \nVIDEO\n.\n\n\nALL_TYPES\n: Extract artifacts regardless of data type.\n\n\nALL_DETECTIONS\n: Extract artifacts for all detections in the track.\n\n\n\n\n\n\n\n\nThe default value is \nVISUAL_TYPES_ONLY\n, which turns off artifact extraction for data types such as \nMOTION\n,\n\nSPEECH\n, \nSCENE\n, and \nSOUND\n. [\nNOTE:\n Artifact extraction for anything other than \nIMAGE\n or \nVIDEO\n is not currently\nsupported and will result in an error for the job.]\n\n\nWith the \nVISUAL_TYPES_ONLY\n or \nALL_TYPES\n policy, artifacts will be extracted according to the\n\nARTIFACT_EXTRACTION_POLICY_*\n properties described below. With the \nNONE\n and \nALL_DETECTIONS\n policies, these\nproperties are ignored.\n\n\n\n\n\n\nARTIFACT_EXTRACTION_POLICY_CROPPING\n: When set to true, causes the extracted artifact to\nbe cropped to the width and height of the bounding box of the detection, instead of extracting the entire frame.\nDefault value is \ntrue\n.\n\n\n\n\n\n\nARTIFACT_EXTRACTION_POLICY_EXEMPLAR_FRAME_PLUS\n: This property may be set to an integer value N, which causes\nthe exemplar frame and N frames before and after the exemplar to be extracted. If N = 0, then only\nthe exemplar will be extracted. If N > 0, then the exemplar plus N frames before and after it will be extracted.\nIf N < 0, then this property is disabled. The default value is 0.\n\n\n\n\n\n\nARTIFACT_EXTRACTION_POLICY_FIRST_FRAME\n: When set to true, then detections in the first frame in each track will\nbe extracted. The default value is \nfalse\n.\n\n\n\n\n\n\nARTIFACT_EXTRACTION_POLICY_MIDDLE_FRAME\n: When set to true, then detections in the frame closest to the middle of\neach track will be extracted. 
The middle frame is the frame that is equally distant from the start and stop frames,\nbut that frame does not necessarily contain a detection in a given track, so we search for the detection in the track\nthat is closest to that middle frame. The default value is \nfalse\n.\n\n\n\n\n\n\nARTIFACT_EXTRACTION_POLICY_LAST_FRAME\n: When set to true, then detections in the last frame in each track will\nbe extracted. The default value is \nfalse\n.\n\n\n\n\n\n\nARTIFACT_EXTRACTION_POLICY_TOP_QUALITY_COUNT\n: When this property is set to an integer value N greater than 0\nthe detections in a track will be sorted by the detection property given by the \nQUALITY_SELECTION_PROPERTY\n job\nproperty, and then the N detections with the highest quality will be extracted, up to the number of available\ndetections. If N is less than or equal to 0, then this policy is disabled. The default value is 0. (Refer to the \nQuality Selection Guide\n for documentation on quality selection.)\n\n\n\n\n\n\nARTIFACT_EXTRACTION_POLICY_BEST_DETECTION_PROP_NAMES_LIST\n: This property may be set to a string comprised of a\nsemi-colon delimited list of strings. The strings in this list define the detection property names to be used for\nartifact extraction. If a detection in a track has a property that corresponds to any of the names in this list,\nan artifact will be created for it. For example, you might have a component that finds the detection in each track that has the largest size (width x height), and adds a property to that detection named \nBEST_SIZE\n. You could then set this property to the string \nBEST_SIZE\n and artifacts for those detections will be extracted along with all others requested. If the string is empty, then this property is disabled. The default value is\nthe empty string.\n\n\n\n\n\n\nCombining Properties\n\n\nThe above properties can be combined to satisfy a set of criteria. 
For example, if \nARTIFACT_EXTRACTION_POLICY_FIRST_FRAME\n is set to \ntrue\n and \nARTIFACT_EXTRACTION_POLICY_MIDDLE_FRAME\n is set to \ntrue\n, then the first and middle frames will be extracted. If \nARTIFACT_EXTRACTION_POLICY_CROPPING\n is also set to true then the detection crops for the first and middle frames will be extracted instead of the whole frames. If \nARTIFACT_EXTRACTION_POLICY_TOP_QUALITY_COUNT\n is also set to 5, then the above still applies, and the 5 detection crops in a track with the highest quality values will also be extracted. Note that the top 5 may already include the first and middle frames. An artifact will only ever be extracted once, even if it is chosen according to more than one of the artifact extraction policies.\n\n\nArtifact Cropping and Rotation\n\n\nIf the \nARTIFACT_EXTRACTION_POLICY_CROPPING\n job property is set to true, then the bounding box in the detection object is used to define the cropping. Here is an example showing an image where two detections were found. The two detections in the frame are illustrated with the bounding boxes added by markup. (Refer to the \nMarkup Guide\n for documentation on how markup is added to images and videos.) The cropped artifacts are also shown below. Notice that the detection on the left in the image is rotated, but the cropped artifact has had the rotation removed.\n\n\nImage with Markup\n\n\n\n\nLeft Detection\n\n\n\n\nRight Detection",
"title": "Artifact Extraction Guide"
},
{
@@ -627,7 +622,7 @@
},
{
"location": "/Artifact-Extraction-Guide/index.html#artifact-extraction-properties",
- "text": "The choice of which artifacts to extract is highly configurable using the following properties. ARTIFACT_EXTRACTION_POLICY : This property sets the high level policy controlling artifact extraction. It must have\none of the following values: NONE : No artifact extraction will be performed. VISUAL_TYPES_ONLY : Extract artifacts only for tracks associated with a \"visual\" data type. Visual data types\n include IMAGE and VIDEO . ALL_TYPES : Extract artifacts regardless of data type. ALL_DETECTIONS : Extract artifacts for all detections in the track. The default value is VISUAL_TYPES_ONLY , which turns off artifact extraction for data types such as MOTION , SPEECH , SCENE , and SOUND . [ NOTE: Artifact extraction for anything other that IMAGE or VIDEO is not currently\nsupported and will result in an error for the job.] With the VISUAL_TYPES_ONLY or ALL_TYPES policy, artifacts will be extracted according to the ARTIFACT_EXTRACTION_POLICY_* properties described below. With the NONE and ALL_DETECTIONS policies, these\nproperties are ignored. ARTIFACT_EXTRACTION_POLICY_CROPPING : When set to true, causes the extracted artifact to\nbe cropped to the width and height of the bounding box of the detection, instead of extracting the entire frame.\nDefault value is true . ARTIFACT_EXTRACTION_POLICY_EXEMPLAR_FRAME_PLUS : This property may be set to an integer value N, which causes\nthe exemplar frame and N frames before and after the exemplar to be extracted. If N = 0, then only\nthe exemplar will be extracted. If N > 0, then the exemplar plus N frames before and after it will be extracted.\nIf N < 0, then this property is disabled. The default value is 0. ARTIFACT_EXTRACTION_POLICY_FIRST_FRAME : When set to true, then detections in the first frame in each track will\nbe extracted. The default value is false . ARTIFACT_EXTRACTION_POLICY_MIDDLE_FRAME : When set to true, then detections in the frame closest to the middle of\neach track will be extracted. 
The middle frame is the frame that is equally distant from the start and stop frames,\nbut that frame does not necessarily contain a detection in a given track, so we search for the detection in the track\nthat is closest to that middle frame. The default value is false . ARTIFACT_EXTRACTION_POLICY_LAST_FRAME : When set to true, then detections in the last frame in each track will\nbe extracted. The default value is false . ARTIFACT_EXTRACTION_POLICY_TOP_QUALITY_COUNT : When this property is set to an integer value N greater than 0\nthe detections in a track will be sorted by the detection property given by the QUALITY_SELECTION_PROPERTY job\nproperty, and then the N detections with the highest quality will be extracted, up to the number of available\ndetections. If N is less than or equal to 0, then this policy is disabled. The default value is 0. (Refer to the Quality Selection Guide for documentation on quality selection.) ARTIFACT_EXTRACTION_POLICY_BEST_DETECTION_PROP_NAMES_LIST : This property may be set to a string comprised of a\nsemi-colon delimited list of strings. The strings in this list define the detection property names to be used for\nartifact extraction. If a detection in a track has a property that corresponds to any of the names in this list,\nan artifact will be created for it. For example, you might have a component that finds the detection in each track that has the largest size (width x height), and adds a property to that detection named BEST_SIZE . You could then set this property to the string BEST_SIZE and artifacts for those detections will be extracted along with all others requested. If the string is empty, then this property is disabled. The default value is\nthe empty string.",
+ "text": "The choice of which artifacts to extract is highly configurable using the following properties. SUPPRESS_TRACKS : When an action has this property set to true , no artifacts for that action\n will be extracted and none of the other properties listed below will have any effect. ARTIFACT_EXTRACTION_POLICY : This property sets the high level policy controlling artifact extraction. It must have\none of the following values: NONE : No artifact extraction will be performed. VISUAL_TYPES_ONLY : Extract artifacts only for tracks associated with a \"visual\" data type. Visual data types\n include IMAGE and VIDEO . ALL_TYPES : Extract artifacts regardless of data type. ALL_DETECTIONS : Extract artifacts for all detections in the track. The default value is VISUAL_TYPES_ONLY , which turns off artifact extraction for data types such as MOTION , SPEECH , SCENE , and SOUND . [ NOTE: Artifact extraction for anything other than IMAGE or VIDEO is not currently\nsupported and will result in an error for the job.] With the VISUAL_TYPES_ONLY or ALL_TYPES policy, artifacts will be extracted according to the ARTIFACT_EXTRACTION_POLICY_* properties described below. With the NONE and ALL_DETECTIONS policies, these\nproperties are ignored. ARTIFACT_EXTRACTION_POLICY_CROPPING : When set to true, causes the extracted artifact to\nbe cropped to the width and height of the bounding box of the detection, instead of extracting the entire frame.\nDefault value is true . ARTIFACT_EXTRACTION_POLICY_EXEMPLAR_FRAME_PLUS : This property may be set to an integer value N, which causes\nthe exemplar frame and N frames before and after the exemplar to be extracted. If N = 0, then only\nthe exemplar will be extracted. If N > 0, then the exemplar plus N frames before and after it will be extracted.\nIf N < 0, then this property is disabled. The default value is 0. ARTIFACT_EXTRACTION_POLICY_FIRST_FRAME : When set to true, then detections in the first frame in each track will\nbe extracted. 
The default value is false . ARTIFACT_EXTRACTION_POLICY_MIDDLE_FRAME : When set to true, then detections in the frame closest to the middle of\neach track will be extracted. The middle frame is the frame that is equally distant from the start and stop frames,\nbut that frame does not necessarily contain a detection in a given track, so we search for the detection in the track\nthat is closest to that middle frame. The default value is false . ARTIFACT_EXTRACTION_POLICY_LAST_FRAME : When set to true, then detections in the last frame in each track will\nbe extracted. The default value is false . ARTIFACT_EXTRACTION_POLICY_TOP_QUALITY_COUNT : When this property is set to an integer value N greater than 0\nthe detections in a track will be sorted by the detection property given by the QUALITY_SELECTION_PROPERTY job\nproperty, and then the N detections with the highest quality will be extracted, up to the number of available\ndetections. If N is less than or equal to 0, then this policy is disabled. The default value is 0. (Refer to the Quality Selection Guide for documentation on quality selection.) ARTIFACT_EXTRACTION_POLICY_BEST_DETECTION_PROP_NAMES_LIST : This property may be set to a string comprised of a\nsemi-colon delimited list of strings. The strings in this list define the detection property names to be used for\nartifact extraction. If a detection in a track has a property that corresponds to any of the names in this list,\nan artifact will be created for it. For example, you might have a component that finds the detection in each track that has the largest size (width x height), and adds a property to that detection named BEST_SIZE . You could then set this property to the string BEST_SIZE and artifacts for those detections will be extracted along with all others requested. If the string is empty, then this property is disabled. The default value is\nthe empty string.",
"title": "Artifact Extraction Properties"
},
{
@@ -677,7 +672,7 @@
},
{
"location": "/Media-Selectors-Guide/index.html",
- "text": "NOTICE:\n This software (or technical data) was produced for the U.S. Government under contract,\nand is subject to the Rights in Data-General Clause 52.227-14, Alt. IV (DEC 2007). Copyright 2025\nThe MITRE Corporation. All Rights Reserved.\n\n\nMedia Selectors Overview\n\n\nMedia selectors allow users to specify that only specific sections of a document should be\nprocessed. A copy of the input file with the specified sections replaced by component output is\nproduced.\n\n\nNew Job Request Fields\n\n\nBelow is an example of a job that uses \nJSON_PATH\n media selectors. The job uses a two-stage\npipeline. The first stage performs language identification. The second performs translation.\n\n\n{\n \"algorithmProperties\": {},\n \"buildOutput\": true,\n \"jobProperties\": {},\n \"media\": [\n {\n \"mediaUri\": \"file:///opt/mpf/share/remote-media/test-json-path-translation.json\",\n \"properties\": {},\n \"mediaSelectorsOutputAction\": \"ARGOS TRANSLATION (WITH FF REGION AND NO TASK MERGING) ACTION\",\n \"mediaSelectors\": [\n {\n \"type\": \"JSON_PATH\",\n \"expression\": \"$.spanishMessages.*.content\",\n \"resultDetectionProperty\": \"TRANSLATION\",\n \"selectionProperties\": {}\n },\n {\n \"type\": \"JSON_PATH\",\n \"expression\": \"$.chineseMessages.*.content\",\n \"resultDetectionProperty\": \"TRANSLATION\",\n \"selectionProperties\": {}\n }\n ]\n }\n ],\n \"pipelineName\": \"ARGOS TRANSLATION (WITH FASTTEXT LANGUAGE ID) TEXT FILE PIPELINE\",\n \"priority\": 4\n}\n\n\n\n\n\n$.media.*.mediaSelectorsOutputAction\n: Name of the action that produces content for the media\n selectors output file. 
In the above example, we specify that we want the translated content\n from Argos in the media selectors output file rather than the detected language from the first\n stage.\n\n\n$.media.*.mediaSelectors\n: List of media selectors that will be used for the media.\n\n\n$.media.*.mediaSelectors.*.type\n: The name of the \ntype of media selector\n\n that is used in the \nexpression\n field.\n\n\n$.media.*.mediaSelectors.*.expression\n: A string specifying the sections of the document that\n should be processed. The \ntype\n field specifies the syntax of the expression.\n\n\n$.media.*.mediaSelectors.*.resultDetectionProperty\n: A detection property name from tracks\n produced by the \nmediaSelectorsOutputAction\n. The media selectors output document will be\n populated with the content of the specified property.\n\n\n$.media.*.mediaSelectors.*.selectionProperties\n: Job properties that will only be used for\n sub-jobs created for a specific media selector. For example, when performing Argos translation\n on a JSON file in a single-stage pipeline without an upstream language detection stage, this\n could set \nDEFAULT_SOURCE_LANGUAGE=es\n for some media selectors and\n \nDEFAULT_SOURCE_LANGUAGE=zh\n for others.\n\n\n\n\nNew Job Properties\n\n\n\n\n\n\nMEDIA_SELECTORS_DELIMETER\n: When not provided and a job uses media selectors, the selected parts\n of the document will be replaced with the action output. When provided, the selected parts of\n the document will contain the original content, followed by the value of this property, and\n finally the action output.\n\n\n\n\n\n\nMEDIA_SELECTORS_DUPLICATE_POLICY\n: Specifies how to handle the case where a job uses media\n selectors and there are multiple outputs for a single selection. When set to \nLONGEST\n, the\n longer of the two outputs is chosen and the shorter one is discarded. When set to \nERROR\n,\n duplicates are considered an error. 
When set to \nJOIN\n, the duplicates are combined using\n \n|\n as a delimiter.\n\n\n\n\n\n\nMEDIA_SELECTORS_NO_MATCHES_IS_ERROR\n: When true and a job uses media selectors, an error will be\n generated when none of the selectors match content from the media.\n\n\n\n\n\n\nMedia Selector Types\n\n\nJSON_PATH\n and \nCSV_COLS\n are currently supported.\n\n\nJSON_PATH\n\n\nUsed to extract content for JSON files. Uses the \"Jayway JsonPath\" library to parse the expressions.\nThe specific syntax supported is available on their\n\nGitHub page\n. JsonPath\nexpressions are case-sensitive.\n\n\nWhen extracting content from the document, only strings, arrays, and objects are considered. All\nother JSON types are ignored. When the JsonPath expression matches an array, each element is\nrecursively explored. When the expression matches an object, keys are left unchanged and each value\nof the object is recursively explored.\n\n\nJSON_PATH Matching Example\n\n\n{\n \"key1\": [\"a\", \"b\", \"c\"],\n \"key2\": {\n \"key3\": [\n {\n \"key4\": [\"d\", \"e\"],\n \"key5\": [\"f\", \"g\"],\n \"key6\": 6\n }\n ]\n }\n}\n\n\n\n\n\n\n\n\n\nExpression\n\n\nMatches\n\n\n\n\n\n\n\n\n\n\n$\n\n\na, b, c, d, e, f, g\n\n\n\n\n\n\n$.*\n\n\na, b, c, d, e, f, g\n\n\n\n\n\n\n$.key1\n\n\na, b, c\n\n\n\n\n\n\n$.key1[0]\n\n\na\n\n\n\n\n\n\n$.key2\n\n\nd, e, f, g\n\n\n\n\n\n\n$.key2.key3\n\n\nd, e, f, g\n\n\n\n\n\n\n$.key2.key3.*.key4\n\n\nd, e\n\n\n\n\n\n\n$.key2.key3.*.*[0]\n\n\nd, f\n\n\n\n\n\n\n\n\nJSON_PATH Output File\n\n\nWhen media selectors are used, the JsonOutputObject will contain a URI referencing the file\nlocation in the \n$.media.*.mediaSelectorsOutputUri\n field.\n\n\nFor example, consider that the \nmediaUri\n from the job in the\n\nNew Job Request Fields section\n refers to the document below.\n\n\n{\n \"otherStuffKey\": [\"other stuff value\"],\n \"spanishMessages\": [\n {\n \"to\": \"spanish recipient 1\",\n \"from\": \"spanish sender 1\",\n \"content\": \"\u00bfHola, c\u00f3mo 
est\u00e1s?\"\n },\n {\n \"to\": \"spanish recipient 2\",\n \"from\": \"spanish sender 2\",\n \"content\": \"\u00bfD\u00f3nde est\u00e1 la biblioteca?\"\n }\n ],\n \"chineseMessages\": [\n {\n \"to\": \"chinese recipient 1\",\n \"from\": \"chinese sender 1\",\n \"content\": \"\u73b0\u5728\u662f\u51e0\u594c\uff1f\"\n },\n {\n \"to\": \"chinese recipient 2\",\n \"from\": \"chinese sender 2\",\n \"content\": \"\u4f60\u53eb\u4ec0\u4e48\u540d\u5b57\uff1f\"\n },\n {\n \"to\": \"chinese recipient 3\",\n \"from\": \"chinese sender 3\",\n \"content\": \"\u4f60\u5728\u54ea\u91cc\uff1f\"\n }\n ]\n}\n\n\n\nThe \nmediaSelectorsOutputUri\n field will refer to a document containing the content below.\n\n\n{\n \"otherStuffKey\": [\"other stuff value\"],\n \"spanishMessages\": [\n {\n \"to\": \"spanish recipient 1\",\n \"from\": \"spanish sender 1\",\n \"content\": \"Hello, how are you?\"\n },\n {\n \"to\": \"spanish recipient 2\",\n \"from\": \"spanish sender 2\",\n \"content\": \"Where is the library?\"\n }\n ],\n \"chineseMessages\": [\n {\n \"to\": \"chinese recipient 1\",\n \"from\": \"chinese sender 1\",\n \"content\": \"What time is it?\"\n },\n {\n \"to\": \"chinese recipient 2\",\n \"from\": \"chinese sender 2\",\n \"content\": \"What is your name?\"\n },\n {\n \"to\": \"chinese recipient 3\",\n \"from\": \"chinese sender 3\",\n \"content\": \"Where are you?\"\n }\n ]\n}\n\n\n\nIf \nMEDIA_SELECTORS_DELIMETER\n was set to \" | Translation: \", the file would contain the content\nbelow.\n\n\n{\n \"otherStuffKey\": [\"other stuff value\"],\n \"spanishMessages\": [\n {\n \"to\": \"spanish recipient 1\",\n \"from\": \"spanish sender 1\",\n \"content\": \"\u00bfHola, c\u00f3mo est\u00e1s? | Translation: Hello, how are you?\"\n },\n {\n \"to\": \"spanish recipient 2\",\n \"from\": \"spanish sender 2\",\n \"content\": \"\u00bfD\u00f3nde est\u00e1 la biblioteca? 
| Translation: Where is the library?\"\n }\n ],\n \"chineseMessages\": [\n {\n \"to\": \"chinese recipient 1\",\n \"from\": \"chinese sender 1\",\n \"content\": \"\u73b0\u5728\u662f\u51e0\u594c\uff1f | Translation: What time is it?\"\n },\n {\n \"to\": \"chinese recipient 2\",\n \"from\": \"chinese sender 2\",\n \"content\": \"\u4f60\u53eb\u4ec0\u4e48\u540d\u5b57\uff1f | Translation: What is your name?\"\n },\n {\n \"to\": \"chinese recipient 3\",\n \"from\": \"chinese sender 3\",\n \"content\": \"\u4f60\u5728\u54ea\u91cc\uff1f | Translation: Where are you?\"\n }\n ]\n}\n\n\n\nCSV_COLS\n\n\nUsed to extract content from specific columns of a CSV file. The expression itself must be a\nsingle row of CSV listing the columns to extract. The \nCSV_SELECTORS_ARE_INDICES\n job property\ncontrols whether the entries refer to column names or zero-based integer indices.\n\n\nCSV-Specific Job Properties\n\n\n\n\n\n\nCSV_SELECTORS_ARE_INDICES\n: When \nFALSE\n (the default), the selector expression must contain\n column names. When \nTRUE\n the selector should contain the zero-based integer indices of the\n columns that should be processed.\n\n\n\n\n\n\nCSV_CSV_FIRST_ROW_IS_DATA\n: When \nFALSE\n (the default), the first row is considered headers and\n will not be processed. When \nTRUE\n, the first row is considered data and the first row will be\n processed.\n\n\n\n\n\n\nAn issue when processing CSV is that sometimes the first row is considered headers (a.k.a. 
column\nnames) and in others the first row is actually data and there are no headers.\n\n\n\n\n\n\nIn the default configuration (\nCSV_SELECTORS_ARE_INDICES\n = \nFALSE\n and\n \nCSV_FIRST_ROW_IS_DATA\n = \nFALSE\n), the selector expression refers to column names and the\n first row is not processed as data.\n\n\n\n\n\n\nIf the first row is actual data, and you want to specify the columns by index instead of by\n name in the selector expression, set \nCSV_SELECTORS_ARE_INDICES\n = \nTRUE\n and\n \nCSV_FIRST_ROW_IS_DATA\n = \nTRUE\n.\n\n\n\n\n\n\nIf the first row is headers, and you want to specify the columns by index instead of by\n name in the selector expression, set \nCSV_SELECTORS_ARE_INDICES\n = \nTRUE\n and\n \nCSV_FIRST_ROW_IS_DATA\n = \nFALSE\n.\n\n\n\n\n\n\nCSV_COLS Matching Example\n\n\nThe table below shows combinations of values for \nCSV_SELECTORS_ARE_INDICES\n and\n\nCSV_FIRST_ROW_IS_DATA\n when matched against this CSV content:\n\n\nheader0,header1,\"header,2\"\na,b,c\nd,e,f,g\n\n\n\n\n\n\n\n\n\nExpression\n\n\nCSV_SELECTORS_\nARE_INDICES\n\n\nCSV_FIRST_ROW_\nIS_DATA\n\n\nMatches\n\n\n\n\n\n\n\n\n\n\nheader0,\"header,2\"\n\n\nFALSE\n\n\nFALSE\n\n\na, c, d, f\n\n\n\n\n\n\nheader0,\"header,2\"\n\n\nFALSE\n\n\nTRUE\n\n\nheader0, \"header,2\", a, c, d, f\n\n\n\n\n\n\nheader0,headerX\n\n\nFALSE\n\n\nTRUE / FALSE\n\n\nError\n: \"headerX\" does not exist\n\n\n\n\n\n\nheader0,header,2\n\n\nFALSE\n\n\nTRUE / FALSE\n\n\nError\n: \"header\" and \"2\" do not exist\n\n\n\n\n\n\nheader0,\"header,2\"\n\n\nTRUE\n\n\nTRUE / FALSE\n\n\nError\n: The expression contains non-integers.\n\n\n\n\n\n\n0,2\n\n\nTRUE\n\n\nFALSE\n\n\na, c, d, f\n\n\n\n\n\n\n0,2\n\n\nTRUE\n\n\nTRUE\n\n\nheader0, \"header,2\", a, c, d, f\n\n\n\n\n\n\n0,3,4\n\n\nTRUE\n\n\nFALSE\n\n\na, d, g\n\n\n\n\n\n\n0,2\n\n\nFALSE\n\n\nTRUE / FALSE\n\n\nError\n: There are no columns with \"0\" or \"2\" as the header.\n\n\n\n\n\n\n\n\nCSV Text Encodings\n\n\nWe recommend submitting UTF-8-encoded CSV files, but 
we do attempt to recognize other text\nencodings. When attempting to determine the input file encoding, the Workflow Manager will inspect\nthe first 12,000 bytes of the file. If all of the 12,000 bytes are valid UTF-8 bytes, then the\nWorkflow Manager will treat the file as UTF-8. Otherwise, the Workflow Manager will use\n\nTika's \nCharsetDetector\n\nto determine the encoding.\n\n\nThe media selectors output file will always be UTF-8-encoded. If the input file was UTF-8-encoded\nand had a byte-order mark, then a byte-order mark will be added to the output file.\n\n\nByte-order mark\n\n\nThe UTF-8, UTF-16, and UTF-32 text encodings may have a byte-order mark present. The byte-order\nmark is the Unicode character named \"ZERO WIDTH NO-BREAK SPACE\" with a code point of U+FEFF. Each\nencoding will encode it as different bytes. For example, in UTF-8 it is encoded with three bytes:\n\n0xEF\n, \n0xBB\n, \n0xBF\n.\n\n\nMany CSV parsers do not have special handling for the byte-order mark. They just treat it as a\nnormal character and consider it to be the first character in the first cell. The Workflow Manager\ndiscards the byte-order mark before parsing the CSV.\n\n\nExcel\n\n\n\n\n\n\nIf you open a CSV file in Microsoft Excel and the text is garbled, you should open the file\n in a text editor that supports UTF-8 and see if the text is garbled there too.\n\n\n\n\n\n\nWhen saving a CSV file from Excel, if you select \"CSV (Comma delimited)(*.csv)\", Excel will\n silently replace East Asian characters with question marks. Selecting\n \"CSV UTF-8 (Comma delimited) (.csv)\" preserves the East Asian characters, but it adds a\n byte-order mark to the file.\n\n\n\n\n\n\nIf you open a UTF-8-encoded file in Excel, it will treat it as ISO-8859-1 unless the file has\n a UTF-8 byte-order mark.\n\n\n\n\n\n\nUsing a byte-order mark with UTF-8 is uncommon because the UTF-8 encoding does not have endianess\nlike UTF-16 and UTF-32. 
The byte-order mark added by Excel can cause problems because a\nlot of software does not expect it to be present.\n\n\nAs an example, consider an Excel spreadsheet with the following content:\n\n\n\n\n\n\n\n\nCol,1\n\n\nCol,2\n\n\n\n\n\n\n\n\n\n\nitem1\n\n\nitem2\n\n\n\n\n\n\n\n\nIf you save that as \"CSV UTF-8 (Comma delimited) (.csv)\" and then \ncat\n the file you will get:\n\n\n\"Col,1\",\"Col,2\"\nitem1,item2\n\n\n\nSince the two column names contain commas, the cells need to be escaped with quotation marks so that\nthe internal comma is not interpreted as a record separator. If you parse that file with\n\nPython's built-in CSV parser\n you get the following\nresult:\n\n\n\n\n\n\n\n\n\ufeff\"Col\n\n\n1\"\n\n\nCol,2\n\n\n\n\n\n\n\n\n\n\nitem1\n\n\nitem2\n\n\n\n\n\n\n\n\n\n\nWhile the first cell above appears to be four characters in length, it is actually five. The first\ncharacter is U+FEFF (ZERO WIDTH NO-BREAK SPACE). Since the first cell starts with\nU+FEFF (ZERO WIDTH NO-BREAK SPACE), rather than U+0022 (QUOTATION MARK), the first comma is\nunescaped so it is interpreted as the record separator.\n\n\nSince the byte-order mark is invisible when rendered, simply printing the content will not reveal\nthe issue. It is visible in a hexdump produced with \nhexdump -C\n on Linux or \nFormat-Hex\n in\nWindows Powershell. In the output from \nhexdump -C\n below, you can see that there is content before\nthe quotation mark and that content matches the UTF-8-encoded byte-order mark of \n0xEF\n, \n0xBB\n,\n\n0xBF\n.\n\n\n00000000 ef bb bf 22 43 6f 6c 2c 31 22 2c 22 43 6f 6c 2c |...\"Col,1\",\"Col,|\n00000010 32 22 0d 0a 69 74 65 6d 31 2c 69 74 65 6d 32 0d |2\"..item1,item2.|\n00000020 0a |.|\n\n\n\nCSV_COLS Output File\n\n\nWhen media selectors are used, a copy of the input file with the specified sections replaced by\ncomponent output is produced. 
The URI to the file will be present in the\n\n$.media.*.mediaSelectorsOutputUri\n field.\n\n\nBelow is an example of a job that uses \nCSV_COLS\n media selectors. The job uses a two-stage\npipeline. The first stage performs language identification. The second performs translation.\n\n\n{\n \"algorithmProperties\": {},\n \"buildOutput\": true,\n \"jobProperties\": {},\n \"media\": [\n {\n \"mediaUri\": \"file:///opt/mpf/share/remote-media/test-csv-translation.csv\",\n \"properties\": {},\n \"mediaSelectorsOutputAction\": \"ARGOS TRANSLATION (WITH FF REGION AND NO TASK MERGING) ACTION\",\n \"mediaSelectors\": [\n {\n \"type\": \"CSV_COLS\",\n \"expression\": \"Spanish\",\n \"resultDetectionProperty\": \"TRANSLATION\",\n \"selectionProperties\": {}\n },\n {\n \"type\": \"CSV_COLS\",\n \"expression\": \"Chinese\",\n \"resultDetectionProperty\": \"TRANSLATION\",\n \"selectionProperties\": {}\n }\n ]\n }\n ],\n \"pipelineName\": \"ARGOS TRANSLATION (WITH FASTTEXT LANGUAGE ID) TEXT FILE PIPELINE\",\n \"priority\": 4\n}\n\n\n\nThe input file, \ntest-csv-translation.csv\n, contains the content below.\n\n\nEnglish,Spanish,Chinese\n\"Hello, how are you?\",\"\u00bfHola, c\u00f3mo est\u00e1s?\",\u4f60\u597d\u5417\uff1f\nWhere is the library?,\u00bfD\u00f3nde est\u00e1 la biblioteca?,\u56fe\u4e66\u9986\u5728\u54ea\u91cc\uff1f\nWhat time is it?,\u00bfQu\u00e9 hora es?,\u73b0\u5728\u662f\u51e0\u594c\uff1f\n\n\n\nThe \nmediaSelectorsOutputUri\n field from the output object will refer to a document containing the\ncontent below.\n\n\nEnglish,Spanish,Chinese\n\"Hello, how are you?\",\"Hello, how are you?\",How are you?\nWhere is the library?,Where's the library?,Where's the library?\nWhat time is it?,What time is it?,What time is it?\n\n\n\nIf \nMEDIA_SELECTORS_DELIMETER\n was set to \" | Translation: \", the file would contain the content\nbelow.\n\n\nEnglish,Spanish,Chinese\n\"Hello, how are you?\",\"\u00bfHola, c\u00f3mo est\u00e1s? 
| Translation: Hello, how are you?\",\u4f60\u597d\u5417\uff1f | Translation: How are you?\nWhere is the library?,\u00bfD\u00f3nde est\u00e1 la biblioteca? | Translation: Where's the library?,\u56fe\u4e66\u9986\u5728\u54ea\u91cc\uff1f | Translation: Where's the library?\nWhat time is it?,\u00bfQu\u00e9 hora es? | Translation: What time is it?,\u73b0\u5728\u662f\u51e0\u594c\uff1f | Translation: What time is it?",
+ "text": "NOTICE:\n This software (or technical data) was produced for the U.S. Government under contract,\nand is subject to the Rights in Data-General Clause 52.227-14, Alt. IV (DEC 2007). Copyright 2025\nThe MITRE Corporation. All Rights Reserved.\n\n\nMedia Selectors Overview\n\n\nMedia selectors allow users to specify that only specific sections of a document should be\nprocessed. A copy of the input file with the specified sections replaced by component output is\nproduced.\n\n\nNew Job Request Fields\n\n\nBelow is an example of a job that uses \nJSON_PATH\n media selectors. The job uses a two-stage\npipeline. The first stage performs language identification. The second performs translation.\n\n\n{\n \"algorithmProperties\": {},\n \"buildOutput\": true,\n \"jobProperties\": {},\n \"media\": [\n {\n \"mediaUri\": \"file:///opt/mpf/share/remote-media/test-json-path-translation.json\",\n \"properties\": {},\n \"mediaSelectorsOutputAction\": \"ARGOS TRANSLATION (WITH FF REGION AND NOT ANNOTATOR) ACTION\",\n \"mediaSelectors\": [\n {\n \"type\": \"JSON_PATH\",\n \"expression\": \"$.spanishMessages.*.content\",\n \"resultDetectionProperty\": \"TRANSLATION\",\n \"selectionProperties\": {}\n },\n {\n \"type\": \"JSON_PATH\",\n \"expression\": \"$.chineseMessages.*.content\",\n \"resultDetectionProperty\": \"TRANSLATION\",\n \"selectionProperties\": {}\n }\n ]\n }\n ],\n \"pipelineName\": \"ARGOS TRANSLATION (WITH FASTTEXT LANGUAGE ID) TEXT FILE PIPELINE\",\n \"priority\": 4\n}\n\n\n\n\n\n$.media.*.mediaSelectorsOutputAction\n: Name of the action that produces content for the media\n selectors output file. 
In the above example, we specify that we want the translated content\n from Argos in the media selectors output file rather than the detected language from the first\n stage.\n\n\n$.media.*.mediaSelectors\n: List of media selectors that will be used for the media.\n\n\n$.media.*.mediaSelectors.*.type\n: The name of the \ntype of media selector\n\n that is used in the \nexpression\n field.\n\n\n$.media.*.mediaSelectors.*.expression\n: A string specifying the sections of the document that\n should be processed. The \ntype\n field specifies the syntax of the expression.\n\n\n$.media.*.mediaSelectors.*.resultDetectionProperty\n: A detection property name from tracks\n produced by the \nmediaSelectorsOutputAction\n. The media selectors output document will be\n populated with the content of the specified property.\n\n\n$.media.*.mediaSelectors.*.selectionProperties\n: Job properties that will only be used for\n sub-jobs created for a specific media selector. For example, when performing Argos translation\n on a JSON file in a single-stage pipeline without an upstream language detection stage, this\n could set \nDEFAULT_SOURCE_LANGUAGE=es\n for some media selectors and\n \nDEFAULT_SOURCE_LANGUAGE=zh\n for others.\n\n\n\n\nNew Job Properties\n\n\n\n\n\n\nMEDIA_SELECTORS_DELIMETER\n: When not provided and a job uses media selectors, the selected parts\n of the document will be replaced with the action output. When provided, the selected parts of\n the document will contain the original content, followed by the value of this property, and\n finally the action output.\n\n\n\n\n\n\nMEDIA_SELECTORS_DUPLICATE_POLICY\n: Specifies how to handle the case where a job uses media\n selectors and there are multiple outputs for a single selection. When set to \nLONGEST\n, the\n longer of the two outputs is chosen and the shorter one is discarded. When set to \nERROR\n,\n duplicates are considered an error. 
When set to \nJOIN\n, the duplicates are combined using\n \n|\n as a delimiter.\n\n\n\n\n\n\nMEDIA_SELECTORS_NO_MATCHES_IS_ERROR\n: When true and a job uses media selectors, an error will be\n generated when none of the selectors match content from the media.\n\n\n\n\n\n\nMedia Selector Types\n\n\nJSON_PATH\n and \nCSV_COLS\n are currently supported.\n\n\nJSON_PATH\n\n\nUsed to extract content from JSON files. Uses the \"Jayway JsonPath\" library to parse the expressions.\nThe specific syntax supported is available on their\n\nGitHub page\n. JsonPath\nexpressions are case-sensitive.\n\n\nWhen extracting content from the document, only strings, arrays, and objects are considered. All\nother JSON types are ignored. When the JsonPath expression matches an array, each element is\nrecursively explored. When the expression matches an object, keys are left unchanged and each value\nof the object is recursively explored.\n\n\nJSON_PATH Matching Example\n\n\n{\n \"key1\": [\"a\", \"b\", \"c\"],\n \"key2\": {\n \"key3\": [\n {\n \"key4\": [\"d\", \"e\"],\n \"key5\": [\"f\", \"g\"],\n \"key6\": 6\n }\n ]\n }\n}\n\n\n\n\n\n\n\n\n\nExpression\n\n\nMatches\n\n\n\n\n\n\n\n\n\n\n$\n\n\na, b, c, d, e, f, g\n\n\n\n\n\n\n$.*\n\n\na, b, c, d, e, f, g\n\n\n\n\n\n\n$.key1\n\n\na, b, c\n\n\n\n\n\n\n$.key1[0]\n\n\na\n\n\n\n\n\n\n$.key2\n\n\nd, e, f, g\n\n\n\n\n\n\n$.key2.key3\n\n\nd, e, f, g\n\n\n\n\n\n\n$.key2.key3.*.key4\n\n\nd, e\n\n\n\n\n\n\n$.key2.key3.*.*[0]\n\n\nd, f\n\n\n\n\n\n\n\n\nJSON_PATH Output File\n\n\nWhen media selectors are used, the JsonOutputObject will contain a URI referencing the file\nlocation in the \n$.media.*.mediaSelectorsOutputUri\n field.\n\n\nFor example, consider that the \nmediaUri\n from the job in the\n\nNew Job Request Fields section\n refers to the document below.\n\n\n{\n \"otherStuffKey\": [\"other stuff value\"],\n \"spanishMessages\": [\n {\n \"to\": \"spanish recipient 1\",\n \"from\": \"spanish sender 1\",\n \"content\": \"\u00bfHola, c\u00f3mo 
est\u00e1s?\"\n },\n {\n \"to\": \"spanish recipient 2\",\n \"from\": \"spanish sender 2\",\n \"content\": \"\u00bfD\u00f3nde est\u00e1 la biblioteca?\"\n }\n ],\n \"chineseMessages\": [\n {\n \"to\": \"chinese recipient 1\",\n \"from\": \"chinese sender 1\",\n \"content\": \"\u73b0\u5728\u662f\u51e0\u594c\uff1f\"\n },\n {\n \"to\": \"chinese recipient 2\",\n \"from\": \"chinese sender 2\",\n \"content\": \"\u4f60\u53eb\u4ec0\u4e48\u540d\u5b57\uff1f\"\n },\n {\n \"to\": \"chinese recipient 3\",\n \"from\": \"chinese sender 3\",\n \"content\": \"\u4f60\u5728\u54ea\u91cc\uff1f\"\n }\n ]\n}\n\n\n\nThe \nmediaSelectorsOutputUri\n field will refer to a document containing the content below.\n\n\n{\n \"otherStuffKey\": [\"other stuff value\"],\n \"spanishMessages\": [\n {\n \"to\": \"spanish recipient 1\",\n \"from\": \"spanish sender 1\",\n \"content\": \"Hello, how are you?\"\n },\n {\n \"to\": \"spanish recipient 2\",\n \"from\": \"spanish sender 2\",\n \"content\": \"Where is the library?\"\n }\n ],\n \"chineseMessages\": [\n {\n \"to\": \"chinese recipient 1\",\n \"from\": \"chinese sender 1\",\n \"content\": \"What time is it?\"\n },\n {\n \"to\": \"chinese recipient 2\",\n \"from\": \"chinese sender 2\",\n \"content\": \"What is your name?\"\n },\n {\n \"to\": \"chinese recipient 3\",\n \"from\": \"chinese sender 3\",\n \"content\": \"Where are you?\"\n }\n ]\n}\n\n\n\nIf \nMEDIA_SELECTORS_DELIMETER\n was set to \" | Translation: \", the file would contain the content\nbelow.\n\n\n{\n \"otherStuffKey\": [\"other stuff value\"],\n \"spanishMessages\": [\n {\n \"to\": \"spanish recipient 1\",\n \"from\": \"spanish sender 1\",\n \"content\": \"\u00bfHola, c\u00f3mo est\u00e1s? | Translation: Hello, how are you?\"\n },\n {\n \"to\": \"spanish recipient 2\",\n \"from\": \"spanish sender 2\",\n \"content\": \"\u00bfD\u00f3nde est\u00e1 la biblioteca? 
| Translation: Where is the library?\"\n }\n ],\n \"chineseMessages\": [\n {\n \"to\": \"chinese recipient 1\",\n \"from\": \"chinese sender 1\",\n \"content\": \"\u73b0\u5728\u662f\u51e0\u594c\uff1f | Translation: What time is it?\"\n },\n {\n \"to\": \"chinese recipient 2\",\n \"from\": \"chinese sender 2\",\n \"content\": \"\u4f60\u53eb\u4ec0\u4e48\u540d\u5b57\uff1f | Translation: What is your name?\"\n },\n {\n \"to\": \"chinese recipient 3\",\n \"from\": \"chinese sender 3\",\n \"content\": \"\u4f60\u5728\u54ea\u91cc\uff1f | Translation: Where are you?\"\n }\n ]\n}\n\n\n\nCSV_COLS\n\n\nUsed to extract content from specific columns of a CSV file. The expression itself must be a\nsingle row of CSV listing the columns to extract. The \nCSV_SELECTORS_ARE_INDICES\n job property\ncontrols whether the entries refer to column names or zero-based integer indices.\n\n\nCSV-Specific Job Properties\n\n\n\n\n\n\nCSV_SELECTORS_ARE_INDICES\n: When \nFALSE\n (the default), the selector expression must contain\n column names. When \nTRUE\n, the selector should contain the zero-based integer indices of the\n columns that should be processed.\n\n\n\n\n\n\nCSV_FIRST_ROW_IS_DATA\n: When \nFALSE\n (the default), the first row is considered headers and\n will not be processed. When \nTRUE\n, the first row is considered data and the first row will be\n processed.\n\n\n\n\n\n\nAn issue when processing CSV is that sometimes the first row is considered headers (a.k.a. 
column\nnames) and in others the first row is actually data and there are no headers.\n\n\n\n\n\n\nIn the default configuration (\nCSV_SELECTORS_ARE_INDICES\n = \nFALSE\n and\n \nCSV_FIRST_ROW_IS_DATA\n = \nFALSE\n), the selector expression refers to column names and the\n first row is not processed as data.\n\n\n\n\n\n\nIf the first row is actual data, and you want to specify the columns by index instead of by\n name in the selector expression, set \nCSV_SELECTORS_ARE_INDICES\n = \nTRUE\n and\n \nCSV_FIRST_ROW_IS_DATA\n = \nTRUE\n.\n\n\n\n\n\n\nIf the first row is headers, and you want to specify the columns by index instead of by\n name in the selector expression, set \nCSV_SELECTORS_ARE_INDICES\n = \nTRUE\n and\n \nCSV_FIRST_ROW_IS_DATA\n = \nFALSE\n.\n\n\n\n\n\n\nCSV_COLS Matching Example\n\n\nThe table below shows combinations of values for \nCSV_SELECTORS_ARE_INDICES\n and\n\nCSV_FIRST_ROW_IS_DATA\n when matched against this CSV content:\n\n\nheader0,header1,\"header,2\"\na,b,c\nd,e,f,g\n\n\n\n\n\n\n\n\n\nExpression\n\n\nCSV_SELECTORS_\nARE_INDICES\n\n\nCSV_FIRST_ROW_\nIS_DATA\n\n\nMatches\n\n\n\n\n\n\n\n\n\n\nheader0,\"header,2\"\n\n\nFALSE\n\n\nFALSE\n\n\na, c, d, f\n\n\n\n\n\n\nheader0,\"header,2\"\n\n\nFALSE\n\n\nTRUE\n\n\nheader0, \"header,2\", a, c, d, f\n\n\n\n\n\n\nheader0,headerX\n\n\nFALSE\n\n\nTRUE / FALSE\n\n\nError\n: \"headerX\" does not exist\n\n\n\n\n\n\nheader0,header,2\n\n\nFALSE\n\n\nTRUE / FALSE\n\n\nError\n: \"header\" and \"2\" do not exist\n\n\n\n\n\n\nheader0,\"header,2\"\n\n\nTRUE\n\n\nTRUE / FALSE\n\n\nError\n: The expression contains non-integers.\n\n\n\n\n\n\n0,2\n\n\nTRUE\n\n\nFALSE\n\n\na, c, d, f\n\n\n\n\n\n\n0,2\n\n\nTRUE\n\n\nTRUE\n\n\nheader0, \"header,2\", a, c, d, f\n\n\n\n\n\n\n0,3,4\n\n\nTRUE\n\n\nFALSE\n\n\na, d, g\n\n\n\n\n\n\n0,2\n\n\nFALSE\n\n\nTRUE / FALSE\n\n\nError\n: There are no columns with \"0\" or \"2\" as the header.\n\n\n\n\n\n\n\n\nCSV Text Encodings\n\n\nWe recommend submitting UTF-8-encoded CSV files, but 
we do attempt to recognize other text\nencodings. When attempting to determine the input file encoding, the Workflow Manager will inspect\nthe first 12,000 bytes of the file. If all of the 12,000 bytes are valid UTF-8 bytes, then the\nWorkflow Manager will treat the file as UTF-8. Otherwise, the Workflow Manager will use\n\nTika's \nCharsetDetector\n\nto determine the encoding.\n\n\nThe media selectors output file will always be UTF-8-encoded. If the input file was UTF-8-encoded\nand had a byte-order mark, then a byte-order mark will be added to the output file.\n\n\nByte-order mark\n\n\nThe UTF-8, UTF-16, and UTF-32 text encodings may have a byte-order mark present. The byte-order\nmark is the Unicode character named \"ZERO WIDTH NO-BREAK SPACE\" with a code point of U+FEFF. Each\nencoding will encode it as different bytes. For example, in UTF-8 it is encoded with three bytes:\n\n0xEF\n, \n0xBB\n, \n0xBF\n.\n\n\nMany CSV parsers do not have special handling for the byte-order mark. They just treat it as a\nnormal character and consider it to be the first character in the first cell. The Workflow Manager\ndiscards the byte-order mark before parsing the CSV.\n\n\nExcel\n\n\n\n\n\n\nIf you open a CSV file in Microsoft Excel and the text is garbled, you should open the file\n in a text editor that supports UTF-8 and see if the text is garbled there too.\n\n\n\n\n\n\nWhen saving a CSV file from Excel, if you select \"CSV (Comma delimited)(*.csv)\", Excel will\n silently replace East Asian characters with question marks. Selecting\n \"CSV UTF-8 (Comma delimited) (.csv)\" preserves the East Asian characters, but it adds a\n byte-order mark to the file.\n\n\n\n\n\n\nIf you open a UTF-8-encoded file in Excel, it will treat it as ISO-8859-1 unless the file has\n a UTF-8 byte-order mark.\n\n\n\n\n\n\nUsing a byte-order mark with UTF-8 is uncommon because the UTF-8 encoding does not have endianness\nlike UTF-16 and UTF-32. 
The byte-order mark added by Excel can cause problems because a\nlot of software does not expect it to be present.\n\n\nAs an example, consider an Excel spreadsheet with the following content:\n\n\n\n\n\n\n\n\nCol,1\n\n\nCol,2\n\n\n\n\n\n\n\n\n\n\nitem1\n\n\nitem2\n\n\n\n\n\n\n\n\nIf you save that as \"CSV UTF-8 (Comma delimited) (.csv)\" and then \ncat\n the file you will get:\n\n\n\"Col,1\",\"Col,2\"\nitem1,item2\n\n\n\nSince the two column names contain commas, the cells need to be escaped with quotation marks so that\nthe internal comma is not interpreted as a record separator. If you parse that file with\n\nPython's built-in CSV parser\n you get the following\nresult:\n\n\n\n\n\n\n\n\n\ufeff\"Col\n\n\n1\"\n\n\nCol,2\n\n\n\n\n\n\n\n\n\n\nitem1\n\n\nitem2\n\n\n\n\n\n\n\n\n\n\nWhile the first cell above appears to be four characters in length, it is actually five. The first\ncharacter is U+FEFF (ZERO WIDTH NO-BREAK SPACE). Since the first cell starts with\nU+FEFF (ZERO WIDTH NO-BREAK SPACE), rather than U+0022 (QUOTATION MARK), the first comma is\nunescaped so it is interpreted as the record separator.\n\n\nSince the byte-order mark is invisible when rendered, simply printing the content will not reveal\nthe issue. It is visible in a hexdump produced with \nhexdump -C\n on Linux or \nFormat-Hex\n in\nWindows PowerShell. In the output from \nhexdump -C\n below, you can see that there is content before\nthe quotation mark and that content matches the UTF-8-encoded byte-order mark of \n0xEF\n, \n0xBB\n,\n\n0xBF\n.\n\n\n00000000 ef bb bf 22 43 6f 6c 2c 31 22 2c 22 43 6f 6c 2c |...\"Col,1\",\"Col,|\n00000010 32 22 0d 0a 69 74 65 6d 31 2c 69 74 65 6d 32 0d |2\"..item1,item2.|\n00000020 0a |.|\n\n\n\nCSV_COLS Output File\n\n\nWhen media selectors are used, a copy of the input file with the specified sections replaced by\ncomponent output is produced. 
The URI to the file will be present in the\n\n$.media.*.mediaSelectorsOutputUri\n field.\n\n\nBelow is an example of a job that uses \nCSV_COLS\n media selectors. The job uses a two-stage\npipeline. The first stage performs language identification. The second performs translation.\n\n\n{\n \"algorithmProperties\": {},\n \"buildOutput\": true,\n \"jobProperties\": {},\n \"media\": [\n {\n \"mediaUri\": \"file:///opt/mpf/share/remote-media/test-csv-translation.csv\",\n \"properties\": {},\n \"mediaSelectorsOutputAction\": \"ARGOS TRANSLATION (WITH FF REGION AND NOT ANNOTATOR) ACTION\",\n \"mediaSelectors\": [\n {\n \"type\": \"CSV_COLS\",\n \"expression\": \"Spanish\",\n \"resultDetectionProperty\": \"TRANSLATION\",\n \"selectionProperties\": {}\n },\n {\n \"type\": \"CSV_COLS\",\n \"expression\": \"Chinese\",\n \"resultDetectionProperty\": \"TRANSLATION\",\n \"selectionProperties\": {}\n }\n ]\n }\n ],\n \"pipelineName\": \"ARGOS TRANSLATION (WITH FASTTEXT LANGUAGE ID) TEXT FILE PIPELINE\",\n \"priority\": 4\n}\n\n\n\nThe input file, \ntest-csv-translation.csv\n, contains the content below.\n\n\nEnglish,Spanish,Chinese\n\"Hello, how are you?\",\"\u00bfHola, c\u00f3mo est\u00e1s?\",\u4f60\u597d\u5417\uff1f\nWhere is the library?,\u00bfD\u00f3nde est\u00e1 la biblioteca?,\u56fe\u4e66\u9986\u5728\u54ea\u91cc\uff1f\nWhat time is it?,\u00bfQu\u00e9 hora es?,\u73b0\u5728\u662f\u51e0\u594c\uff1f\n\n\n\nThe \nmediaSelectorsOutputUri\n field from the output object will refer to a document containing the\ncontent below.\n\n\nEnglish,Spanish,Chinese\n\"Hello, how are you?\",\"Hello, how are you?\",How are you?\nWhere is the library?,Where's the library?,Where's the library?\nWhat time is it?,What time is it?,What time is it?\n\n\n\nIf \nMEDIA_SELECTORS_DELIMETER\n was set to \" | Translation: \", the file would contain the content\nbelow.\n\n\nEnglish,Spanish,Chinese\n\"Hello, how are you?\",\"\u00bfHola, c\u00f3mo est\u00e1s? 
| Translation: Hello, how are you?\",\u4f60\u597d\u5417\uff1f | Translation: How are you?\nWhere is the library?,\u00bfD\u00f3nde est\u00e1 la biblioteca? | Translation: Where's the library?,\u56fe\u4e66\u9986\u5728\u54ea\u91cc\uff1f | Translation: Where's the library?\nWhat time is it?,\u00bfQu\u00e9 hora es? | Translation: What time is it?,\u73b0\u5728\u662f\u51e0\u594c\uff1f | Translation: What time is it?",
"title": "Media Selectors Guide"
},
{
@@ -687,7 +682,7 @@
},
{
"location": "/Media-Selectors-Guide/index.html#new-job-request-fields",
- "text": "Below is an example of a job that uses JSON_PATH media selectors. The job uses a two-stage\npipeline. The first stage performs language identification. The second performs translation. {\n \"algorithmProperties\": {},\n \"buildOutput\": true,\n \"jobProperties\": {},\n \"media\": [\n {\n \"mediaUri\": \"file:///opt/mpf/share/remote-media/test-json-path-translation.json\",\n \"properties\": {},\n \"mediaSelectorsOutputAction\": \"ARGOS TRANSLATION (WITH FF REGION AND NO TASK MERGING) ACTION\",\n \"mediaSelectors\": [\n {\n \"type\": \"JSON_PATH\",\n \"expression\": \"$.spanishMessages.*.content\",\n \"resultDetectionProperty\": \"TRANSLATION\",\n \"selectionProperties\": {}\n },\n {\n \"type\": \"JSON_PATH\",\n \"expression\": \"$.chineseMessages.*.content\",\n \"resultDetectionProperty\": \"TRANSLATION\",\n \"selectionProperties\": {}\n }\n ]\n }\n ],\n \"pipelineName\": \"ARGOS TRANSLATION (WITH FASTTEXT LANGUAGE ID) TEXT FILE PIPELINE\",\n \"priority\": 4\n} $.media.*.mediaSelectorsOutputAction : Name of the action that produces content for the media\n selectors output file. In the above example, we specify that we want the translated content\n from Argos in the media selectors output file rather than the detected language from the first\n stage. $.media.*.mediaSelectors : List of media selectors that will be used for the media. $.media.*.mediaSelectors.*.type : The name of the type of media selector \n that is used in the expression field. $.media.*.mediaSelectors.*.expression : A string specifying the sections of the document that\n should be processed. The type field specifies the syntax of the expression. $.media.*.mediaSelectors.*.resultDetectionProperty : A detection property name from tracks\n produced by the mediaSelectorsOutputAction . The media selectors output document will be\n populated with the content of the specified property. 
$.media.*.mediaSelectors.*.selectionProperties : Job properties that will only be used for\n sub-jobs created for a specific media selector. For example, when performing Argos translation\n on a JSON file in a single-stage pipeline without an upstream language detection stage, this\n could set DEFAULT_SOURCE_LANGUAGE=es for some media selectors and\n DEFAULT_SOURCE_LANGUAGE=zh for others.",
+ "text": "Below is an example of a job that uses JSON_PATH media selectors. The job uses a two-stage\npipeline. The first stage performs language identification. The second performs translation. {\n \"algorithmProperties\": {},\n \"buildOutput\": true,\n \"jobProperties\": {},\n \"media\": [\n {\n \"mediaUri\": \"file:///opt/mpf/share/remote-media/test-json-path-translation.json\",\n \"properties\": {},\n \"mediaSelectorsOutputAction\": \"ARGOS TRANSLATION (WITH FF REGION AND NOT ANNOTATOR) ACTION\",\n \"mediaSelectors\": [\n {\n \"type\": \"JSON_PATH\",\n \"expression\": \"$.spanishMessages.*.content\",\n \"resultDetectionProperty\": \"TRANSLATION\",\n \"selectionProperties\": {}\n },\n {\n \"type\": \"JSON_PATH\",\n \"expression\": \"$.chineseMessages.*.content\",\n \"resultDetectionProperty\": \"TRANSLATION\",\n \"selectionProperties\": {}\n }\n ]\n }\n ],\n \"pipelineName\": \"ARGOS TRANSLATION (WITH FASTTEXT LANGUAGE ID) TEXT FILE PIPELINE\",\n \"priority\": 4\n} $.media.*.mediaSelectorsOutputAction : Name of the action that produces content for the media\n selectors output file. In the above example, we specify that we want the translated content\n from Argos in the media selectors output file rather than the detected language from the first\n stage. $.media.*.mediaSelectors : List of media selectors that will be used for the media. $.media.*.mediaSelectors.*.type : The name of the type of media selector \n that is used in the expression field. $.media.*.mediaSelectors.*.expression : A string specifying the sections of the document that\n should be processed. The type field specifies the syntax of the expression. $.media.*.mediaSelectors.*.resultDetectionProperty : A detection property name from tracks\n produced by the mediaSelectorsOutputAction . The media selectors output document will be\n populated with the content of the specified property. 
$.media.*.mediaSelectors.*.selectionProperties : Job properties that will only be used for\n sub-jobs created for a specific media selector. For example, when performing Argos translation\n on a JSON file in a single-stage pipeline without an upstream language detection stage, this\n could set DEFAULT_SOURCE_LANGUAGE=es for some media selectors and\n DEFAULT_SOURCE_LANGUAGE=zh for others.",
"title": "New Job Request Fields"
},
{
@@ -747,7 +742,7 @@
},
{
"location": "/Media-Selectors-Guide/index.html#csv_cols-output-file",
- "text": "When media selectors are used, a copy of the input file with the specified sections replaced by\ncomponent output is produced. The URI to the file will be present in the $.media.*.mediaSelectorsOutputUri field. Below is an example of a job that uses CSV_COLS media selectors. The job uses a two-stage\npipeline. The first stage performs language identification. The second performs translation. {\n \"algorithmProperties\": {},\n \"buildOutput\": true,\n \"jobProperties\": {},\n \"media\": [\n {\n \"mediaUri\": \"file:///opt/mpf/share/remote-media/test-csv-translation.csv\",\n \"properties\": {},\n \"mediaSelectorsOutputAction\": \"ARGOS TRANSLATION (WITH FF REGION AND NO TASK MERGING) ACTION\",\n \"mediaSelectors\": [\n {\n \"type\": \"CSV_COLS\",\n \"expression\": \"Spanish\",\n \"resultDetectionProperty\": \"TRANSLATION\",\n \"selectionProperties\": {}\n },\n {\n \"type\": \"CSV_COLS\",\n \"expression\": \"Chinese\",\n \"resultDetectionProperty\": \"TRANSLATION\",\n \"selectionProperties\": {}\n }\n ]\n }\n ],\n \"pipelineName\": \"ARGOS TRANSLATION (WITH FASTTEXT LANGUAGE ID) TEXT FILE PIPELINE\",\n \"priority\": 4\n} The input file, test-csv-translation.csv , contains the content below. English,Spanish,Chinese\n\"Hello, how are you?\",\"\u00bfHola, c\u00f3mo est\u00e1s?\",\u4f60\u597d\u5417\uff1f\nWhere is the library?,\u00bfD\u00f3nde est\u00e1 la biblioteca?,\u56fe\u4e66\u9986\u5728\u54ea\u91cc\uff1f\nWhat time is it?,\u00bfQu\u00e9 hora es?,\u73b0\u5728\u662f\u51e0\u594c\uff1f The mediaSelectorsOutputUri field from the output object will refer to a document containing the\ncontent below. English,Spanish,Chinese\n\"Hello, how are you?\",\"Hello, how are you?\",How are you?\nWhere is the library?,Where's the library?,Where's the library?\nWhat time is it?,What time is it?,What time is it? If MEDIA_SELECTORS_DELIMETER was set to \" | Translation: \", the file would contain the content\nbelow. 
English,Spanish,Chinese\n\"Hello, how are you?\",\"\u00bfHola, c\u00f3mo est\u00e1s? | Translation: Hello, how are you?\",\u4f60\u597d\u5417\uff1f | Translation: How are you?\nWhere is the library?,\u00bfD\u00f3nde est\u00e1 la biblioteca? | Translation: Where's the library?,\u56fe\u4e66\u9986\u5728\u54ea\u91cc\uff1f | Translation: Where's the library?\nWhat time is it?,\u00bfQu\u00e9 hora es? | Translation: What time is it?,\u73b0\u5728\u662f\u51e0\u594c\uff1f | Translation: What time is it?",
+ "text": "When media selectors are used, a copy of the input file with the specified sections replaced by\ncomponent output is produced. The URI to the file will be present in the $.media.*.mediaSelectorsOutputUri field. Below is an example of a job that uses CSV_COLS media selectors. The job uses a two-stage\npipeline. The first stage performs language identification. The second performs translation. {\n \"algorithmProperties\": {},\n \"buildOutput\": true,\n \"jobProperties\": {},\n \"media\": [\n {\n \"mediaUri\": \"file:///opt/mpf/share/remote-media/test-csv-translation.csv\",\n \"properties\": {},\n \"mediaSelectorsOutputAction\": \"ARGOS TRANSLATION (WITH FF REGION AND NOT ANNOTATOR) ACTION\",\n \"mediaSelectors\": [\n {\n \"type\": \"CSV_COLS\",\n \"expression\": \"Spanish\",\n \"resultDetectionProperty\": \"TRANSLATION\",\n \"selectionProperties\": {}\n },\n {\n \"type\": \"CSV_COLS\",\n \"expression\": \"Chinese\",\n \"resultDetectionProperty\": \"TRANSLATION\",\n \"selectionProperties\": {}\n }\n ]\n }\n ],\n \"pipelineName\": \"ARGOS TRANSLATION (WITH FASTTEXT LANGUAGE ID) TEXT FILE PIPELINE\",\n \"priority\": 4\n} The input file, test-csv-translation.csv , contains the content below. English,Spanish,Chinese\n\"Hello, how are you?\",\"\u00bfHola, c\u00f3mo est\u00e1s?\",\u4f60\u597d\u5417\uff1f\nWhere is the library?,\u00bfD\u00f3nde est\u00e1 la biblioteca?,\u56fe\u4e66\u9986\u5728\u54ea\u91cc\uff1f\nWhat time is it?,\u00bfQu\u00e9 hora es?,\u73b0\u5728\u662f\u51e0\u594c\uff1f The mediaSelectorsOutputUri field from the output object will refer to a document containing the\ncontent below. English,Spanish,Chinese\n\"Hello, how are you?\",\"Hello, how are you?\",How are you?\nWhere is the library?,Where's the library?,Where's the library?\nWhat time is it?,What time is it?,What time is it? If MEDIA_SELECTORS_DELIMETER was set to \" | Translation: \", the file would contain the content\nbelow. 
English,Spanish,Chinese\n\"Hello, how are you?\",\"\u00bfHola, c\u00f3mo est\u00e1s? | Translation: Hello, how are you?\",\u4f60\u597d\u5417\uff1f | Translation: How are you?\nWhere is the library?,\u00bfD\u00f3nde est\u00e1 la biblioteca? | Translation: Where's the library?,\u56fe\u4e66\u9986\u5728\u54ea\u91cc\uff1f | Translation: Where's the library?\nWhat time is it?,\u00bfQu\u00e9 hora es? | Translation: What time is it?,\u73b0\u5728\u662f\u51e0\u594c\uff1f | Translation: What time is it?",
"title": "CSV_COLS Output File"
},
{
diff --git a/docs/site/sitemap.xml b/docs/site/sitemap.xml
index c72fe20b7b13..4ecda5c550be 100644
--- a/docs/site/sitemap.xml
+++ b/docs/site/sitemap.xml
@@ -2,162 +2,162 @@
/index.html
- 2026-01-29
+ 2026-02-19
daily
/Release-Notes/index.html
- 2026-01-29
+ 2026-02-19
daily
/License-And-Distribution/index.html
- 2026-01-29
+ 2026-02-19
daily
/Acknowledgements/index.html
- 2026-01-29
+ 2026-02-19
daily
/Install-Guide/index.html
- 2026-01-29
+ 2026-02-19
daily
/Admin-Guide/index.html
- 2026-01-29
+ 2026-02-19
daily
/User-Guide/index.html
- 2026-01-29
+ 2026-02-19
daily
/OpenID-Connect-Guide/index.html
- 2026-01-29
+ 2026-02-19
daily
/Media-Segmentation-Guide/index.html
- 2026-01-29
+ 2026-02-19
daily
/Feed-Forward-Guide/index.html
- 2026-01-29
+ 2026-02-19
daily
/Derivative-Media-Guide/index.html
- 2026-01-29
+ 2026-02-19
daily
/Object-Storage-Guide/index.html
- 2026-01-29
+ 2026-02-19
daily
/Markup-Guide/index.html
- 2026-01-29
+ 2026-02-19
daily
/TiesDb-Guide/index.html
- 2026-01-29
+ 2026-02-19
daily
/Trigger-Guide/index.html
- 2026-01-29
+ 2026-02-19
daily
/Roll-Up-Guide/index.html
- 2026-01-29
+ 2026-02-19
daily
/Health-Check-Guide/index.html
- 2026-01-29
+ 2026-02-19
daily
/Artifact-Extraction-Guide/index.html
- 2026-01-29
+ 2026-02-19
daily
/Quality-Selection-Guide/index.html
- 2026-01-29
+ 2026-02-19
daily
/Media-Selectors-Guide/index.html
- 2026-01-29
+ 2026-02-19
daily
/REST-API/index.html
- 2026-01-29
+ 2026-02-19
daily
/Component-API-Overview/index.html
- 2026-01-29
+ 2026-02-19
daily
/Component-Descriptor-Reference/index.html
- 2026-01-29
+ 2026-02-19
daily
/CPP-Batch-Component-API/index.html
- 2026-01-29
+ 2026-02-19
daily
/Python-Batch-Component-API/index.html
- 2026-01-29
+ 2026-02-19
daily
/Java-Batch-Component-API/index.html
- 2026-01-29
+ 2026-02-19
daily
/GPU-Support-Guide/index.html
- 2026-01-29
+ 2026-02-19
daily
/Contributor-Guide/index.html
- 2026-01-29
+ 2026-02-19
daily
/Development-Environment-Guide/index.html
- 2026-01-29
+ 2026-02-19
daily
/Node-Guide/index.html
- 2026-01-29
+ 2026-02-19
daily
/Workflow-Manager-Architecture/index.html
- 2026-01-29
+ 2026-02-19
daily
/CPP-Streaming-Component-API/index.html
- 2026-01-29
+ 2026-02-19
daily
\ No newline at end of file