diff --git a/docs/docs/Artifact-Extraction-Guide.md b/docs/docs/Artifact-Extraction-Guide.md index 5672d781dc3b..b8702d2810fb 100644 --- a/docs/docs/Artifact-Extraction-Guide.md +++ b/docs/docs/Artifact-Extraction-Guide.md @@ -17,6 +17,9 @@ written to local shared storage, or to a remote S3 storage location. Refer to th The choice of which artifacts to extract is highly configurable using the following properties. +- `SUPPRESS_TRACKS`: When an action has this property set to `true`, no artifacts for that action + will be extracted and none of the other properties listed below will have any effect. + - `ARTIFACT_EXTRACTION_POLICY`: This property sets the high level policy controlling artifact extraction. It must have one of the following values: - `NONE`: No artifact extraction will be performed. diff --git a/docs/docs/Derivative-Media-Guide.md b/docs/docs/Derivative-Media-Guide.md index 0a2954221a36..9668d7911b20 100644 --- a/docs/docs/Derivative-Media-Guide.md +++ b/docs/docs/Derivative-Media-Guide.md @@ -67,150 +67,88 @@ To break down each stage of this pipeline:

- `KEYWORD TAGGING (WITH FF REGIONS) ACTION`: The KeywordTagging component will take the `TEXT` tracks from the previous `TIKA TEXT` and `TESSERACT OCR` actions and perform keyword tagging. This will add the `TAGS` - , `TRIGGER_WORDS`, and `TRIGGER_WORDS_OFFSET` properties to each track. + , `TRIGGER_WORDS`, and `TRIGGER_WORDS_OFFSET` properties to each track. The action has the + `IS_ANNOTATOR` property set to `TRUE`.

- `OCV GENERIC MARKUP DERIVATIVE MEDIA ONLY ACTION`: The Markup component will take the keyword-tagged `TEXT` tracks for the derivative media and draw bounding boxes on the extracted images. -## Task Merging -The large blue rectangles in the diagram represent tasks that are merged together. The purpose of task merging is to -consolidate how tracks are represented in the JSON output object by hiding redundant track information, and to make it -appear that the behaviors of two or more actions are the result of a single algorithm. +## Annotators -For example, keyword tagging behavior is supplemental to the text detection behavior. It's more important that `TEXT` -tracks are associated with the algorithm that performed text detection than the `KEYWORDTAGGING` algorithm. Note that in -our pipeline only the `KEYWORD TAGGING` action has the `OUTPUT_MERGE_WITH_PREVIOUS_TASK` property set to `TRUE`. It has -a similar effect in the source media flow and derivative media flow. +When a pipeline does not use derivative media, an action with `IS_ANNOTATOR=true` always annotates +the action immediately preceding it. When a pipeline uses derivative media, an action with +`IS_ANNOTATOR=true` annotates the closest preceding action that applies to the media type. In the example +above, the `KEYWORD TAGGING` action has `IS_ANNOTATOR=true`. -In the source media flow the `TIKA TEXT` action is at the start of the merge chain while the `KEYWORD TAGGING` action is -at the end of the merge chain. The tracks generated by the action at the end of the merge chain inherit the algorithm -and track type from the tracks at the beginning of the merge chain. The effect is that in the JSON output object the -tracks from the `TIKA TEXT` action will not be shown. Instead that action will be listed under `TRACKS MERGED`. 
The -tracks from the `KEYWORD TAGGING` action will be shown with the `TIKATEXT` algorithm and `TEXT` track type: +When determining which action `KEYWORD TAGGING` annotates in the source media flow, the +`TESSERACT OCR` and `EAST` actions are considered, but are not selected because neither applies to +the source media. The `TIKA TEXT` action is considered and then selected because it applies to the +source media. Below is example output for the source media. The tracks contained in the `TEXT` +section will include the properties added by the `TIKA TEXT` action and the properties added by the +`KEYWORD TAGGING` action. ```json -"output": { - "TRACKS MERGED": [ - { - "source": "+#TIKA IMAGE DETECTION ACTION#TIKA TEXT DETECTION SOURCE MEDIA ONLY ACTION", - "algorithm": "TIKATEXT" - } - ], - "MEDIA": [ - { - "source": "+#TIKA IMAGE DETECTION ACTION", - "algorithm": "TIKAIMAGE", - "tracks": [ ... ] - } - ], - "TEXT": [ - { - "source": "+#TIKA IMAGE DETECTION ACTION#TIKA TEXT DETECTION SOURCE MEDIA ONLY ACTION#KEYWORD TAGGING (WITH FF REGION) ACTION", - "algorithm": "TIKATEXT", - "tracks": [ ... ] - } - ] +{ + "output": { + "MEDIA": [ + { + "action": "TIKA IMAGE DETECTION ACTION", + "algorithm": "TIKAIMAGE", + "annotators": [], + "tracks": ["..."] + } + ], + "TEXT": [ + { + "action": "TIKA TEXT DETECTION SOURCE MEDIA ONLY ACTION", + "algorithm": "TIKATEXT", + "annotators": ["KEYWORD TAGGING (WITH FF REGION) ACTION"], + "tracks": ["..."] + } + ] + } } ``` -In the derivative media flow the `TESSERACT OCR` action is at the start of the merge chain while the `KEYWORD TAGGING` -action is at the end of the merge chain. The effect is that in the JSON output object the tracks from -the `TESSERACT OCR` action will not be shown. 
The tracks from the `KEYWORD TAGGING` action will be shown with -the `TESSERACTOCR` algorithm and `TEXT` track type: +When determining which action `KEYWORD TAGGING` annotates in the derivative media flow, +`TESSERACT OCR` is selected because it is the first action before `KEYWORD TAGGING` that applies to +derivative media. Below is example output for the derivative media. The tracks contained in the +`TEXT` section will include the properties added by the `TESSERACT OCR` action and the properties +added by the `KEYWORD TAGGING` action. ```json -"output": { - "NO TRACKS": [ - { - "source": "+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION#TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION#KEYWORD TAGGING (WITH FF REGION) ACTION#OCV GENERIC MARKUP DERIVATIVE MEDIA ONLY ACTION", - "algorithm": "MARKUPCV" - } - ], - "TRACKS MERGED": [ - { - "source": "+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION#TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION", - "algorithm": "TESSERACTOCR" - } - ], - "TEXT": [ - { - "source": "+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION#TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION#KEYWORD TAGGING (WITH FF REGION) ACTION", - "algorithm": "TESSERACTOCR", - "tracks": [ ... ] - } - ], - "TEXT REGION": [ - { - "source": "+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION", - "algorithm": "EAST", - "tracks": [ ... 
] - } - ] +{ + "output": { + "NO TRACKS": [ + { + "action": "OCV GENERIC MARKUP DERIVATIVE MEDIA ONLY ACTION", + "algorithm": "MARKUPCV", + "annotators": [] + } + ], + "TEXT": [ + { + "action": "TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION", + "algorithm": "TESSERACTOCR", + "annotators": ["KEYWORD TAGGING (WITH FF REGION) ACTION"], + "tracks": ["..."] + } + ], + "TEXT REGION": [ + { + "action": "EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION", + "algorithm": "EAST", + "annotators": [], + "tracks": ["..."] + } + ] + } } ``` +Note that a `MARKUP` action will never generate new tracks. It simply fills out the +`media.markupResult` field in the JSON output object (not shown above). -Note that a `MARKUP` action will never generate new tracks. It simply fills out the `media.markupResult` field in the -JSON output object (not shown above). - -## Output Last Task Only - -If you want to omit all tracks from the JSON output object but the respective `TEXT` tracks for the source and -derivative media, then in you can also set the `OUTPUT_LAST_TASK_ONLY` job property to `TRUE`. Note that the WFM only -considers tasks that use `DETECTION` algorithms as the final task, so `MARKUP` is ignored. Setting this property will -result in the following JSON for the source media: - -```json -"output": { - "TRACKS SUPPRESSED": [ - { - "source": "+#TIKA IMAGE DETECTION ACTION", - "algorithm": "TIKAIMAGE" - }, - { - "source": "+#TIKA IMAGE DETECTION ACTION#TIKA TEXT DETECTION SOURCE MEDIA ONLY ACTION", - "algorithm": "TIKATEXT" - } - ], - "TEXT": [ - { - "source": "+#TIKA IMAGE DETECTION ACTION#TIKA TEXT DETECTION SOURCE MEDIA ONLY ACTION#KEYWORD TAGGING (WITH FF REGION) ACTION", - "algorithm": "TIKATEXT", - "tracks": [ ... 
] - } - ] -} -``` - -And the following JSON for the derivative media: - -```json -"output": { - "NO TRACKS": [ - { - "source": "+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION#TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION#KEYWORD TAGGING (WITH FF REGION) ACTION#OCV GENERIC MARKUP DERIVATIVE MEDIA ONLY ACTION", - "algorithm": "MARKUPCV" - } - ], - "TRACKS SUPPRESSED": [ - { - "source": "+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION", - "algorithm": "EAST" - }, - { - "source": "+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION#TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION", - "algorithm": "TESSERACTOCR" - } - ], - "TEXT": [ - { - "source": "+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION#TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION#KEYWORD TAGGING (WITH FF REGION) ACTION", - "algorithm": "TESSERACTOCR", - "tracks": [ ... ] - } - ] -} -``` # Developing Media Extraction Components @@ -235,4 +173,4 @@ that components in the subsequent pipeline stages can handle the media type dete # Default Pipelines -OpenMPF comes with some default pipelines for detecting text in documents and other pipelines for detecting faces in documents. Refer to the TikaImageDetection [`descriptor.json`](https://github.com/openmpf/openmpf-components/blob/master/java/TikaImageDetection/plugin-files/descriptor/descriptor.json). \ No newline at end of file +OpenMPF comes with some default pipelines for detecting text in documents and other pipelines for detecting faces in documents. Refer to the TikaImageDetection [`descriptor.json`](https://github.com/openmpf/openmpf-components/blob/master/java/TikaImageDetection/plugin-files/descriptor/descriptor.json). diff --git a/docs/docs/Media-Selectors-Guide.md b/docs/docs/Media-Selectors-Guide.md index 1122580f8c74..bae9f03677d8 100644 --- a/docs/docs/Media-Selectors-Guide.md +++ b/docs/docs/Media-Selectors-Guide.md @@ -23,7 +23,7 @@ pipeline. 
The first stage performs language identification. The second performs { "mediaUri": "file:///opt/mpf/share/remote-media/test-json-path-translation.json", "properties": {}, - "mediaSelectorsOutputAction": "ARGOS TRANSLATION (WITH FF REGION AND NO TASK MERGING) ACTION", + "mediaSelectorsOutputAction": "ARGOS TRANSLATION (WITH FF REGION AND NOT ANNOTATOR) ACTION", "mediaSelectors": [ { "type": "JSON_PATH", @@ -406,7 +406,7 @@ pipeline. The first stage performs language identification. The second performs { "mediaUri": "file:///opt/mpf/share/remote-media/test-csv-translation.csv", "properties": {}, - "mediaSelectorsOutputAction": "ARGOS TRANSLATION (WITH FF REGION AND NO TASK MERGING) ACTION", + "mediaSelectorsOutputAction": "ARGOS TRANSLATION (WITH FF REGION AND NOT ANNOTATOR) ACTION", "mediaSelectors": [ { "type": "CSV_COLS", diff --git a/docs/docs/img/derivative-media-pipeline.png b/docs/docs/img/derivative-media-pipeline.png index 2c6523e1ee4a..b17ba93bcc40 100644 Binary files a/docs/docs/img/derivative-media-pipeline.png and b/docs/docs/img/derivative-media-pipeline.png differ diff --git a/docs/site/Artifact-Extraction-Guide/index.html b/docs/site/Artifact-Extraction-Guide/index.html index 47080cb21f9e..e1a6c873c263 100644 --- a/docs/site/Artifact-Extraction-Guide/index.html +++ b/docs/site/Artifact-Extraction-Guide/index.html @@ -279,8 +279,14 @@

Introduction

Artifact Extraction Properties

The choice of which artifacts to extract is highly configurable using the following properties.