3 changes: 3 additions & 0 deletions docs/docs/Artifact-Extraction-Guide.md
@@ -17,6 +17,9 @@ written to local shared storage, or to a remote S3 storage location. Refer to th

The choice of which artifacts to extract is highly configurable using the following properties.

- `SUPPRESS_TRACKS`: When an action has this property set to `true`, no artifacts for that action
will be extracted and none of the other properties listed below will have any effect.

- `ARTIFACT_EXTRACTION_POLICY`: This property sets the high level policy controlling artifact extraction. It must have
one of the following values:
- `NONE`: No artifact extraction will be performed.
194 changes: 66 additions & 128 deletions docs/docs/Derivative-Media-Guide.md
@@ -67,150 +67,88 @@ To break down each stage of this pipeline:
<br/><br/>
- `KEYWORD TAGGING (WITH FF REGIONS) ACTION`: The KeywordTagging component will take the `TEXT` tracks from the
previous `TIKA TEXT` and `TESSERACT OCR` actions and perform keyword tagging. This will add the `TAGS`
, `TRIGGER_WORDS`, and `TRIGGER_WORDS_OFFSET` properties to each track.
, `TRIGGER_WORDS`, and `TRIGGER_WORDS_OFFSET` properties to each track. The action has the
`IS_ANNOTATOR` property set to `TRUE`.
<br/><br/>
- `OCV GENERIC MARKUP DERIVATIVE MEDIA ONLY ACTION`: The Markup component will take the keyword-tagged `TEXT` tracks for
the derivative media and draw bounding boxes on the extracted images.

## Task Merging

The large blue rectangles in the diagram represent tasks that are merged together. The purpose of task merging is to
consolidate how tracks are represented in the JSON output object by hiding redundant track information, and to make it
appear that the behaviors of two or more actions are the result of a single algorithm.
## Annotators

For example, keyword tagging behavior is supplemental to the text detection behavior. It's more important that `TEXT`
tracks are associated with the algorithm that performed text detection than the `KEYWORDTAGGING` algorithm. Note that in
our pipeline only the `KEYWORD TAGGING` action has the `OUTPUT_MERGE_WITH_PREVIOUS_TASK` property set to `TRUE`. It has
a similar effect in the source media flow and derivative media flow.
When a pipeline does not use derivative media, an action with `IS_ANNOTATOR=true` always annotates
the action immediately preceding it. When a pipeline uses derivative media, an action with
`IS_ANNOTATOR=true` annotates the nearest preceding action that was applicable to the media type.
In the example above, the `KEYWORD TAGGING` action has `IS_ANNOTATOR=true`.
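
The selection rule described above can be sketched in a few lines of Python. This is only an illustrative model, not the Workflow Manager's implementation: the `Action` class, its `source_media_only` and `derivative_media_only` fields, and the `find_annotated_action` helper are hypothetical names introduced here. The action names and their order are taken from the example pipeline.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Action:
    # Hypothetical model of a pipeline action; field names are assumptions.
    name: str
    is_annotator: bool = False
    source_media_only: bool = False
    derivative_media_only: bool = False

    def applies_to(self, is_derivative: bool) -> bool:
        """Return True if this action runs on the given kind of media."""
        if is_derivative:
            return not self.source_media_only
        return not self.derivative_media_only


def find_annotated_action(
    pipeline: List[Action], annotator_index: int, is_derivative: bool
) -> Optional[Action]:
    """Walk backwards from the annotator and return the nearest preceding
    action that applies to the media type, or None if there is none."""
    for action in reversed(pipeline[:annotator_index]):
        if action.applies_to(is_derivative):
            return action
    return None


KEYWORD_TAGGING = Action("KEYWORD TAGGING (WITH FF REGION) ACTION", is_annotator=True)

PIPELINE = [
    Action("TIKA IMAGE DETECTION ACTION", derivative_media_only=False),
    Action("TIKA TEXT DETECTION SOURCE MEDIA ONLY ACTION", source_media_only=True),
    Action("EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION", derivative_media_only=True),
    Action(
        "TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION",
        derivative_media_only=True,
    ),
    KEYWORD_TAGGING,
    Action("OCV GENERIC MARKUP DERIVATIVE MEDIA ONLY ACTION", derivative_media_only=True),
]

annotator_index = PIPELINE.index(KEYWORD_TAGGING)
source_target = find_annotated_action(PIPELINE, annotator_index, is_derivative=False)
derivative_target = find_annotated_action(PIPELINE, annotator_index, is_derivative=True)
```

In this model, `source_target` is the `TIKA TEXT` action and `derivative_target` is the `TESSERACT OCR` action. When every action applies to the media, the loop returns the action immediately before the annotator, which corresponds to the case without derivative media.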

In the source media flow the `TIKA TEXT` action is at the start of the merge chain while the `KEYWORD TAGGING` action is
at the end of the merge chain. The tracks generated by the action at the end of the merge chain inherit the algorithm
and track type from the tracks at the beginning of the merge chain. The effect is that in the JSON output object the
tracks from the `TIKA TEXT` action will not be shown. Instead that action will be listed under `TRACKS MERGED`. The
tracks from the `KEYWORD TAGGING` action will be shown with the `TIKATEXT` algorithm and `TEXT` track type:
When determining which action `KEYWORD TAGGING` annotates in the source media flow, the
`TESSERACT OCR` and `EAST` actions are considered, but are not selected because neither applies to
the source media. The `TIKA TEXT` action is considered and then selected because it applies to the
source media. Below is example output for the source media. The tracks contained in the `TEXT`
section will include the properties added by the `TIKA TEXT` action and the properties added by the
`KEYWORD TAGGING` action.

```json
"output": {
"TRACKS MERGED": [
{
"source": "+#TIKA IMAGE DETECTION ACTION#TIKA TEXT DETECTION SOURCE MEDIA ONLY ACTION",
"algorithm": "TIKATEXT"
}
],
"MEDIA": [
{
"source": "+#TIKA IMAGE DETECTION ACTION",
"algorithm": "TIKAIMAGE",
"tracks": [ ... ]
}
],
"TEXT": [
{
"source": "+#TIKA IMAGE DETECTION ACTION#TIKA TEXT DETECTION SOURCE MEDIA ONLY ACTION#KEYWORD TAGGING (WITH FF REGION) ACTION",
"algorithm": "TIKATEXT",
"tracks": [ ... ]
}
]
{
"output": {
"MEDIA": [
{
"action": "TIKA IMAGE DETECTION ACTION",
"algorithm": "TIKAIMAGE",
"annotators": [],
"tracks": ["..."]
}
],
"TEXT": [
{
"action": "TIKA TEXT DETECTION SOURCE MEDIA ONLY ACTION",
"algorithm": "TIKATEXT",
"annotators": ["KEYWORD TAGGING (WITH FF REGION) ACTION"],
"tracks": ["..."]
}
]
}
}
```

In the derivative media flow the `TESSERACT OCR` action is at the start of the merge chain while the `KEYWORD TAGGING`
action is at the end of the merge chain. The effect is that in the JSON output object the tracks from
the `TESSERACT OCR` action will not be shown. The tracks from the `KEYWORD TAGGING` action will be shown with
the `TESSERACTOCR` algorithm and `TEXT` track type:
When determining which action `KEYWORD TAGGING` annotates in the derivative media flow,
`TESSERACT OCR` is selected because it is the nearest action before `KEYWORD TAGGING` that applies to
derivative media. Below is example output for the derivative media. The tracks contained in the
`TEXT` section will include the properties added by the `TESSERACT OCR` action and the properties
added by the `KEYWORD TAGGING` action.

```json
"output": {
"NO TRACKS": [
{
"source": "+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION#TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION#KEYWORD TAGGING (WITH FF REGION) ACTION#OCV GENERIC MARKUP DERIVATIVE MEDIA ONLY ACTION",
"algorithm": "MARKUPCV"
}
],
"TRACKS MERGED": [
{
"source": "+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION#TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION",
"algorithm": "TESSERACTOCR"
}
],
"TEXT": [
{
"source": "+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION#TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION#KEYWORD TAGGING (WITH FF REGION) ACTION",
"algorithm": "TESSERACTOCR",
"tracks": [ ... ]
}
],
"TEXT REGION": [
{
"source": "+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION",
"algorithm": "EAST",
"tracks": [ ... ]
}
]
{
"output": {
"NO TRACKS": [
{
"action": "OCV GENERIC MARKUP DERIVATIVE MEDIA ONLY ACTION",
"algorithm": "MARKUPCV",
"annotators": []
}
],
"TEXT": [
{
"action": "TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION",
"algorithm": "TESSERACTOCR",
"annotators": ["KEYWORD TAGGING (WITH FF REGION) ACTION"],
"tracks": ["..."]
}
],
"TEXT REGION": [
{
"action": "EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION",
"algorithm": "EAST",
"annotators": [],
"tracks": ["..."]
}
]
}
}
```
Note that a `MARKUP` action will never generate new tracks. It simply fills out the
`media.markupResult` field in the JSON output object (not shown above).
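
A consumer of the JSON output object can read the `annotators` list to see which actions contributed properties to each group of tracks. The following is a minimal sketch that assumes the output has the shape shown in the examples above. The `annotators_by_action` helper is a hypothetical name, and the embedded JSON is a shortened copy of the derivative media example with empty `tracks` lists.

```python
import json

# Shortened copy of the derivative media example; "tracks" left empty for brevity.
EXAMPLE_OUTPUT = json.loads("""
{
  "output": {
    "TEXT": [
      {
        "action": "TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION",
        "algorithm": "TESSERACTOCR",
        "annotators": ["KEYWORD TAGGING (WITH FF REGION) ACTION"],
        "tracks": []
      }
    ],
    "TEXT REGION": [
      {
        "action": "EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION",
        "algorithm": "EAST",
        "annotators": [],
        "tracks": []
      }
    ]
  }
}
""")


def annotators_by_action(media_output: dict) -> dict:
    """Map each action name in the output to the list of its annotators."""
    result = {}
    for entries in media_output.get("output", {}).values():
        for entry in entries:
            # Treat a missing "annotators" key as "no annotators".
            result[entry["action"]] = list(entry.get("annotators", []))
    return result


ANNOTATORS = annotators_by_action(EXAMPLE_OUTPUT)
```

Here `ANNOTATORS` maps the `TESSERACT OCR` action to a one-element list containing the `KEYWORD TAGGING` action, and the `EAST` action to an empty list.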

Note that a `MARKUP` action will never generate new tracks. It simply fills out the `media.markupResult` field in the
JSON output object (not shown above).

## Output Last Task Only

If you want to omit all tracks from the JSON output object except the respective `TEXT` tracks for the source and
derivative media, then you can also set the `OUTPUT_LAST_TASK_ONLY` job property to `TRUE`. Note that the WFM only
considers tasks that use `DETECTION` algorithms as the final task, so `MARKUP` is ignored. Setting this property will
result in the following JSON for the source media:

```json
"output": {
"TRACKS SUPPRESSED": [
{
"source": "+#TIKA IMAGE DETECTION ACTION",
"algorithm": "TIKAIMAGE"
},
{
"source": "+#TIKA IMAGE DETECTION ACTION#TIKA TEXT DETECTION SOURCE MEDIA ONLY ACTION",
"algorithm": "TIKATEXT"
}
],
"TEXT": [
{
"source": "+#TIKA IMAGE DETECTION ACTION#TIKA TEXT DETECTION SOURCE MEDIA ONLY ACTION#KEYWORD TAGGING (WITH FF REGION) ACTION",
"algorithm": "TIKATEXT",
"tracks": [ ... ]
}
]
}
```

And the following JSON for the derivative media:

```json
"output": {
"NO TRACKS": [
{
"source": "+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION#TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION#KEYWORD TAGGING (WITH FF REGION) ACTION#OCV GENERIC MARKUP DERIVATIVE MEDIA ONLY ACTION",
"algorithm": "MARKUPCV"
}
],
"TRACKS SUPPRESSED": [
{
"source": "+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION",
"algorithm": "EAST"
},
{
"source": "+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION#TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION",
"algorithm": "TESSERACTOCR"
}
],
"TEXT": [
{
"source": "+#EAST TEXT DETECTION DERIVATIVE MEDIA ONLY ACTION#TESSERACT OCR TEXT DETECTION (WITH FF REGION) DERIVATIVE MEDIA ONLY ACTION#KEYWORD TAGGING (WITH FF REGION) ACTION",
"algorithm": "TESSERACTOCR",
"tracks": [ ... ]
}
]
}
```

# Developing Media Extraction Components

@@ -235,4 +173,4 @@ that components in the subsequent pipeline stages can handle the media type dete

# Default Pipelines

OpenMPF comes with some default pipelines for detecting text in documents and other pipelines for detecting faces in documents. Refer to the TikaImageDetection [`descriptor.json`](https://github.com/openmpf/openmpf-components/blob/master/java/TikaImageDetection/plugin-files/descriptor/descriptor.json).
OpenMPF comes with some default pipelines for detecting text in documents and other pipelines for detecting faces in documents. Refer to the TikaImageDetection [`descriptor.json`](https://github.com/openmpf/openmpf-components/blob/master/java/TikaImageDetection/plugin-files/descriptor/descriptor.json).
4 changes: 2 additions & 2 deletions docs/docs/Media-Selectors-Guide.md
@@ -23,7 +23,7 @@ pipeline. The first stage performs language identification. The second performs
{
"mediaUri": "file:///opt/mpf/share/remote-media/test-json-path-translation.json",
"properties": {},
"mediaSelectorsOutputAction": "ARGOS TRANSLATION (WITH FF REGION AND NO TASK MERGING) ACTION",
"mediaSelectorsOutputAction": "ARGOS TRANSLATION (WITH FF REGION AND NOT ANNOTATOR) ACTION",
"mediaSelectors": [
{
"type": "JSON_PATH",
@@ -406,7 +406,7 @@ pipeline. The first stage performs language identification. The second performs
{
"mediaUri": "file:///opt/mpf/share/remote-media/test-csv-translation.csv",
"properties": {},
"mediaSelectorsOutputAction": "ARGOS TRANSLATION (WITH FF REGION AND NO TASK MERGING) ACTION",
"mediaSelectorsOutputAction": "ARGOS TRANSLATION (WITH FF REGION AND NOT ANNOTATOR) ACTION",
"mediaSelectors": [
{
"type": "CSV_COLS",
Binary file modified docs/docs/img/derivative-media-pipeline.png
10 changes: 8 additions & 2 deletions docs/site/Artifact-Extraction-Guide/index.html
@@ -279,8 +279,14 @@ <h1 id="introduction">Introduction</h1>
<h1 id="artifact-extraction-properties">Artifact Extraction Properties</h1>
<p>The choice of which artifacts to extract is highly configurable using the following properties.</p>
<ul>
<li><code>ARTIFACT_EXTRACTION_POLICY</code>: This property sets the high level policy controlling artifact extraction. It must have
one of the following values:<ul>
<li>
<p><code>SUPPRESS_TRACKS</code>: When an action has this property set to <code>true</code>, no artifacts for that action
will be extracted and none of the other properties listed below will have any effect.</p>
</li>
<li>
<p><code>ARTIFACT_EXTRACTION_POLICY</code>: This property sets the high level policy controlling artifact extraction. It must have
one of the following values:</p>
<ul>
<li><code>NONE</code>: No artifact extraction will be performed.</li>
<li><code>VISUAL_TYPES_ONLY</code>: Extract artifacts only for tracks associated with a "visual" data type. Visual data types
include <code>IMAGE</code> and <code>VIDEO</code>.</li>