Content Understanding GA SDK for Java#47952

Open
changjian-wang wants to merge 188 commits into main from cu_sdk/ga

@changjian-wang changjian-wang commented Feb 10, 2026

PR #47952 Review Guide: Content Understanding GA SDK for Java

Executive Summary

This PR introduces the brand new GA release of the azure-ai-contentunderstanding Java SDK. Since this is a new package, all 191 files are additions (+43,291 lines). The package includes:

  1. Customizations (1 file, 1,982 lines): ContentUnderstandingCustomizations.java — JavaParser AST transformations applied post-generation. This is the source of truth for all hand-authored classes and code modifications.
  2. Clients (6 files, 4,668 lines): Sync/async clients, builder, service version — generated then customized.
  3. Models (75 files, 12,703 lines): Public model classes — mostly generated, 7 are hand-authored via customization.
  4. Implementation (11 files, 7,619 lines): Internal REST client, polling strategies, request models — fully generated.
  5. Samples (32 files, 6,249 lines): 16 scenarios × sync + async pairs, plus binary resources.
  6. Tests (38 files, 8,412 lines): 3 unit tests + 34 recorded sample tests + test base class.
  7. Infrastructure (16 files, 1,238 lines): CI, Bicep, CODEOWNERS, README, CHANGELOG, Maven config.

Key customizations in ContentUnderstandingCustomizations.java:

  • Hand-authored model classes: ContentSource, DocumentSource, AudioVisualSource, PointF, RectangleF, Rectangle, ContentRange
  • Typed getValue() on all ContentField subclasses
  • Duration getters wrapping raw millisecond fields on time-based models
  • Flattened convenience method parameters (no wrapper request objects)
  • Internal models hidden from public API
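
As a rough illustration of the typed getValue() customization listed above, the sketch below uses a covariant return type so callers get a String or Double without casting the value. FieldSketch, StringFieldSketch, and NumberFieldSketch are hypothetical stand-ins, not the SDK's ContentField classes, and the real generated code may differ.

```java
// Hypothetical stand-in for a polymorphic field base class (not the SDK's ContentField).
abstract class FieldSketch {
    // The base class can only promise an untyped value.
    abstract Object getValue();
}

// Hypothetical string field: narrows the return type via a covariant override.
final class StringFieldSketch extends FieldSketch {
    private final String value;

    StringFieldSketch(String value) {
        this.value = value;
    }

    @Override
    String getValue() {
        return value;
    }
}

// Hypothetical number field: same pattern with a Double return type.
final class NumberFieldSketch extends FieldSketch {
    private final Double value;

    NumberFieldSketch(Double value) {
        this.value = value;
    }

    @Override
    Double getValue() {
        return value;
    }
}
```

With this shape, code that already holds a StringFieldSketch reads the value as a String directly, while code holding only the base type still gets an Object and must check the subtype first.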

Changes This Week (since Monday Feb 24)

If you've already reviewed an earlier version of this PR, here's what changed this week across 18 commits touching 95+ files. The changes fall into 5 categories:

1. ContentSource hierarchy (new feature)

Added hand-authored ContentSource, DocumentSource, AudioVisualSource, PointF, RectangleF, Rectangle classes with parsing logic. Added getSources() to ContentField (consistent with .NET Sources property).

  • customization/.../ContentUnderstandingCustomizations.java — added ContentSource, DocumentSource, AudioVisualSource, PointF, RectangleF, Rectangle classes + getSources() customization
  • models/ContentSource.java, DocumentSource.java, AudioVisualSource.java — generated output of the above classes
  • models/PointF.java, RectangleF.java, Rectangle.java — generated output of the above geometry types
  • models/ContentField.java — added getSources() method
  • tests/ContentSourceTest.java — new unit tests
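
To make the relationship between a source polygon and its bounding box concrete, here is a minimal sketch that derives an axis-aligned box from a list of points. PointSketch, RectSketch, and BoundingBoxSketch are hypothetical names, the x/y/width/height layout is an assumption, and nothing here is taken from the actual DocumentSource or RectangleF implementation.

```java
import java.util.List;

// Hypothetical float point, loosely analogous to PointF.
final class PointSketch {
    final float x;
    final float y;

    PointSketch(float x, float y) {
        this.x = x;
        this.y = y;
    }
}

// Hypothetical float rectangle; the x/y/width/height layout is an assumption.
final class RectSketch {
    final float x;
    final float y;
    final float width;
    final float height;

    RectSketch(float x, float y, float width, float height) {
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }
}

final class BoundingBoxSketch {
    // Returns the smallest axis-aligned rectangle containing every polygon point.
    static RectSketch fromPolygon(List<PointSketch> polygon) {
        if (polygon == null || polygon.isEmpty()) {
            throw new IllegalArgumentException("polygon must contain at least one point");
        }
        float minX = Float.POSITIVE_INFINITY;
        float minY = Float.POSITIVE_INFINITY;
        float maxX = Float.NEGATIVE_INFINITY;
        float maxY = Float.NEGATIVE_INFINITY;
        for (PointSketch p : polygon) {
            minX = Math.min(minX, p.x);
            minY = Math.min(minY, p.y);
            maxX = Math.max(maxX, p.x);
            maxY = Math.max(maxY, p.y);
        }
        return new RectSketch(minX, minY, maxX - minX, maxY - minY);
    }
}
```

When reviewing the real parsing classes, a useful check is whether degenerate inputs (an empty polygon, a single point) are handled explicitly, as the guard above does.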

2. Duration property customizations (new feature)

Hid raw *Ms() getters (package-private) on AudioVisualContent, AudioVisualContentSegment, TranscriptPhrase, TranscriptWord. Added Duration-returning getters (getStartTime(), getEndTime(), etc.). Removed getTimeMs() from AudioVisualSource.

  • customization/.../ContentUnderstandingCustomizations.java: customizeDurationProperties() method
  • models/AudioVisualContent.java — hidden getStartTimeMs/getEndTimeMs/getCameraShotTimesMs/getKeyFrameTimesMs, added Duration getters
  • models/AudioVisualContentSegment.java, TranscriptPhrase.java, TranscriptWord.java — same pattern
  • models/ContentRange.java: Duration-based factory methods (timeRange, timeRangeFrom)
  • tests/DurationCustomizationTest.java, ContentRangeTest.java — new tests
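
The Duration pattern described in this section can be sketched roughly as follows: the raw millisecond getter stays package-private, and a public getter wraps it in java.time.Duration. TimedSpanSketch is a hypothetical class used only for illustration; the generated models have more fields and are produced by the customization, not written by hand like this.

```java
import java.time.Duration;

// Hypothetical stand-in for a time-based model such as a transcript word.
final class TimedSpanSketch {
    private final long startTimeMs;
    private final long endTimeMs;

    TimedSpanSketch(long startTimeMs, long endTimeMs) {
        this.startTimeMs = startTimeMs;
        this.endTimeMs = endTimeMs;
    }

    // Raw getters are package-private, so they drop out of the public API
    // but remain available to serialization code in the same package.
    long getStartTimeMs() {
        return startTimeMs;
    }

    long getEndTimeMs() {
        return endTimeMs;
    }

    // Public, typed accessors wrap the raw millisecond values.
    public Duration getStartTime() {
        return Duration.ofMillis(startTimeMs);
    }

    public Duration getEndTime() {
        return Duration.ofMillis(endTimeMs);
    }
}
```

A caller can then compute a span length with Duration arithmetic, for example `span.getEndTime().minus(span.getStartTime())`, instead of subtracting raw longs.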

3. Type renames from TypeSpec GA update (bulk rename)

AnalyzeInput → AnalysisInput, AnalyzeResult → AnalysisResult, MediaContent → AnalysisContent, plus Content-prefixed field types. This touched 70 files but is mostly mechanical find-and-replace across models, samples, and tests.

  • models/Analysis*.java, Content*Field.java — renamed types
  • All Sample*.java and Sample*Test.java — updated type references
  • tsp-location.yaml — updated TypeSpec commit

4. Property renaming & parameter fixes

  • ContentAnalyzerConfig.set*Enabled() — property rename (touched 36 files: config model, all samples using analyzers, all corresponding tests)
  • *Request1 bogus parameter name fix — cleaned up AnalyzeRequest1/GrantCopyAuthorizationRequest1 parameter names in client convenience methods

5. README update

  • README.md: fixed AnalyzeResult → AnalysisResult (2 occurrences)

Files to focus on if you've already reviewed earlier

| Priority | Files | Reason |
|---|---|---|
| High | ContentUnderstandingCustomizations.java | Major additions: ContentSource hierarchy, Duration customizations, parameter fixes |
| High | ContentSource.java, DocumentSource.java, AudioVisualSource.java | New hand-authored parsing classes |
| High | ContentField.java | Added getSources() |
| Medium | ContentRange.java | New Duration-based factory methods |
| Medium | PointF.java, RectangleF.java, Rectangle.java | New geometry types |
| Medium | ContentSourceTest.java, DurationCustomizationTest.java, ContentRangeTest.java | New unit tests |
| Low | All other model/sample/test changes | Mechanical renames from TypeSpec update |

P0 — Must Review (Public API, customizations, README)

| File | Lines | Reason |
|---|---|---|
| customization/.../ContentUnderstandingCustomizations.java | 1,982 | Source of truth for all post-gen customizations: AST transforms, field value accessors, Duration wrappers. Contains 7 hand-authored classes as string constants; review their output files in models/ instead (see P2 section). |
| README.md | 460 | User-facing documentation: getting started, authentication, key concepts, code examples, troubleshooting |
| .../ContentUnderstandingClient.java | 2,109 | Sync client: public API surface with convenience methods for analyze, CRUD analyzers, copy, result files |
| .../ContentUnderstandingAsyncClient.java | 2,141 | Async client: mirrors sync client with Mono/Flux return types |
| .../ContentUnderstandingClientBuilder.java | 356 | Client builder: credential, endpoint, pipeline, retry, service version config |
| .../ContentUnderstandingServiceVersion.java | 40 | API version enum: should have V2025_11_01 |
| .../models/ContentField.java | 283 | Base field class: polymorphic deserialization hub for all content field types |
| .../models/AnalysisResult.java | 199 | Analysis result: top-level response from analyze operations |
| .../models/ContentAnalyzerConfig.java | 754 | Analyzer config: complex model for creating/updating analyzers with field schemas |
| CHANGELOG.md | 12 | Release notes: verify correct version and feature list |

Total P0: 10 files, ~8,336 lines


P1 — Should Review (Key samples, key models, test infra, CI config)

| File | Lines | Reason |
|---|---|---|
| Sample02_AnalyzeUrl.java | 284 | Flagship sample: full analyze workflow with field extraction and content traversal |
| Sample03_AnalyzeInvoice.java | 191 | Invoice extraction sample: common customer scenario |
| Sample04_CreateAnalyzer.java | 251 | Custom analyzer sample: shows analyzer creation with field schema |
| Sample16_CreateAnalyzerWithLabels.java | 248 | Labels sample: new GA feature for labeled training data |
| Sample14_CopyAnalyzer.java | 224 | Copy analyzer sample: cross-resource copy workflow |
| .../models/ContentAnalyzer.java | 603 | Analyzer model: core entity with status, config, usage details |
| .../models/AnalysisContent.java | 300 | Content base class: polymorphic parent of DocumentContent / AudioVisualContent |
| .../models/DocumentContent.java | 353 | Document content: pages, tables, figures, paragraphs |
| .../models/AudioVisualContent.java | 338 | AV content: transcript segments, phrases |
| .../models/ContentFieldDefinition.java | 524 | Field definition: detailed field config with description, examples |
| .../models/ContentFieldSchema.java | 286 | Field schema: defines field types in analyzer config |
| ci.yml | 46 | CI pipeline: verify correct package references and test configuration |
| test-resources.bicep | 139 | Azure deployment: Bicep template for test resources |
| test-resources-post.ps1 | 414 | Post-deploy script: model deployments and prebuilt analyzer setup |
| pom.xml (package-level) | 75 | Maven config: dependencies, plugins, version |

Total P1: 15 files, ~4,276 lines


P2 — Skim / Low Priority

Hand-authored model classes (7 files, 909 lines)

These classes are written as string constants inside ContentUnderstandingCustomizations.java and emitted as files during code generation. Review the generated output files instead — they are real .java files with full syntax highlighting, which is much easier to read than the escaped string constants in the customization class:

| Review this file (in models/) | Lines | Purpose |
|---|---|---|
| ContentSource.java | 146 | Base content source: parse(), toRawString(), abstract hierarchy |
| DocumentSource.java | 150 | Document source: page number, polygon, bounding box parsing |
| AudioVisualSource.java | 147 | AV source: time (Duration), bounding box parsing |
| ContentRange.java | 211 | Range builder: page(), pages(), timeRange(Duration), combine() |
| PointF.java | 68 | Float point (x, y) |
| RectangleF.java | 95 | Float rectangle (4 corners) |
| Rectangle.java | 92 | Integer rectangle (4 corners) |

Tip: The source of truth is the string constants in ContentUnderstandingCustomizations.java, but the content is identical to the output files above. Reviewing the output files is the intended workflow — the customization file is just the delivery mechanism.

Remaining generated models (68 files, ~11,794 lines)

All other files in models/. Auto-generated from TypeSpec — spot-check a few for correctness but full review is low value.

Generated implementation (11 files, 7,619 lines)

ContentUnderstandingClientImpl.java (6,574 lines) is the main REST client. Polling strategies and helpers are also generated. Low review value.

Remaining samples (27 files, ~5,051 lines)

Sync + async pairs for: Sample00_UpdateDefaults, Sample01_AnalyzeBinary, Sample05_CreateClassifier, Sample06_GetAnalyzer, Sample08_UpdateAnalyzer, Sample09_DeleteAnalyzer, Sample10_AnalyzeConfigs, Sample11_AnalyzeReturnRawJson, Sample12_GetResultFile, Sample13_DeleteResult, Sample15_GrantCopyAuth, plus async variants of P1 samples.

Tests (38 files, 8,412 lines)

  • 3 unit tests: ContentRangeTest (185), ContentSourceTest (279), DurationCustomizationTest (146)
  • 34 sample tests: Recorded playback tests mirroring each sample (sync + async)
  • ContentUnderstandingClientTestBase.java (73): Shared test setup

Infrastructure (low-churn files)

| File | Lines | Notes |
|---|---|---|
| .github/CODEOWNERS | +6 | Adds contentunderstanding ownership |
| eng/versioning/version_client.txt | +1 | Registers package version |
| pom.xml (root) | +1 | Adds module reference |
| pom.xml (service) | 15 | Service-level parent POM |
| tests.yml | 14 | Test pipeline config |
| tsp-location.yaml | 4 | TypeSpec commit reference |
| .gitignore | 6 | Standard ignores |
| cspell.json | 18 | Spell check word list |
| assets.json | 6 | Test recording asset reference |
| customization/pom.xml | 21 | Customization build config |
| src/main/resources/ (3 files) | 135 | API review metadata, package properties |

Sample resources (binary, not reviewable)

  • mixed_financial_docs.pdf, sample_document_features.pdf, sample_invoice.pdf
  • receipt_labels/ — 2 receipt images + labels JSON + result JSON

Review Tips

  1. Start with ContentUnderstandingCustomizations.java — this is the most critical file. It defines all hand-authored classes and AST transformations. Understanding it makes the rest of the PR much clearer.
  2. Compare sync/async clients — they should mirror each other. Review one in detail, then spot-check the other for consistency.
  3. Skip generated model files unless a specific model looks wrong — they are regenerated from TypeSpec on every tsp-client update.
  4. Samples come in sync/async pairs — review the sync version, then verify the async version follows the same pattern.
  5. Test files mirror samples — each sample has a corresponding recorded test. Verify test assertions are meaningful.

azure-sdk and others added 30 commits December 3, 2025 22:19
… from .NET SDK

- Sample00_ConfigureDefaults: Demonstrates configuration management (get/update defaults)
- Sample01_AnalyzeBinary: Binary PDF analysis from local file
- Sample02_AnalyzeUrl: Analyze documents from URL
- Sample03_AnalyzeInvoice: Extract structured invoice fields with nested objects and arrays
- Sample04_CreateAnalyzer: Create and use custom analyzer with field schema (Extract/Generate/Classify methods)

Key features:
- All samples use DefaultAzureCredentialBuilder for authentication
- Environment variable based configuration (ENDPOINT)
- Comprehensive JUnit 5 tests with assertions
- GitHub public URLs for test data
- Proper field access patterns with type casting (ContentField, StringField, NumberField, ObjectField, ArrayField)
- All tests passing (6/6 = 100% success rate)

Technical implementation:
- Fixed API differences from C# SDK (ContentSpan, ContentField, 5-parameter beginAnalyze)
- Proper null checking and type casting for all field access
- Detailed validation assertions for all document properties
- Clean resource management with @AfterEach cleanup

Module-info.java formatting cleanup included.
- Sample05_CreateClassifier: Create classifier analyzer with multiple classification fields (document_type, industry, urgency)
- Sample06_GetAnalyzer: Get analyzer information including configuration and field schema

Key features:
- Sample05: Demonstrates classification-only analyzer with 3 classifiers
- Sample06: Shows how to retrieve and inspect analyzer properties including prebuilt analyzers
- Fixed API usage: getAnalyzerId(), getCreatedAt(), getLastModifiedAt() instead of getId(), getCreatedDateTime(), getUpdatedDateTime()
- Comprehensive field schema inspection with all 31 prebuilt-invoice fields
- All tests passing with real Azure service
- Sample07_ListAnalyzers: List and filter all available analyzers (prebuilt and custom)
  * testListAnalyzersAsync: Lists all 134 analyzers (87 prebuilt, 47 custom)
  * testListReadyAnalyzersAsync: Filters for ready analyzers only
- Sample08_UpdateAnalyzer: Update existing analyzer properties
  * Demonstrates updating description, configuration, and field schema
  * Uses @BeforeEach to create test analyzer and @AfterEach for cleanup
  * Shows how to add new fields while preserving existing ones

All tests passing with real Azure service
Fixed Sample08_UpdateAnalyzer to avoid 409 conflict error:
- Delete existing analyzer before recreating with updated configuration
- Added note about using updateAnalyzerWithResponse for atomic updates in production
- All 12 tests now passing (Sample00-08 with multiple test methods)

Test results: 12/12 passed (100% success rate)
…iables and Improve Test Patterns

- Updated environment variable names from "ENDPOINT" and "CONTENTUNDERSTANDING_API_KEY" to "CONTENTUNDERSTANDING_ENDPOINT" and "AZURE_CONTENT_UNDERSTANDING_KEY" across multiple sample test files.
- Modified sample tests to load local files instead of using publicly accessible URLs for document analysis.
- Enhanced assertions and logging for better clarity and debugging.
- Improved API usage patterns in tests for creating, copying, and deleting analyzers, including async patterns.
- Added model mappings for analyzers in relevant samples to demonstrate configuration capabilities.
…e validation of source and copied analyzers
… Azure Credential Authentication

- Updated Sample03_AnalyzeInvoice, Sample04_CreateAnalyzer, Sample05_CreateClassifier, Sample06_GetAnalyzer, Sample07_ListAnalyzers, Sample08_UpdateAnalyzer, Sample09_DeleteAnalyzer, Sample10_AnalyzeConfigs, Sample11_AnalyzeReturnRawJson, Sample12_GetResultFile, Sample13_DeleteResult, Sample14_CopyAnalyzer, Sample15_GrantCopyAuth, and Sample16_CreateAnalyzerWithLabels to include logic for initializing the Content Understanding client with either an API key or the Default Azure Credential.
- Added assertions to verify client initialization in each sample.
- Improved code readability and maintainability by consolidating client creation logic.
- Sample12_GetResultFile: Demonstrates how to retrieve keyframe images from video analysis operations.
- Sample13_DeleteResult: Shows how to delete analysis results after they are no longer needed.
- Sample14_CopyAnalyzer: Illustrates how to copy an analyzer within the same resource.
- Sample15_GrantCopyAuth: Demonstrates granting copy authorization for cross-resource analyzer copying.
- Sample16_CreateAnalyzerWithLabels: Shows how to create an analyzer with labeled training data from Azure Blob Storage.
- Delete 13 @Disabled test files (replaced by Sample tests)
- Modify Sample00-Sample16 to extend ContentUnderstandingClientTestBase
- Add testResourceNamer for reproducible random IDs in PLAYBACK mode
- Remove problematic sanitizers (AZSDK2003, AZSDK2030, AZSDK3423, AZSDK3430, AZSDK3493)
- Configure maven-surefire-plugin to include Sample*.java
- Use AZURE_CONTENT_UNDERSTANDING_ENDPOINT env var (matches .NET naming)
Exclude src/samples/.../samples/Sample*.java standalone examples from test execution.
- Fixed URI mismatch issue where recorded URLs had double slashes (//contentunderstanding)
- Updated assets.json to point to new recordings tag (3de1635cfc)
- All 23 tests pass in PLAYBACK mode
yungshinlintw and others added 2 commits February 27, 2026 18:08
- Changed the visibility of the getSource method in ContentField to package-private and updated its implementation to use the new getSources method.
- Updated ContentSource to rename parseSource to parseAll for clarity and adjusted related documentation.
- Modified sample and test files to reflect the new source handling, demonstrating the use of getSources for typed access to content sources.
- Enhanced sample outputs to provide detailed information about document sources, including page numbers and bounding boxes.
@yungshinlintw yungshinlintw enabled auto-merge (squash) February 28, 2026 00:57

@weidongxu-microsoft weidongxu-microsoft left a comment

Typically, lib would be first released as a preview/beta? <-- as long as Arch is fine with 1.0.0

…t operation Id

- Removed the operationId field and its associated helper class, simplifying the model.
- Updated PollingUtils and polling strategies to eliminate the need for operationId extraction.
- Adjusted sample and test files to reflect changes in accessing operation ID using the getId() method instead of getOperationId().
- Enhanced documentation to clarify the new approach for retrieving operation IDs.
…tructor and updateDefaults methods

- Made the ContentUnderstandingDefaults constructor public to allow the creation of instances in updateDefaults methods.
- Added convenience methods for updateDefaults that accept typed objects instead of BinaryData, addressing limitations of the Java emitter.
- Updated documentation to clarify the behavior of the new methods and the rationale behind the changes.

@weidongxu-microsoft weidongxu-microsoft left a comment

One may want to add a README.md in src/samples folder. E.g. https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/storage/azure-storage-blob/src/samples/README.md

This would help publish these samples to Azure Sample Browser.

This is likely not required though.

yungshinlintw and others added 8 commits February 28, 2026 07:39
Replace 'import static org.junit.jupiter.api.Assertions.*' with
explicit imports (assertEquals, assertNotNull, assertTrue, etc.)
across all 37 test and sample test files per checkstyle requirements.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove Dtest, Dsurefire, dotenv, DAZURE, pytest - none are used in this project.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add src/samples/README.md following the Azure SDK for Java convention
to enable publishing samples to the Azure Sample Browser.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>