[lake/hudi] Introduce Hudi source split planner #3456

Open
fhan688 wants to merge 5 commits into apache:main from fhan688:Introduce-Hudi-source-split-planner

fhan688 (Contributor) commented Jun 8, 2026

Purpose

Linked issue: #3276

This PR introduces the first planner-only Hudi lake source implementation for Fluss.

Currently, the Hudi lake storage module supports basic Hudi catalog integration, but
HudiLakeStorage#createLakeSource does not return a usable lake source. As a result, Fluss cannot
plan readable lake splits for data already stored in Hudi.

This change adds Hudi source split planning based on a completed Hudi instant. Record reading, limit
pushdown, and Hudi tiering writer support remain explicitly unsupported and can be implemented in
follow-up PRs.

Brief change log

  • Add HudiLakeSource, HudiSplit, HudiSplitPlanner, and HudiSplitSerializer.
  • Wire HudiLakeStorage#createLakeSource to return a Hudi lake source.
  • Add HudiTableInfo to resolve Hudi catalog table metadata, meta client, completed timeline,
    filesystem view, table type, partition fields, and bucket-aware metadata.
  • Plan Hudi splits from the requested snapshot instant after validating that the instant exists in
    the completed Hudi timeline.
  • Support split planning for:
    • COW tables through latest base files before or on the requested instant.
    • MOR tables through latest merged file slices before or on the requested instant.
  • Persist Fluss bucket and partition metadata into Hudi table properties for planner-side recovery.
  • Return bucket -1 for bucket-unaware Fluss log tables.
  • Add and adjust UT coverage for:
    • Hudi lake source planner/serializer wiring.
    • Explicit unsupported limit and record reader behavior.
    • Hudi split serialization version handling.
    • Hudi split planner planning and missing instant behavior.
    • Fluss bucket and partition metadata persisted in Hudi table properties.
    • Partition value extraction from Hudi partition paths.
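
The COW planning rule in the change log (validate the requested instant, then take the latest base file per file group at or before it) can be sketched in plain Java. This is a minimal illustration and not the code in this PR: `CowPlanningSketch`, `BaseFile`, and both method names are hypothetical stand-ins, and it assumes instants are fixed-width timestamp strings so that string order matches time order.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: plain-Java stand-ins, not Hudi or Fluss classes. */
final class CowPlanningSketch {

    /** Hypothetical stand-in for a base file in a Hudi file group. */
    static final class BaseFile {
        final String fileGroupId;
        final String commitInstant;
        final String path;

        BaseFile(String fileGroupId, String commitInstant, String path) {
            this.fileGroupId = fileGroupId;
            this.commitInstant = commitInstant;
            this.path = path;
        }
    }

    /** Fails fast when the requested instant is not in the completed timeline. */
    static void requireCompleted(Collection<String> completedInstants, String requested) {
        if (!completedInstants.contains(requested)) {
            throw new IllegalArgumentException(
                    "Instant " + requested + " is not in the completed timeline");
        }
    }

    /** Per file group, keeps the newest completed base file at or before the requested instant. */
    static List<BaseFile> latestBaseFilesBeforeOrOn(
            Collection<String> completedInstants, Collection<BaseFile> files, String requested) {
        requireCompleted(completedInstants, requested);
        // TreeMap gives a deterministic order (by file group id) for the planned splits.
        Map<String, BaseFile> latest = new TreeMap<>();
        for (BaseFile file : files) {
            // Skip files written by instants that are not completed.
            if (!completedInstants.contains(file.commitInstant)) {
                continue;
            }
            // Assumption: fixed-width instant strings, so compareTo reflects time order.
            if (file.commitInstant.compareTo(requested) > 0) {
                continue;
            }
            latest.merge(
                    file.fileGroupId,
                    file,
                    (a, b) -> a.commitInstant.compareTo(b.commitInstant) >= 0 ? a : b);
        }
        return new ArrayList<>(latest.values());
    }
}
```

For MOR tables the same selection would apply to merged file slices (a base file plus its log files) rather than to base files alone; that case is not modeled in this sketch.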

Tests

  • mvn -pl fluss-lake/fluss-lake-hudi -am -DskipITs -Dcheckstyle.skip=true -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false -Dtest=HudiLakeSourceTest,HudiSplitSerializerTest,HudiTableInfoTest,HudiConversionsTest,HudiSplitPlannerTest test
  • mvn -pl fluss-lake/fluss-lake-hudi -am -DskipITs -Dcheckstyle.skip=true -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false test

The Hudi module test run passed with 54 tests and 0 failures.

API and Format

This PR does not change public Fluss APIs or Fluss storage format.

It adds internal Hudi table properties to preserve Fluss metadata for Hudi source split planning:

  • fluss.bucket.keys
  • fluss.bucket-aware
  • fluss.partition.keys
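
As a rough illustration of how these three properties could round-trip, the sketch below writes and reads them through a plain `Map<String, String>`. Only the property keys come from this PR; the class name, the comma-separated encoding of key lists, the `"true"`/`"false"` encoding of the flag, and the default of `false` when the flag is absent are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: value encodings are assumed, not taken from the PR. */
final class FlussHudiPropertiesSketch {
    static final String BUCKET_KEYS = "fluss.bucket.keys";
    static final String BUCKET_AWARE = "fluss.bucket-aware";
    static final String PARTITION_KEYS = "fluss.partition.keys";

    /** Encodes Fluss metadata as table properties (assumed comma-separated lists). */
    static Map<String, String> toProperties(
            List<String> bucketKeys, boolean bucketAware, List<String> partitionKeys) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put(BUCKET_KEYS, String.join(",", bucketKeys));
        props.put(BUCKET_AWARE, Boolean.toString(bucketAware));
        props.put(PARTITION_KEYS, String.join(",", partitionKeys));
        return props;
    }

    /** Reads a key list back; a missing or empty property yields an empty list. */
    static List<String> readKeys(Map<String, String> props, String key) {
        String value = props.get(key);
        if (value == null || value.isEmpty()) {
            return List.of();
        }
        return Arrays.asList(value.split(","));
    }

    /** Assumption: a table without the flag is treated as bucket-unaware. */
    static boolean isBucketAware(Map<String, String> props) {
        return Boolean.parseBoolean(props.get(BUCKET_AWARE));
    }
}
```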

Documentation

No user-facing documentation is added in this PR because this is a planner-only foundation. Hudi
record reading and Hudi tiering writer support are still not exposed as completed user workflows.

Copilot AI (Contributor) left a comment

Pull request overview

This PR adds an initial (planner-only) Hudi lake source implementation to Fluss, enabling split planning against a completed Hudi instant so that Fluss can plan readable lake splits for data stored in Hudi.

Changes:

  • Introduce Hudi lake source components for split planning (HudiLakeSource, HudiSplit, HudiSplitPlanner, HudiSplitSerializer) and wire them into HudiLakeStorage#createLakeSource.
  • Add HudiTableInfo to resolve Hudi catalog/table metadata, timeline, filesystem view, partition fields, and bucket-awareness metadata.
  • Add unit tests covering planner wiring, serialization version handling, split planning success/failure cases, and Fluss metadata persistence into Hudi table properties.
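
The "serialization version handling" mentioned above usually means that the deserializer is told which version wrote the bytes and rejects versions it does not know. A minimal sketch of that pattern follows; it is not the `HudiSplitSerializer` from this PR. The `Split` fields, the wire layout, the version number, and the choice of `IOException` for unknown versions are all assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

/** Illustrative only: a versioned serializer with a made-up split layout. */
final class VersionedSplitSerializerSketch {
    static final int CURRENT_VERSION = 1;

    /** Hypothetical split: a file path plus a bucket id (-1 for bucket-unaware tables). */
    static final class Split {
        final String filePath;
        final int bucket;

        Split(String filePath, int bucket) {
            this.filePath = filePath;
            this.bucket = bucket;
        }
    }

    /** Writes the split fields in a fixed order for CURRENT_VERSION. */
    static byte[] serialize(Split split) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(split.filePath);
            out.writeInt(split.bucket);
        }
        return bytes.toByteArray();
    }

    /** Reads the fields back, rejecting any version this sketch does not understand. */
    static Split deserialize(int version, byte[] data) throws IOException {
        if (version != CURRENT_VERSION) {
            throw new IOException("Unsupported split version: " + version);
        }
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String filePath = in.readUTF();
            int bucket = in.readInt();
            return new Split(filePath, bucket);
        }
    }
}
```

Keeping the version outside the payload lets a later format add fields under a new version number while old readers fail loudly instead of misreading bytes.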

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| fluss-lake/fluss-lake-hudi/src/main/java/org/apache/fluss/lake/hudi/HudiLakeStorage.java | Returns a real Hudi lake source instead of throwing unsupported. |
| fluss-lake/fluss-lake-hudi/src/main/java/org/apache/fluss/lake/hudi/source/HudiLakeSource.java | New planner-only Hudi LakeSource implementation (filters/limit/reader unsupported). |
| fluss-lake/fluss-lake-hudi/src/main/java/org/apache/fluss/lake/hudi/source/HudiSplit.java | New LakeSplit implementation for Hudi file slices + bucket/partition metadata. |
| fluss-lake/fluss-lake-hudi/src/main/java/org/apache/fluss/lake/hudi/source/HudiSplitPlanner.java | Plans splits for COW/MOR tables at a requested completed instant. |
| fluss-lake/fluss-lake-hudi/src/main/java/org/apache/fluss/lake/hudi/source/HudiSplitSerializer.java | Serializer for HudiSplit with explicit version handling. |
| fluss-lake/fluss-lake-hudi/src/main/java/org/apache/fluss/lake/hudi/utils/HudiConversions.java | Persists Fluss bucket/partition metadata into Hudi table properties. |
| fluss-lake/fluss-lake-hudi/src/main/java/org/apache/fluss/lake/hudi/utils/HudiTableInfo.java | New resolver for Hudi catalog/table options, meta client, timeline/view, partitions, bucket-awareness. |
| fluss-lake/fluss-lake-hudi/src/test/java/org/apache/fluss/lake/hudi/source/HudiLakeSourceTest.java | Tests planner/serializer wiring and explicit unsupported behaviors. |
| fluss-lake/fluss-lake-hudi/src/test/java/org/apache/fluss/lake/hudi/source/HudiSplitPlannerTest.java | Tests split planning for a completed instant and failure when instant missing. |
| fluss-lake/fluss-lake-hudi/src/test/java/org/apache/fluss/lake/hudi/source/HudiSplitSerializerTest.java | Tests round-trip serialization and unknown version rejection. |
| fluss-lake/fluss-lake-hudi/src/test/java/org/apache/fluss/lake/hudi/utils/HudiConversionsTest.java | Tests Fluss metadata persistence into Hudi table properties. |
| fluss-lake/fluss-lake-hudi/src/test/java/org/apache/fluss/lake/hudi/utils/HudiTableInfoTest.java | Tests partition value extraction from Hudi partition paths. |
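
To make "partition value extraction from Hudi partition paths" concrete, here is a small sketch that maps a relative partition path onto the table's partition fields. It is a guess at the general technique, not the logic in `HudiTableInfo`: the class and method names are hypothetical, it accepts both Hive-style segments (`dt=2026-06-08`) and bare value segments (`2026-06-08`), and it does not decode URL-escaped characters.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: maps path segments to partition fields by position. */
final class PartitionPathSketch {

    /** Returns field -> value in field order; strips a "field=" prefix when present. */
    static Map<String, String> extract(String partitionPath, List<String> partitionFields) {
        Map<String, String> values = new LinkedHashMap<>();
        if (partitionFields.isEmpty()) {
            // Non-partitioned table: nothing to extract.
            return values;
        }
        String[] segments = partitionPath.split("/");
        if (segments.length != partitionFields.size()) {
            throw new IllegalArgumentException(
                    "Path '" + partitionPath + "' does not match fields " + partitionFields);
        }
        for (int i = 0; i < segments.length; i++) {
            String field = partitionFields.get(i);
            String prefix = field + "=";
            String segment = segments[i];
            // Hive-style paths carry "field=value"; otherwise the segment is the value itself.
            values.put(field, segment.startsWith(prefix)
                    ? segment.substring(prefix.length())
                    : segment);
        }
        return values;
    }
}
```

With fields `dt` and `region`, both `dt=2026-06-08/region=eu` and `2026-06-08/eu` resolve to the same field-to-value mapping in this sketch.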

luoyuxia (Contributor) commented Jun 9, 2026

@XuQianJin-Stars Could you please help review this one?

fhan688 closed this Jun 9, 2026
fhan688 reopened this Jun 9, 2026

XuQianJin-Stars (Contributor) left a comment

+1
