Skip to content

improve observability over exceptions#1730

Open
arthurpassos wants to merge 1 commit intoantalya-26.1from
export_partition_refactor_exceptions_observability
Open

improve observability over exceptions#1730
arthurpassos wants to merge 1 commit intoantalya-26.1from
export_partition_refactor_exceptions_observability

Conversation

@arthurpassos
Copy link
Copy Markdown
Collaborator

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Observability over exceptions on export partition operations was broken and not reliable. This PR simplifies it by making it a single JSON array instead of multiple paths in zookeeper. Another change is that now system.replicated_partition_exports returns only local information, it does not reach zookeeper.

Documentation entry for user-facing changes

...

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • S3 Export (2h)
  • Swarms (30m)
  • Tiered Storage (2h)

@arthurpassos arthurpassos added antalya port-antalya PRs to be ported to all new Antalya releases labels May 5, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 5, 2026

Workflow [PR], commit [99d816e]

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 99d816eb7f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +377 to +380
if (leaf_exists)
ops.emplace_back(zkutil::makeSetRequest(last_exception_path, entry.toJsonString(), -1));
else
{
LOG_INFO(log, "ExportPartition: Exceptions per replica path does not exist, will create it");
ops.emplace_back(zkutil::makeCreateRequest(exceptions_per_replica_path, "", zkutil::CreateMode::Persistent));
ops.emplace_back(zkutil::makeCreateRequest(count_path, "1", zkutil::CreateMode::Persistent));
ops.emplace_back(zkutil::makeCreateRequest(last_exception_path, "", zkutil::CreateMode::Persistent));
ops.emplace_back(zkutil::makeCreateRequest(last_exception_path / "part", part_name, zkutil::CreateMode::Persistent));
ops.emplace_back(zkutil::makeCreateRequest(last_exception_path / "exception", exception_message, zkutil::CreateMode::Persistent));
}
ops.emplace_back(zkutil::makeCreateRequest(last_exception_path, entry.toJsonString(), zkutil::CreateMode::Persistent));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Create missing last_exception container for legacy exports

For exports that were created before this change, ZooKeeper contains .../exceptions_per_replica but not .../last_exception. In that upgrade scenario, this branch issues Create(.../last_exception/<replica>) without creating the parent path first, so Keeper returns ZNONODE and the surrounding tryMulti fails atomically. Because appendExceptionOps is used inside failure-handling multis (part failure bookkeeping, commit failure retries, and timeout-to-KILLED), those state transitions can be skipped repeatedly for in-flight legacy tasks, leaving them stuck in PENDING during failures.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

antalya port-antalya PRs to be ported to all new Antalya releases

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant