Skip to content

Fix CloudWatch remote logging for ephemeral lifecycle executor#68779

Open
jason810496 wants to merge 2 commits into
apache:mainfrom
jason810496:fix/cloudwatch/remote-logging-k8s-executor
Open

Fix CloudWatch remote logging for ephemeral lifecycle executor#68779
jason810496 wants to merge 2 commits into
apache:mainfrom
jason810496:fix/cloudwatch/remote-logging-k8s-executor

Conversation

@jason810496

@jason810496 jason810496 commented Jun 20, 2026

Copy link
Copy Markdown
Member

Why

While trying to setup cloudwatch remote logging in #68709 in order to persist the logs in real time. I encounter the same errors as above listed issues.

The root cause I found is same as #66475 (comment) pointed out. The configure_logging -> dictConfig -> _clearExistingHandlers call chain shutdown the watchtower handler.

How

I went to the another than #66633, instead of configuring the processors after the dictConfig call. We could make the cloudwatch remote logger itself self-healing by creating the fresh instance if previous instance was shutdown by the dictConfig call but also ensure the .close semantic by guarding with the _close state.

What

Fix the lifecycle issue of cloudwatch remote logging and verify with breeze k8s system test with provider only changes without touching the Task-SDK changes.

Screenshot 2026-06-19 at 11 12 16 PM Screenshot 2026-06-19 at 11 12 46 PM

@jason810496

jason810496 commented Jun 20, 2026

Copy link
Copy Markdown
Member Author

cc @sarvesh371, @seanghaeli Could you verify this patch for your setup when you have a moment? Since it likely #66633 won't catch the 3.3 release (we're close to dev freeze for 3.3), so we might release this provider-only patch first. Thanks.

The streaming CloudWatch handler is rebuilt whenever it reports
shutting_down, so logs survive configure_logging() closing it. But
shutting_down alone cannot tell a mid-task close apart from genuine
teardown, so a record arriving after teardown would spin up an orphan
handler and its background queue thread that nobody flushes or closes.

The supervisor lifecycle makes the two cases distinguishable in time:

  1. configure_logging() builds the handler via remote.processors
     (processors does `_ = self.handler`), registering it in
     logging._handlerList.
  2. The same call then runs dictConfig, whose non-incremental reset
     closes that handler -> watchtower sets shutting_down=True.
  3. Child log records stream through proc -> self.handler, which sees
     shutting_down and rebuilds. This is the case we must keep working.
  4. At the last possible moment _upload_logs() -> upload() -> close()
     flushes; nothing logs after this.

shutting_down is watchtower's flag set by dictConfig (step 2); the new
_closed flag is ours, set only by close() (step 4). dictConfig never
touches _closed, so the rebuild in step 3 still fires, while a late
record after step 4 keeps the closed handler instead of orphaning a new
one. close() on the outer CloudwatchTaskHandler now closes the handler
the IO is currently using rather than the reference captured in
set_context(), which dictConfig may have closed and the IO since rebuilt.
@jason810496 jason810496 force-pushed the fix/cloudwatch/remote-logging-k8s-executor branch from a92793f to f28925a Compare June 20, 2026 06:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cloudwatch remote logging does not work for ECS Executor Cloudwatch logging broken in Airflow 3.1

1 participant