feat: Separate Runtime Statistics Collection from UI Updates#4205
feat: Separate Runtime Statistics Collection from UI Updates#4205kunwp1 wants to merge 10 commits intoapache:mainfrom
Conversation
Xiao-zhen-Liu
left a comment
There was a problem hiding this comment.
LGTM, left minor comments and some questions. Tested and can verify the size changes of the persisted runtime stats by adjusting this new parameter.
...xera/amber/engine/architecture/controller/promisehandlers/QueryWorkerStatisticsHandler.scala
Outdated
Show resolved
Hide resolved
...in/scala/org/apache/texera/amber/engine/architecture/controller/ControllerTimerService.scala
Show resolved
Hide resolved
...xera/amber/engine/architecture/controller/promisehandlers/QueryWorkerStatisticsHandler.scala
Show resolved
Hide resolved
|
@Xiao-zhen-Liu Can you do a one more pass of the review? @chenlica and I discussed the design and decided to keep the two control messages independent (one for UI updates, one for persistence) to avoid coupling their intervals together. To address the concern about increased worker query frequency, I added an optimization in |
Xiao-zhen-Liu
left a comment
There was a problem hiding this comment.
LGTM, thanks for the new changes. Only have a minor clarification about the new behavior.
| if (globalQueryStatsOngoing && msg.filterByWorkers.isEmpty) { | ||
| // A query is already in-flight: serve the last completed query's cached data, | ||
| // or drop silently if no prior query has finished yet. | ||
| if (lastWorkerQueryTimestampNs > 0) forwardStats(msg.updateTarget) |
There was a problem hiding this comment.
I'm trying to understand the behavior of this change. Does this mean only the concurrent requests before the first globalQuery finishes will be dropped, and after the first globalQuery of a workflow finishes, all subsequent concurrent requests of an ongoing globalQuery will be served from cache? (Previously, any concurrent request will be dropped.)
| forwardStats(msg.updateTarget) | ||
| // Record the completion timestamp before releasing the lock so that any timer | ||
| // firing in between sees a valid cache entry rather than triggering a redundant query. | ||
| if (globalQueryStatsOngoing) { |
There was a problem hiding this comment.
Can the completion of filtered requests also trigger this lock-release?
What changes were proposed in this PR?
This PR introduces a new configuration parameter
runtime-statistics-persistence-intervalto independently control the frequency of runtime statistics persistence, separate from the UI update frequency (status-update-interval). Previously, both UI updates and runtime statistics persistence were controlled by a single parameterstatus-update-interval. This means frequent UI updates (e.g., 500ms) caused excessive statistics writes to storage. This change allows independent control:status-update-interval: Controls how often the frontend UI refreshes (default: 500ms)runtime-statistics-persistence-interval: Controls how often statistics are persisted to storage (default: 2000ms)Do two timers mean more frequent worker queries?
No. The controller tracks the timestamp of the last completed full-graph worker query and uses
min(status-update-interval, runtime-statistics-persistence-interval)as a freshness threshold. When a timer fires, if the elapsed time since the last query is within this threshold, the controller forwards stats from cache without querying workers — so the faster timer drives all real worker queries and the slower timer always reuses the result. If a query is already in-flight when the second timer fires, the controller serves stats from the previous completed query's cache. Cache reuse applies to timer-triggered queries only; event-triggered queries (e.g., from worker completion events) always proceed to real worker RPCs.Changes
runtime-statistics-persistence-intervalparameter (default: 2000ms) inapplication.confStatisticsUpdateTargetenum (UI_ONLY,PERSISTENCE_ONLY,BOTH_UI_AND_PERSISTENCE) toQueryStatisticsRequestRuntimeStatisticsPersistevent for statistics-only updates;ExecutionStatsUpdatenow handles UI-only updatesQueryWorkerStatisticsHandlerroutes to the appropriate event based onStatisticsUpdateTargetQueryWorkerStatisticsHandler: when the second timer fires, the controller checks whether worker stats were already fetched recently (withinmin(status-update-interval, runtime-statistics-persistence-interval)). If so, it forwards the cached stats to the appropriate sink (UI or persistence) without issuing any worker RPCs. If a query is already in-flight, cached stats from the previous completed query are forwarded.Any related issues, documentation, discussions?
Closes #4204
How was this PR tested?
Tested with the following workflow and dataset, change the
runtime-statistics-persistence-intervalparameter to see if the runtime stats size reduces if we increase the parameter value.Iris Dataset Analysis.json
Iris.csv
Was this PR authored or co-authored using generative AI tooling?
Generated-by: Claude-4.6