
Fix trace-file accumulation and Data Retention job failure (#972) #973

Merged
erikdarlingdata merged 1 commit into dev from feature/972-trace-file-cleanup
May 21, 2026


@erikdarlingdata
Owner

Problem

The trace-file cleanup added in v2.11.0 (#951) failed the PerformanceMonitor - Data Retention Agent job on every run with Msg 22049 once any Monitor_LongQueries_*.trc files existed.

Root causes found while investigating #972:

  1. xp_delete_file cannot delete .trc files at all: it only accepts SQL Server backup files and Maintenance Plan report files, and validates the file header. The #951 cleanup ([BUG] Trace files never get cleaned up) could never have worked.
  2. The call passed a wildcard path as the folder argument, which raised an uncatchable Msg 22049 (extended-proc errors bypass TRY...CATCH) that failed the whole Agent job step.
  3. The trace files accumulate because scheduled_master_collector issued RESTART every cycle (tearing down the trace and spawning a fresh timestamped one), and the trace was created with no rollover file-count cap.

Changes

  • config.data_retention — removed the broken xp_delete_file block (the crash).
  • collect.trace_management_collector — new @max_files parameter (default 5) → sp_trace_create @filecount, so SQL Server prunes old .trc files itself as the trace rolls. START now also replaces an unbounded trace left by an older version, so the fix self-heals without waiting for a SQL Server restart.
  • scheduled_master_collector — calls START instead of RESTART; keeps one bounded trace running instead of orphaning files every cycle.
  • tools/Remove-OrphanedTraceFiles.ps1 — new one-time cleanup for .trc files left on disk by versions ≤ 2.11.0; referenced from the README troubleshooting section.
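
As a rough illustration of the @max_files → @filecount change, the sketch below creates a trace with rollover enabled and a file-count cap. It is not the code from collect.trace_management_collector: the file path, the 50 MB file size, and the variable names are assumptions made for the example, and the event/column setup (sp_trace_setevent) and start (sp_trace_setstatus) steps are left out.

```sql
/*
Minimal sketch, not the collector's actual code.
Assumptions: the path and the 50 MB size are hypothetical;
the SQL Server service account must be able to write to the folder.
*/
DECLARE
    @trace_id integer,
    @return_code integer,
    @trace_file nvarchar(245) = N'C:\Temp\Monitor_LongQueries_example', /* hypothetical; .trc is appended automatically */
    @max_file_size bigint = 50, /* MB per rollover file (assumed value) */
    @max_files integer = 5;     /* the new default described above; must be greater than 1 */

EXECUTE @return_code = sp_trace_create
    @trace_id OUTPUT,
    2,              /* @options: 2 = TRACE_FILE_ROLLOVER, required for @filecount */
    @trace_file,    /* @tracefile */
    @max_file_size, /* @maxfilesize */
    NULL,           /* @stoptime: no scheduled stop */
    @max_files;     /* @filecount: oldest file is removed when a new one would exceed the cap */

/* A return code of 0 means the trace definition was created (in a stopped state). */
SELECT
    return_code = @return_code,
    trace_id = @trace_id;

/* To discard the example definition afterwards:
EXECUTE sp_trace_setstatus @trace_id, 2;
*/
```

The parameters are passed positionally with exactly typed variables because the trace procedures are strict about argument data types.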

No version bump (release-time step). CHANGELOG.md updated under [Unreleased].

Test plan

Tested live on SQL Server 2016 and 2019:

  • Collector: @max_files = 1 → validation error; START → trace created with max_files = 5 and rollover on; START against an unbounded trace → replaced with a bounded one; START against a bounded trace → idempotent no-op; RESTART / STATUS / STOP all work.
  • config.data_retention runs SUCCESS — Cleaned 51 tables, no Msg 22049.
  • Remove-OrphanedTraceFiles.ps1: -WhatIf preview + real run deleted 181 (2019) / 280 (2016) orphaned files, correctly skipping the running trace's file and locked files.
  • @filecount rollover-delete behavior verified against MS Learn sp_trace_create docs.
  • All three existing callers of trace_management_collector use named parameters, so the new parameter is safe.
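
One way to spot-check the "bounded vs. unbounded" state described in the test plan is to look at sys.traces. This is a sketch rather than the check the collector itself performs; the LIKE pattern is an assumption based on the Monitor_LongQueries_*.trc file names mentioned above.

```sql
/*
Sketch only: lists non-default traces whose file name matches the
Monitor_LongQueries_ prefix (pattern assumed from the file names in this PR).
*/
SELECT
    t.id,
    t.path,        /* file the trace is currently writing to */
    t.status,      /* 0 = stopped, 1 = running */
    t.is_rollover, /* 1 when the trace was created with the rollover option */
    t.max_files,   /* NULL when no rollover file-count cap was set */
    t.max_size     /* maximum size per file, in MB */
FROM sys.traces AS t
WHERE t.is_default = 0
AND   t.path LIKE N'%Monitor_LongQueries[_]%';
```

Under these assumptions, a trace left by an older version would be expected to show max_files as NULL, and a trace created by the updated collector would show is_rollover = 1 and max_files = 5.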

🤖 Generated with Claude Code

xp_delete_file cannot delete SQL Trace (.trc) files - it only accepts
backup files and Maintenance Plan reports and validates the header - so
the #951 trace cleanup in config.data_retention never worked, and its
malformed wildcard path raised an uncatchable Msg 22049 that failed the
Data Retention Agent job on every run.

- Remove the broken xp_delete_file block from config.data_retention.
- collect.trace_management_collector now creates the trace with a
  rollover file-count cap (@filecount, via the new @max_files param),
  so SQL Server prunes old .trc files itself. START also replaces an
  unbounded trace left by an older version, so the fix self-heals
  without waiting for a SQL Server restart.
- scheduled_master_collector calls START instead of RESTART, so it no
  longer tears the trace down and orphans its files every cycle.
- Add tools/Remove-OrphanedTraceFiles.ps1 to sweep trace files left on
  disk by versions <= 2.11.0; document it in the README troubleshooting
  section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@erikdarlingdata erikdarlingdata merged commit 0fb8fec into dev May 21, 2026
6 checks passed
@erikdarlingdata erikdarlingdata deleted the feature/972-trace-file-cleanup branch May 21, 2026 14:15
@mike-hodgson-icon

I've been doing some troubleshooting today and was about to add similar comments (to yours) to #951 (but you beat me to it on #972). Sorry about suggesting xp_delete_file; I didn't know it was hobbled to only work with SQL backup files (you learn something new every day).

I did upgrade the dashboard (full) to 2.11.0 a few days ago and then used the dashboard to upgrade a few of my targeted SQL servers (two SQL 2019 and two SQL 2022 servers) but none of them got the new version of the config.data_retention proc. The PerformanceMonitor database on each of them still has the older version of the proc with the xp_delete_file code.

@erikdarlingdata
Owner Author

Quick heads-up on the workflow: the Dashboard binary update (via Velopack auto-update, or a full re-install of the Dashboard app) is separate from the script update that runs against each monitored SQL Server. Velopack only swaps the .exe/.dll files on your workstation — it does not re-run the install scripts against your targets. To push the new procs out to a server you have to trigger the per-server upgrade explicitly (Manage Servers → pick the server → Install/Upgrade).

That likely explains what you're seeing: the Dashboard binary is current but the procs on your instances are still whatever the script workflow last installed.

A few things that would help pin it down:

  1. What does the Dashboard report as its version (title bar / Help → About): 2.11.0 or something like 2.11.0-nightly.YYYYMMDD? The #973 fix is currently only in dev / nightly, not in the v2.11.0 release tag, so even a correctly script-upgraded server on v2.11.0 will still have the xp_delete_file block.
  2. When you "used the dashboard to upgrade", did you open Manage Servers and click Install/Upgrade on each of the four targets, or was it just the Dashboard app self-updating?
  3. On one of the affected servers, can you run this and share the top few rows?
    SELECT TOP 5
        installer_version,
        installer_info_version,
        installation_date,
        installation_type,
        installation_status
    FROM PerformanceMonitor.config.installation_history
    ORDER BY installation_date DESC;
    That will show exactly which installer version last touched the database and when.

@mike-hodgson-icon

Sorry, I didn't explain myself very well. When I said I "used the dashboard to upgrade", what I meant was that I did: Manage Servers | picked a server | Check for Updates | Upgrade Now.
image

The dashboard binaries that I have installed are v2.11.0 (not a nightly revision). That will be why config.data_retention is still the old version (with the #951 code). My bad. I'll just wait for v2.12.0 (or perhaps grab a nightly build).

For completeness, this is the recent install history on one of the affected SQL instances:
image

Coincidentally, v2.11.0 was the first time I used the PerformanceMonitorDashboard-dashboard-Setup.exe to install the dashboard (previously I just unzipped directly from PerformanceMonitorDashboard-<version>.zip and ran the exe straight from the unzipped dir). I briefly wondered if PerformanceMonitorDashboard-dashboard-Setup.exe was missing the new code, until I read your explanation above. I was also considering using PerformanceMonitorInstaller-2.11.0 to upgrade the procs on the SQL instances, but I checked the 43_data_retention.sql script and the upgrade scripts and could quickly see that wouldn't work.

All good, I'll just be patient. 🙂
Thanks for taking the time to reply (and for all the work you're putting into this tool - it really is excellent).

@erikdarlingdata
Owner Author

No apology needed at all — your diagnosis is spot on. The per-server upgrade (Manage Servers → Check for Updates → Upgrade Now) did exactly what it should have; the only reason config.data_retention is still the old version is that the #973 fix is currently sitting in dev/nightly and hasn't made it into a tagged release yet. A correctly-upgraded server on v2.11.0 will still have the xp_delete_file block until v2.12.0 ships.

Your instincts were right on both counts, too: PerformanceMonitorDashboard-dashboard-Setup.exe isn't missing anything (it's just the Velopack-friendly installer wrapper around the same binaries), and the PerformanceMonitorInstaller-2.11.0 package wouldn't have helped either — the fixed 43_data_retention.sql and the new upgrade script only exist on dev right now.

So waiting for v2.12.0 is the cleanest path, but if the failing Data Retention job is noisy in the meantime, grabbing a nightly and re-running the per-server upgrade will get you the fix immediately. The tools/Remove-OrphanedTraceFiles.ps1 script in the PR will also clean up the .trc files already on disk.

Thanks for the careful investigation and the kind words — genuinely appreciated. 🙂
