Skip to content

pool: add firefly onStart marker for MoverProtocol-based transfers#8055

Open
ShawnMcKee wants to merge 1 commit intodCache:masterfrom
ShawnMcKee:fix/firefly-tpc-onstart-marker
Open

pool: add firefly onStart marker for MoverProtocol-based transfers#8055
ShawnMcKee wants to merge 1 commit intodCache:masterfrom
ShawnMcKee:fix/firefly-tpc-onstart-marker

Conversation

@ShawnMcKee
Copy link
Copy Markdown
Contributor

Problem

AbstractMoverProtocolTransferService, the base class for MoverProtocol-based transfer services (used by RemoteHttpTransferService for FTS third-party copy movers), never calls transferLifeCycle.onStart(). This means firefly flow-start UDP markers are never emitted for TPC movers, even though:

  • DefaultPostTransferService correctly sends onEnd() markers (using the deriveLocalEndpoint() fallback added in 11.2.3)
  • NettyTransferService (direct client connections via xrootd/HTTP) correctly calls onStart() in its executeMover() method

Impact

ESnet Stardust dashboards show incomplete flow information for TPC transfers. Specifically:

  • TPC onEnd markers are present with correct experiment/activity IDs from transfer tags (e.g., activityId=11, 15, 16 for ATLAS categories)
  • Zero TPC onStart markers are emitted from pool nodes for TPC transfers
  • Direct-read onStart markers (from NettyTransferService) are present but use FQAN fallback (activityId=1) since worker nodes don't send SciTag headers

Evidence from ATLAS AGLT2 pool logs:

# 194 TPC end markers with correct transfer-tag classifier
scitags event=marker state=end protocol=remotehttpsdatatransfer ... transferTag=143 experimentId=2 activityId=15 classifier=transfer-tag

# 0 TPC start markers
grep -c 'protocol=remote.*state=start' → 0

# 141 direct-read start markers, all FQAN/activityId=1
scitags event=marker state=start protocol=https ... experimentId=2 activityId=1 classifier=fqan

Fix

Add a startTransferLifeCycle() call in MoverTask.run() (before I/O begins) that:

  1. Checks _transferLifeCycle is configured and protocol is IP-based
  2. Derives the local endpoint from the remote address using NetworkUtils.getLocalAddress() — the same approach used by DefaultPostTransferService.deriveLocalEndpoint() for onEnd
  3. Calls transferLifeCycle.onStart() with the remote endpoint, derived local endpoint, protocol info, and subject
  4. Handles SocketException gracefully with debug logging

This mirrors the pattern already established in NettyTransferService.executeMover() for direct connections, extending it to all MoverProtocol-based transfers.

Testing

  • Compilation verified (mvn -pl modules/dcache -am compile -DskipTests)
  • Consistent with existing patterns in NettyTransferService and DefaultPostTransferService

Signed-off-by: Shawn McKee smckee@umich.edu

Copy link
Copy Markdown
Member

@kofemann kofemann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that suggested change is wrong. It might preserve the pool IP, but (a) uses the wrong port and (b) will work incorrectly on multi-home hosts.

I would suggest updating the RemoteHttpTransferService and passing the TransferLifeCycle to the mover when it is created by the createMoverProtocol method. The RemoteHttpDataTransferProtocol#doGet then should call onStart as soon as the local endpoint is defined.

Move the firefly flow-start marker emission from
AbstractMoverProtocolTransferService.MoverTask into
RemoteHttpDataTransferProtocol, where the actual HTTP connection's
local socket address (correct IP and port) is available.

Previously, the start marker was emitted in MoverTask.run() before
the HTTP connection was established, using NetworkUtils.getLocalAddress()
to derive the local endpoint. This produced the wrong port (0) and
could select the wrong interface on multi-homed hosts.

Now, RemoteHttpTransferService passes the TransferLifeCycle to
RemoteHttpDataTransferProtocol at construction time and sets the
Subject via the overridden createMover(). The protocol calls
onStart() in doGet() and sendFile() immediately after capturing the
local endpoint from HttpInetConnection, which provides the real
bound address and port.

Signed-off-by: Shawn McKee <smckee@umich.edu>
@ShawnMcKee ShawnMcKee force-pushed the fix/firefly-tpc-onstart-marker branch from 21f3ab9 to 873c491 Compare April 8, 2026 15:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants