feat(nemo): enable word-level timestamps for ASR models #10297

Open
fqscfqj wants to merge 2 commits into mudler:master from fqscfqj:fix/nemo-word-timestamps

@fqscfqj fqscfqj commented Jun 13, 2026


Problem

The nemo backend ignores timestamp_granularities and always returns a single segment with start=0 end=0, making word-level timestamps impossible to obtain even though NeMo ASR models (parakeet-tdt, etc.) fully support them via model.transcribe(timestamps=True).

Root Cause

The AudioTranscription method in backend/python/nemo/backend.py never passes timestamps=True to NeMo's transcribe() call, and never extracts the word-level offset data from the Hypothesis.timestamp dict.

Changes

  • _get_stride_seconds(): computes the frame duration as the model's preprocessor.window_stride × encoder.subsampling_factor. When either value is missing it falls back to 0.01 s × 8 = 80 ms, which is typical for parakeet-tdt.
  • _build_segments_with_words(): extracts word offsets from the NeMo Hypothesis.timestamp['word'] dict and converts frame indices to nanosecond timestamps compatible with the TranscriptWord / TranscriptSegment protobuf messages.
  • Supports two granularity modes:
    • "word": one TranscriptSegment per word, each with a single TranscriptWord entry.
    • "segment" (default): merges consecutive words into sentence-level segments, splitting at word-level time gaps that exceed a dynamically computed threshold (median gap × 3, clamped to [0.3s, 2.0s]).
  • Populates TranscriptSegment.words with TranscriptWord entries so callers get both segment-level and word-level timing.
  • Only requests timestamps from NeMo when the caller actually asks for them (timestamp_granularities is non-empty), keeping the fast path unchanged for callers that don't need timestamps.
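The "segment" merging rule described above can be sketched as follows. This is a minimal illustration, not the code from the PR: the helper names gap_threshold and merge_words, the (text, start, end) tuple layout, and the choice to return the lower clamp when there are no gaps are all assumptions made for the example.

```python
from statistics import median


def gap_threshold(gaps, factor=3.0, lo=0.3, hi=2.0):
    """Median gap times `factor`, clamped to [lo, hi] seconds."""
    if not gaps:
        # Assumption: with fewer than two words there is no gap to measure.
        return lo
    return min(max(median(gaps) * factor, lo), hi)


def merge_words(words):
    """Group (text, start_s, end_s) tuples into segments, splitting at large gaps."""
    if not words:
        return []
    # Silence between the end of one word and the start of the next.
    gaps = [max(0.0, nxt[1] - cur[2]) for cur, nxt in zip(words, words[1:])]
    threshold = gap_threshold(gaps)
    groups = [[words[0]]]
    for gap, word in zip(gaps, words[1:]):
        if gap > threshold:
            groups.append([word])  # gap exceeds the threshold: start a new segment
        else:
            groups[-1].append(word)
    return [
        {
            "text": " ".join(w[0] for w in group),
            "start": group[0][1],
            "end": group[-1][2],
            "words": group,
        }
        for group in groups
    ]


# Hypothetical word timings in seconds, with one long pause after "not".
words = [("ask", 0.0, 0.2), ("not", 0.25, 0.5), ("what", 1.5, 1.7), ("your", 1.75, 1.9)]
segments = merge_words(words)
```

With these made-up timings the median gap is about 0.05 s, so the threshold is clamped up to 0.3 s and the 1.0 s pause produces two segments, "ask not" and "what your".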

Testing

Tested with nvidia/parakeet-tdt-0.6b-v3 on the JFK "ask not" clip:

curl -X POST http://localhost:8080/v1/audio/transcriptions \
  -F file=@jfk.wav \
  -F model=nemo-parakeet-tdt-0.6b \
  -F 'timestamp_granularities[]=word' \
  -F response_format=verbose_json

Response (excerpt):

{
  "segments": [
    {"id": 0, "start": 0.24, "end": 0.56, "text": "And", "words": [{"start": 0.24, "end": 0.56, "text": "And"}]},
    {"id": 1, "start": 0.56, "end": 0.88, "text": "so,", "words": [{"start": 0.56, "end": 0.88, "text": "so,"}]},
    ...
  ],
  "text": "And so, my fellow Americans, ask not what your country can do for you, ask what you can do for your country."
}

Each word has correct start/end times in seconds.
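As a sanity check on the arithmetic, the reported times are consistent with an 80 ms frame stride. The sketch below assumes frame offsets of 3, 7 and 11, which are inferred from the times in the excerpt rather than taken from the PR; the helper names are hypothetical.

```python
NANOS_PER_SECOND = 1_000_000_000


def frame_to_nanos(frame_index, stride_seconds):
    """Convert a frame offset to an integer nanosecond timestamp."""
    return round(frame_index * stride_seconds * NANOS_PER_SECOND)


def nanos_to_seconds(nanos):
    """Convert a nanosecond timestamp back to seconds for the JSON response."""
    return nanos / NANOS_PER_SECOND


# window_stride (0.01 s) * subsampling_factor (8), the fallback values named in the PR.
stride = 0.01 * 8

# Assumed frame offsets for the boundaries of "And" and "so," (inferred, not from the PR).
boundaries = [frame_to_nanos(frame, stride) for frame in (3, 7, 11)]
seconds = [nanos_to_seconds(n) for n in boundaries]
```

Rounding to an integer number of nanoseconds before converting back keeps floating-point noise out of the values the caller sees.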


Related: #10012, #10134

Copilot AI review requested due to automatic review settings June 13, 2026 02:47

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds optional timestamped transcription output by requesting NeMo word timestamps and converting them into TranscriptSegment/TranscriptWord responses.

Changes:

  • Add stride estimation helper for converting NeMo frame offsets into seconds/nanoseconds.
  • Add segment/word timestamp post-processing that can either emit per-word segments or merge words into larger segments.
  • Update AudioTranscription to conditionally request timestamps based on request.timestamp_granularities.

Comment on lines +225 to +233
    if want_timestamps:
        # Request timestamps from NeMo
        results = self.model.transcribe([audio_path], timestamps=True, return_hypotheses=False)

        if results and len(results) > 0:
            hypotheses = results[0] if isinstance(results[0], list) else results
            if hypotheses and len(hypotheses) > 0:
                hypothesis = hypotheses[0]
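
One way to make the unwrapping in this snippet easier to test is to pull it into small helpers. The following is a sketch only: first_hypothesis and word_timestamps are hypothetical names, and the SimpleNamespace object is a stand-in for a NeMo Hypothesis whose field names (word, start_offset, end_offset) are assumed for illustration.

```python
from types import SimpleNamespace


def first_hypothesis(results):
    """Return the first hypothesis from a flat or nested result list, else None."""
    if not results:
        return None
    first = results[0]
    if isinstance(first, (list, tuple)):
        return first[0] if first else None
    return first


def word_timestamps(hypothesis):
    """Return the word-level timestamp list, or None if the object carries none."""
    timestamp = getattr(hypothesis, "timestamp", None)
    if not isinstance(timestamp, dict):
        return None  # e.g. a plain string was returned instead of a Hypothesis
    return timestamp.get("word")


# Stand-in for a NeMo Hypothesis; the entry layout here is an assumption.
fake = SimpleNamespace(
    timestamp={"word": [{"word": "And", "start_offset": 3, "end_offset": 7}]}
)
flat = first_hypothesis([fake])
nested = first_hypothesis([[fake]])
```

A caller can treat a None result from word_timestamps as the cue to log a warning and fall back to the untimed path.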

Comment on lines +87 to +99
    def _get_stride_seconds(self):
        """Compute the seconds-per-frame stride for the loaded model.

        stride = preprocessor_window_stride * encoder_subsampling_factor
        """
        try:
            preprocessor = self.model.preprocessor
            window_stride = preprocessor._cfg.get('window_stride', 0.01)
            subsampling_factor = getattr(self.model.encoder, 'subsampling_factor', 8)
            return window_stride * subsampling_factor
        except Exception:
            # Fallback: 80ms per frame (typical for parakeet-tdt)
            return 0.08

The nemo backend ignored timestamp_granularities and always returned a
single segment with start=0 end=0, making word-level timestamps
impossible to obtain even though the NeMo models (parakeet-tdt, etc.)
fully support them.

Changes:
- Add _get_stride_seconds() to compute frame duration from the model's
  preprocessor window_stride and encoder subsampling_factor.
- Add _build_segments_with_words() that extracts word offsets from the
  NeMo Hypothesis.timestamp dict and converts frame indices to
  nanosecond timestamps.
- Support 'word' granularity (one segment per word) and 'segment'
  granularity (merge at time-gap boundaries using a dynamic threshold).
- Populate TranscriptSegment.words with TranscriptWord entries so
  callers get both segment-level and word-level timing.
- Only request timestamps from NeMo when the caller actually asks for
  them (timestamp_granularities is non-empty), keeping the fast path
  unchanged for callers that don't need timestamps.

Tested with nvidia/parakeet-tdt-0.6b-v3 on the JFK "ask not" clip:
  curl -X POST /v1/audio/transcriptions \
    -F file=@jfk.wav -F model=nemo-parakeet-tdt-0.6b \
    -F 'timestamp_granularities[]=word' -F response_format=verbose_json
  → each word has correct start/end times in seconds.

Signed-off-by: fqscfqj <fqscfqj@outlook.com>

fqscfqj force-pushed the fix/nemo-word-timestamps branch from 7c38082 to 7d9689a on June 13, 2026 02:50

- Narrow exception handling in _get_stride_seconds to catch only
  AttributeError, KeyError, TypeError instead of bare Exception, and
  emit a warning when falling back to the hardcoded stride.
- Remove explicit return_hypotheses=False when timestamps are requested;
  timestamps=True already forces NeMo to return Hypothesis objects.
- Add a warning when NeMo does not return Hypothesis objects despite
  timestamps being requested.

Signed-off-by: fqscfqj <fqscfqj@outlook.com>
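
The narrowed exception handling described in this commit might look roughly like the sketch below. It is written as a standalone function over a model argument so it can run without NeMo; the function name, the logger, the warning text, and the SimpleNamespace stand-ins are assumptions, not the code in the PR.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger(__name__)

DEFAULT_STRIDE_SECONDS = 0.08  # 80 ms per frame, the fallback named in the PR


def stride_seconds(model):
    """Return seconds per frame, falling back to the default with a warning."""
    try:
        window_stride = model.preprocessor._cfg.get("window_stride", 0.01)
        subsampling_factor = getattr(model.encoder, "subsampling_factor", 8)
        return window_stride * subsampling_factor
    except (AttributeError, KeyError, TypeError) as exc:
        # Only the expected lookup failures are caught; anything else propagates.
        logger.warning(
            "Could not read stride from model (%s); falling back to %.2fs",
            exc,
            DEFAULT_STRIDE_SECONDS,
        )
        return DEFAULT_STRIDE_SECONDS


# Stand-ins for a loaded model: one complete, one missing its preprocessor.
complete = SimpleNamespace(
    preprocessor=SimpleNamespace(_cfg={"window_stride": 0.01}),
    encoder=SimpleNamespace(subsampling_factor=8),
)
incomplete = SimpleNamespace()

from_config = stride_seconds(complete)
from_fallback = stride_seconds(incomplete)
```

Catching only these three exception types means an unexpected failure, such as a bug elsewhere in the model object, still surfaces instead of being silently replaced by the default stride.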