feat(nemo): enable word-level timestamps for ASR models by fqscfqj · Pull Request #10297 · mudler/LocalAI

fqscfqj · 2026-06-13T02:47:14Z

Problem

The nemo backend ignores timestamp_granularities and always returns a single segment with start=0 end=0, making word-level timestamps impossible to obtain even though NeMo ASR models (parakeet-tdt, etc.) fully support them via model.transcribe(timestamps=True).

Root Cause

The AudioTranscription method in backend/python/nemo/backend.py never passes timestamps=True to NeMo's transcribe() call, and never extracts the word-level offset data from the Hypothesis.timestamp dict.
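As a rough illustration of the missing extraction step, the sketch below reads word entries from a hypothesis-like object and converts frame offsets to nanoseconds. The entry keys (`word`, `start_offset`, `end_offset`), the helper name `word_offsets_to_nanos`, and the frame values are assumptions made for this example, not code from the PR; the 80 ms stride is the parakeet-tdt default mentioned in this description.

```python
from types import SimpleNamespace

NANOS_PER_SECOND = 1_000_000_000


def word_offsets_to_nanos(hypothesis, stride_seconds=0.08):
    """Return (text, start_ns, end_ns) tuples for each word entry.

    Assumes hypothesis.timestamp is a dict whose 'word' list holds entries
    with 'word', 'start_offset' and 'end_offset' (offsets in frames).
    """
    timestamp = getattr(hypothesis, "timestamp", None) or {}
    converted = []
    for entry in timestamp.get("word", []):
        start_ns = round(entry["start_offset"] * stride_seconds * NANOS_PER_SECOND)
        end_ns = round(entry["end_offset"] * stride_seconds * NANOS_PER_SECOND)
        converted.append((entry["word"], start_ns, end_ns))
    return converted


# Stand-in for a NeMo Hypothesis; the frame offsets are illustrative only.
fake_hypothesis = SimpleNamespace(
    timestamp={
        "word": [
            {"word": "And", "start_offset": 3, "end_offset": 7},
            {"word": "so,", "start_offset": 7, "end_offset": 11},
        ]
    }
)

words = word_offsets_to_nanos(fake_hypothesis)
```

A hypothesis without a `timestamp` dict yields an empty list here, which would correspond to falling back to the single untimed segment.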

Changes

- _get_stride_seconds(): computes the frame duration as the model's preprocessor.window_stride × encoder.subsampling_factor (defaults to 80 ms for parakeet-tdt).
- _build_segments_with_words(): extracts word offsets from the NeMo Hypothesis.timestamp['word'] dict and converts frame indices to nanosecond timestamps compatible with the TranscriptWord / TranscriptSegment protobuf messages.
- Supports two granularity modes:
  - "word": one TranscriptSegment per word, each with a single TranscriptWord entry.
  - "segment" (default): merges consecutive words into sentence-level segments, splitting at word-level time gaps that exceed a dynamically computed threshold (median gap × 3, clamped to [0.3s, 2.0s]).
- Populates TranscriptSegment.words with TranscriptWord entries so callers get both segment-level and word-level timing.
- Only requests timestamps from NeMo when the caller actually asks for them (timestamp_granularities is non-empty), keeping the fast path unchanged for callers that don't need timestamps.
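The "segment" merging rule can be sketched as follows. This is a minimal illustration of the described threshold (median gap × 3, clamped to [0.3s, 2.0s]), not the PR's actual implementation; the function names, the tuple layout, and the choice to compute the median over all inter-word gaps are assumptions.

```python
from statistics import median


def gap_threshold(gaps, factor=3.0, lower=0.3, upper=2.0):
    """Median gap times `factor`, clamped to [lower, upper] seconds."""
    if not gaps:
        return lower
    return min(max(median(gaps) * factor, lower), upper)


def merge_words(words):
    """Merge (text, start_s, end_s) tuples, sorted by time, into segments.

    A new segment starts whenever the silence before a word exceeds
    the dynamically computed threshold.
    """
    if not words:
        return []
    # Silence between the end of each word and the start of the next one.
    gaps = [max(0.0, nxt[1] - cur[2]) for cur, nxt in zip(words, words[1:])]
    threshold = gap_threshold(gaps)
    groups = [[words[0]]]
    for gap, word in zip(gaps, words[1:]):
        if gap > threshold:
            groups.append([word])
        else:
            groups[-1].append(word)
    return [
        {
            "start": group[0][1],
            "end": group[-1][2],
            "text": " ".join(w[0] for w in group),
            "words": group,
        }
        for group in groups
    ]


# Illustrative timings only: a 1.1 s pause separates the two phrases.
sample = [
    ("ask", 0.0, 0.3),
    ("not", 0.35, 0.6),
    ("what", 0.62, 0.9),
    ("ask", 2.0, 2.3),
    ("what", 2.32, 2.6),
]
segments = merge_words(sample)
```

With these timings the median gap is small, so the threshold clamps to the 0.3 s floor and only the long pause produces a split.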

Testing

Tested with nvidia/parakeet-tdt-0.6b-v3 on the JFK "ask not" clip:

curl -X POST http://localhost:8080/v1/audio/transcriptions \
  -F file=@jfk.wav \
  -F model=nemo-parakeet-tdt-0.6b \
  -F 'timestamp_granularities[]=word' \
  -F response_format=verbose_json

Response (excerpt):

{
  "segments": [
    {"id": 0, "start": 0.24, "end": 0.56, "text": "And", "words": [{"start": 0.24, "end": 0.56, "text": "And"}]},
    {"id": 1, "start": 0.56, "end": 0.88, "text": "so,", "words": [{"start": 0.56, "end": 0.88, "text": "so,"}]},
    ...
  ],
  "text": "And so, my fellow Americans, ask not what your country can do for you, ask what you can do for your country."
}

Each word has correct start/end times in seconds.
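For callers consuming this output, a small sketch of walking the verbose_json structure is shown here. It assumes only the field names visible in the excerpt (`segments`, `words`, `text`, `start`, `end`); the helper name `iter_words` is hypothetical, and the sample contains just the two segments shown in the excerpt, with the elided remainder left out.

```python
import json

# The two segments shown in the excerpt; the rest of the response is omitted.
SAMPLE = """
{
  "segments": [
    {"id": 0, "start": 0.24, "end": 0.56, "text": "And",
     "words": [{"start": 0.24, "end": 0.56, "text": "And"}]},
    {"id": 1, "start": 0.56, "end": 0.88, "text": "so,",
     "words": [{"start": 0.56, "end": 0.88, "text": "so,"}]}
  ]
}
"""


def iter_words(response):
    """Yield (text, start_s, end_s) for every word in every segment."""
    for segment in response.get("segments", []):
        for word in segment.get("words", []):
            yield word["text"], word["start"], word["end"]


timed_words = list(iter_words(json.loads(SAMPLE)))
for text, start, end in timed_words:
    print(f"{start:6.2f} - {end:6.2f}  {text}")
```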

Related: #10012, #10134

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds optional timestamped transcription output by requesting NeMo word timestamps and converting them into TranscriptSegment/TranscriptWord responses.

Changes:

Add stride estimation helper for converting NeMo frame offsets into seconds/nanoseconds.
Add segment/word timestamp post-processing that can either emit per-word segments or merge words into larger segments.
Update AudioTranscription to conditionally request timestamps based on request.timestamp_granularities.
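The conditional request can be pictured with the sketch below. It is only an illustration under assumptions: `resolve_granularity` is a hypothetical helper, the granularities are treated as a plain list of strings, and preferring "word" when both values are present is a guess rather than confirmed PR behavior.

```python
def resolve_granularity(timestamp_granularities):
    """Return None when no timestamps are requested, else 'word' or 'segment'."""
    requested = [g.lower() for g in timestamp_granularities]
    if not requested:
        # Fast path: the caller did not ask for timestamps at all.
        return None
    # Assumption: 'word' wins if present; anything else falls back to 'segment'.
    return "word" if "word" in requested else "segment"


mode = resolve_granularity(["word"])
want_timestamps = mode is not None
```

Keeping the "no timestamps" case as a distinct `None` result makes it easy to leave the existing transcription path untouched.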

+            if want_timestamps:
+                # Request timestamps from NeMo
+                results = self.model.transcribe([audio_path], timestamps=True, return_hypotheses=False)
+
+                if results and len(results) > 0:
+                    hypotheses = results[0] if isinstance(results[0], list) else results
+                    if hypotheses and len(hypotheses) > 0:
+                        hypothesis = hypotheses[0]
+


+    def _get_stride_seconds(self):
+        """Compute the seconds-per-frame stride for the loaded model.
+
+        stride = preprocessor_window_stride * encoder_subsampling_factor
+        """
+        try:
+            preprocessor = self.model.preprocessor
+            window_stride = preprocessor._cfg.get('window_stride', 0.01)
+            subsampling_factor = getattr(self.model.encoder, 'subsampling_factor', 8)
+            return window_stride * subsampling_factor
+        except Exception:
+            # Fallback: 80ms per frame (typical for parakeet-tdt)
+            return 0.08


The nemo backend ignored timestamp_granularities and always returned a single segment with start=0 end=0, making word-level timestamps impossible to obtain even though the NeMo models (parakeet-tdt, etc.) fully support them.

Changes:
- Add _get_stride_seconds() to compute frame duration from the model's preprocessor window_stride and encoder subsampling_factor.
- Add _build_segments_with_words() that extracts word offsets from the NeMo Hypothesis.timestamp dict and converts frame indices to nanosecond timestamps.
- Support 'word' granularity (one segment per word) and 'segment' granularity (merge at time-gap boundaries using a dynamic threshold).
- Populate TranscriptSegment.words with TranscriptWord entries so callers get both segment-level and word-level timing.
- Only request timestamps from NeMo when the caller actually asks for them (timestamp_granularities is non-empty), keeping the fast path unchanged for callers that don't need timestamps.

Tested with nvidia/parakeet-tdt-0.6b-v3 on the JFK "ask not" clip:

curl -X POST /v1/audio/transcriptions \
  -F file=@jfk.wav -F model=nemo-parakeet-tdt-0.6b \
  -F 'timestamp_granularities[]=word' -F response_format=verbose_json

→ each word has correct start/end times in seconds.

Signed-off-by: fqscfqj <fqscfqj@outlook.com>

- Narrow exception handling in _get_stride_seconds to catch only AttributeError, KeyError, TypeError instead of bare Exception, and emit a warning when falling back to the hardcoded stride.
- Remove explicit return_hypotheses=False when timestamps are requested; timestamps=True already forces NeMo to return Hypothesis objects.
- Add a warning when NeMo does not return Hypothesis objects despite timestamps being requested.

Signed-off-by: fqscfqj <fqscfqj@outlook.com>
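The narrowed exception handling described in this commit might look roughly like the sketch below. It is a standalone approximation, not the committed code: the function takes the model as an argument, the logger and warning text are invented for the example, and `SimpleNamespace` objects stand in for a loaded NeMo model.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger(__name__)

DEFAULT_STRIDE_SECONDS = 0.08  # 80 ms per frame, typical for parakeet-tdt


def get_stride_seconds(model):
    """Seconds per frame: preprocessor window_stride * encoder subsampling_factor."""
    try:
        window_stride = model.preprocessor._cfg.get("window_stride", 0.01)
        subsampling_factor = getattr(model.encoder, "subsampling_factor", 8)
        return window_stride * subsampling_factor
    except (AttributeError, KeyError, TypeError) as exc:
        # Only the expected lookup failures are handled; anything else propagates.
        logger.warning(
            "Could not read stride from model (%s); falling back to %.2fs",
            exc,
            DEFAULT_STRIDE_SECONDS,
        )
        return DEFAULT_STRIDE_SECONDS


# Stand-in objects: one with the expected attributes, one without.
configured = SimpleNamespace(
    preprocessor=SimpleNamespace(_cfg={"window_stride": 0.01}),
    encoder=SimpleNamespace(subsampling_factor=8),
)
unconfigured = SimpleNamespace()

stride_ok = get_stride_seconds(configured)
stride_fallback = get_stride_seconds(unconfigured)
```

Catching a narrow tuple keeps unrelated failures visible instead of silently hiding them behind the 80 ms default.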

Copilot AI review requested due to automatic review settings June 13, 2026 02:47

Copilot AI reviewed Jun 13, 2026

fqscfqj force-pushed the fix/nemo-word-timestamps branch from 7c38082 to 7d9689a Compare June 13, 2026 02:50

fqscfqj wants to merge 2 commits into mudler:master from fqscfqj:fix/nemo-word-timestamps
