feat(nemo): enable word-level timestamps for ASR models #10297
Open
fqscfqj wants to merge 2 commits into
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds optional timestamped transcription output by requesting NeMo word timestamps and converting them into TranscriptSegment/TranscriptWord responses.
Changes:
- Add stride estimation helper for converting NeMo frame offsets into seconds/nanoseconds.
- Add segment/word timestamp post-processing that can either emit per-word segments or merge words into larger segments.
- Update AudioTranscription to conditionally request timestamps based on request.timestamp_granularities.
Comment on lines +225 to +233
if want_timestamps:
    # Request timestamps from NeMo
    results = self.model.transcribe([audio_path], timestamps=True, return_hypotheses=False)

    if results and len(results) > 0:
        hypotheses = results[0] if isinstance(results[0], list) else results
        if hypotheses and len(hypotheses) > 0:
            hypothesis = hypotheses[0]
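The unwrapping logic in this hunk can be isolated into a small helper, which makes the empty-result cases easier to see. This is only a sketch: `first_hypothesis` is a hypothetical name that does not appear in the PR, and it assumes `results` is either a flat list of hypotheses or a list of lists, as the hunk itself implies.

```python
from typing import Any, Optional, Sequence


def first_hypothesis(results: Optional[Sequence[Any]]) -> Optional[Any]:
    """Return the first hypothesis from a flat or nested result list, or None.

    Hypothetical helper mirroring the checks in the hunk above.
    """
    if not results:
        return None
    # Results may be nested (one list of hypotheses per input); unwrap one level.
    hypotheses = results[0] if isinstance(results[0], list) else results
    if not hypotheses:
        return None
    return hypotheses[0]
```

Returning `None` for every empty shape lets the caller fall back to the untimestamped path with a single check.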
Comment on lines +87 to +99
def _get_stride_seconds(self):
    """Compute the seconds-per-frame stride for the loaded model.

    stride = preprocessor_window_stride * encoder_subsampling_factor
    """
    try:
        preprocessor = self.model.preprocessor
        window_stride = preprocessor._cfg.get('window_stride', 0.01)
        subsampling_factor = getattr(self.model.encoder, 'subsampling_factor', 8)
        return window_stride * subsampling_factor
    except Exception:
        # Fallback: 80ms per frame (typical for parakeet-tdt)
        return 0.08
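To make the arithmetic concrete, here is a minimal sketch of how a stride like this could turn frame offsets into nanosecond timestamps. The helper name `frames_to_nanoseconds` and the frame offsets 3 and 7 are illustrative assumptions, not values taken from the PR; only the 0.01 s window stride and the subsampling factor of 8 come from the defaults in the snippet.

```python
def frames_to_nanoseconds(frame_offset: int, stride_seconds: float) -> int:
    """Convert an encoder frame offset to an integer nanosecond timestamp."""
    # Round before truncating so float error cannot drop a nanosecond.
    return int(round(frame_offset * stride_seconds * 1e9))


# Defaults from the snippet: 0.01 s window stride, subsampling factor 8.
stride = 0.01 * 8  # 0.08 s per encoder frame

# Illustrative offsets (assumed): frames 3 and 7.
start_ns = frames_to_nanoseconds(3, stride)  # 0.24 s
end_ns = frames_to_nanoseconds(7, stride)    # 0.56 s
```

Rounding to an integer at the end keeps the stored timestamps free of accumulated floating-point drift.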
The nemo backend ignored timestamp_granularities and always returned a
single segment with start=0 end=0, making word-level timestamps
impossible to obtain even though the NeMo models (parakeet-tdt, etc.)
fully support them.
Changes:
- Add _get_stride_seconds() to compute frame duration from the model's
preprocessor window_stride and encoder subsampling_factor.
- Add _build_segments_with_words() that extracts word offsets from the
NeMo Hypothesis.timestamp dict and converts frame indices to
nanosecond timestamps.
- Support 'word' granularity (one segment per word) and 'segment'
granularity (merge at time-gap boundaries using a dynamic threshold).
- Populate TranscriptSegment.words with TranscriptWord entries so
callers get both segment-level and word-level timing.
- Only request timestamps from NeMo when the caller actually asks for
them (timestamp_granularities is non-empty), keeping the fast path
unchanged for callers that don't need timestamps.
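The 'word' granularity path in the list above could look roughly like the sketch below. It is not the PR's `_build_segments_with_words()`: the `Word` and `Segment` dataclasses are stand-ins for the TranscriptWord/TranscriptSegment protobuf messages, and the dictionary keys `word`, `start_offset`, and `end_offset` are assumptions about the shape of the entries in NeMo's word timestamp list.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Word:  # stand-in for TranscriptWord (assumed fields)
    start: int  # nanoseconds
    end: int    # nanoseconds
    text: str


@dataclass
class Segment:  # stand-in for TranscriptSegment (assumed fields)
    id: int
    start: int
    end: int
    text: str
    words: List[Word] = field(default_factory=list)


def word_segments(word_offsets: List[Dict], stride_seconds: float) -> List[Segment]:
    """Emit one segment per word, each carrying a single word entry."""
    segments = []
    for i, entry in enumerate(word_offsets):
        # Assumed keys: frame offsets that are scaled by the stride.
        start = int(round(entry["start_offset"] * stride_seconds * 1e9))
        end = int(round(entry["end_offset"] * stride_seconds * 1e9))
        word = Word(start=start, end=end, text=entry["word"])
        segments.append(Segment(id=i, start=start, end=end, text=entry["word"], words=[word]))
    return segments
```

Keeping the word entry inside each single-word segment means callers can read timing from either level without special-casing the granularity.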
Tested with nvidia/parakeet-tdt-0.6b-v3 on the JFK "ask not" clip:
curl -X POST /v1/audio/transcriptions \
-F file=@jfk.wav -F model=nemo-parakeet-tdt-0.6b \
-F 'timestamp_granularities[]=word' -F response_format=verbose_json
→ each word has correct start/end times in seconds.
Signed-off-by: fqscfqj <fqscfqj@outlook.com>
7c38082 to 7d9689a
- Narrow exception handling in _get_stride_seconds to catch only AttributeError, KeyError, TypeError instead of bare Exception, and emit a warning when falling back to the hardcoded stride.
- Remove explicit return_hypotheses=False when timestamps are requested; timestamps=True already forces NeMo to return Hypothesis objects.
- Add a warning when NeMo does not return Hypothesis objects despite timestamps being requested.
Signed-off-by: fqscfqj <fqscfqj@outlook.com>
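A sketch of what the narrowed handler might look like, written as a free function so it can run without NeMo. The `SimpleNamespace` objects are stand-ins for a loaded model, and the log message wording is an assumption; only the exception types, the defaults, and the 0.08 s fallback come from the PR text.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger(__name__)

DEFAULT_STRIDE_SECONDS = 0.08  # fallback named in the PR: 80 ms per frame


def get_stride_seconds(model) -> float:
    """Derive seconds-per-frame from the model, warning on fallback."""
    try:
        window_stride = model.preprocessor._cfg.get("window_stride", 0.01)
        subsampling_factor = getattr(model.encoder, "subsampling_factor", 8)
        return window_stride * subsampling_factor
    except (AttributeError, KeyError, TypeError) as exc:
        # Only expected lookup/shape errors are swallowed; anything else propagates.
        logger.warning(
            "Could not derive stride from model (%r); falling back to %.2fs",
            exc,
            DEFAULT_STRIDE_SECONDS,
        )
        return DEFAULT_STRIDE_SECONDS


# Stand-in models (hypothetical): one well-formed, one missing its preprocessor.
good = SimpleNamespace(
    preprocessor=SimpleNamespace(_cfg={"window_stride": 0.02}),
    encoder=SimpleNamespace(subsampling_factor=4),
)
broken = SimpleNamespace()  # accessing .preprocessor raises AttributeError
```

Catching a specific tuple keeps unrelated failures visible, while the warning makes a silent fallback to the hardcoded stride easy to spot in logs.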
Problem
The nemo backend ignores timestamp_granularities and always returns a single segment with start=0 end=0, making word-level timestamps impossible to obtain even though NeMo ASR models (parakeet-tdt, etc.) fully support them via model.transcribe(timestamps=True).
Root Cause
The AudioTranscription method in backend/python/nemo/backend.py never passes timestamps=True to NeMo's transcribe() call, and never extracts the word-level offset data from the Hypothesis.timestamp dict.
Changes
- _get_stride_seconds(): computes frame duration from the model's preprocessor.window_stride × encoder.subsampling_factor (defaults to 80ms for parakeet-tdt).
- _build_segments_with_words(): extracts word offsets from the NeMo Hypothesis.timestamp['word'] dict and converts frame indices to nanosecond timestamps compatible with the TranscriptWord/TranscriptSegment protobuf messages.
- "word" granularity: one TranscriptSegment per word, each with a single TranscriptWord entry.
- "segment" granularity (default): merges consecutive words into sentence-level segments, splitting at word-level time gaps that exceed a dynamically computed threshold (median gap × 3, clamped to [0.3s, 2.0s]).
- Populates TranscriptSegment.words with TranscriptWord entries so callers get both segment-level and word-level timing.
- Only requests timestamps from NeMo when the caller actually asks for them (timestamp_granularities is non-empty), keeping the fast path unchanged for callers that don't need timestamps.
Testing
Tested with nvidia/parakeet-tdt-0.6b-v3 on the JFK "ask not" clip:
curl -X POST http://localhost:8080/v1/audio/transcriptions \
  -F file=@jfk.wav -F model=nemo-parakeet-tdt-0.6b \
  -F 'timestamp_granularities[]=word' -F response_format=verbose_json
Response (excerpt):
{
  "segments": [
    {"id": 0, "start": 0.24, "end": 0.56, "text": "And", "words": [{"start": 0.24, "end": 0.56, "text": "And"}]},
    {"id": 1, "start": 0.56, "end": 0.88, "text": "so,", "words": [{"start": 0.56, "end": 0.88, "text": "so,"}]},
    ...
  ],
  "text": "And so, my fellow Americans, ask not what your country can do for you, ask what you can do for your country."
}
Each word has correct start/end times in seconds.
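The segment-merging rule from the Changes list (median gap × 3, clamped to [0.3s, 2.0s]) can be sketched as below. The description states only the formula, so several details here are assumptions: a gap is taken as the next word's start minus the previous word's end, every gap (including zero) feeds the median, and an input with no gaps uses the lower clamp. The function names are hypothetical.

```python
from statistics import median
from typing import List, Tuple

WordSpan = Tuple[str, float, float]  # (text, start_seconds, end_seconds)


def gap_threshold(words: List[WordSpan]) -> float:
    """Median inter-word gap x 3, clamped to [0.3, 2.0] seconds."""
    gaps = [max(0.0, nxt[1] - cur[2]) for cur, nxt in zip(words, words[1:])]
    if not gaps:
        return 0.3  # assumed fallback when there is at most one word
    return min(2.0, max(0.3, median(gaps) * 3))


def merge_words(words: List[WordSpan]) -> List[List[WordSpan]]:
    """Group consecutive words, starting a new group at each oversized gap."""
    if not words:
        return []
    threshold = gap_threshold(words)
    segments = [[words[0]]]
    for prev, cur in zip(words, words[1:]):
        if cur[1] - prev[2] > threshold:
            segments.append([cur])
        else:
            segments[-1].append(cur)
    return segments
```

Deriving the threshold from the clip's own gaps lets fast and slow speakers split at comparable pauses, and the clamp bounds the result when the median is unusually small or large.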
Related: #10012, #10134