Skip to content

Add option to TextSplitter to return individual sentences. Adding general SaT model support.#93

Open
hhuangMITRE wants to merge 13 commits intodevelopfrom
feature/nlp-text-splitter-sentence-mode-sat-model-update
Open

Add option to TextSplitter to return individual sentences. Adding general SaT model support.#93
hhuangMITRE wants to merge 13 commits intodevelopfrom
feature/nlp-text-splitter-sentence-mode-sat-model-update

Conversation

@hhuangMITRE
Copy link
Contributor

@hhuangMITRE hhuangMITRE commented Sep 23, 2025

Issues:

Related PRs:

Summary:

This PR updates the nlp_text_splitter to add SaT (segment any text, https://github.com/segment-any-text/wtpsplit) model support, newline handling, sentence splitting options, and preferred-limit chunking.

Before this PR, only WtP and spaCy models are available for text segmentation. There also was a general segmentation strategy: estimate a sentence break near a hard size limit, then walk back to the nearest sentence boundary to generate the largest possible chunk. This works well for components with a large character or token text limit, however it may create issues for other components where a smaller text limit is needed (where possible).

This update adds:

  • SaT model support.

  • Newline normalization so line breaks can be treated as spaces, removed, or preserved depending on language/script estimated.

  • New options for chunk vs individual sentence splitting.

  • A new preferred/soft limit so users can try to generate smaller chunks while still respecting a hard text limit.

  • Along the way, an improved breakpoint alignment function was added so sentence boundaries are computed against the original text, as SaT and WtP appeared to remove extraneous whitespace while splitting. Additional changes were made to handle edge cases for empty outputs, zero-length text, and mid-word splits

  • Finally, support for Flores/NLLB language codes in WtpLanguageSettings is being transferred over to this PR.


This change is Reviewable

@hhuangMITRE hhuangMITRE requested a review from jrobble September 23, 2025 20:24
@hhuangMITRE hhuangMITRE self-assigned this Sep 23, 2025
@hhuangMITRE hhuangMITRE changed the title Feature/nlp text splitter sentence mode sat model update Add option to TextSplitter to return individual sentences. Sep 23, 2025
@hhuangMITRE hhuangMITRE changed the title Add option to TextSplitter to return individual sentences. Add option to TextSplitter to return individual sentences. Adding SaT model support. Sep 23, 2025
@hhuangMITRE hhuangMITRE changed the title Add option to TextSplitter to return individual sentences. Adding SaT model support. Add option to TextSplitter to return individual sentences. Adding general SaT model support. Sep 23, 2025
Copy link
Member

@jrobble jrobble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jrobble reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @hhuangMITRE)


a discussion (no related file):
Mention SaT here:

# To hold spaCy, WtP, and other potential sentence detection models in cache

Mention SaT here:

            log.warning(
                "Invalid model setting '%s'. Only `cpu` and `cuda` "
                        "(or `gpu`) WtP model options available at this time. "
                        "Defaulting to `cpu` mode.", model_setting)

Mention SaT in install.sh and LICENSE.


detection/nlp_text_splitter/nlp_text_splitter/__init__.py line 83 at r1 (raw file):

            self._update_wtp_model(model_name, model_setting, default_lang)
            self.split = self._split_wtp
            log.info("Setup WtP model: %s", model_name)

Generally, 'f' strings are preferred since they keep the variable name inline with the text. It makes things easier to read.


detection/nlp_text_splitter/tests/test_text_splitter.py line 68 at r1 (raw file):

        self.assertEqual(2, len(actual))
        self.assertEqual('Hello, what is your name? ', actual[0])
        self.assertEqual('My name is John.', actual[1])

These asserts as the same as above test_sat_basic_sentence_split test. I would feel better if we can prove that the different splitting behaviors return different results.


detection/nlp_text_splitter/tests/test_text_splitter.py line 104 at r1 (raw file):

            500,
            len,
            self.sat_model,split_mode=SplitMode.SENTENCE))

Formatting nitpick: Move split_mode to next line.


detection/nlp_text_splitter/tests/test_text_splitter.py line 106 at r1 (raw file):

            self.sat_model,split_mode=SplitMode.SENTENCE))
        self.assertEqual(input_text, ''.join(actual))
        self.assertEqual(2, len(actual))

These asserts as the same as above. I would feel better if we can prove that the different splitting behaviors return different results.

Copy link
Contributor Author

@hhuangMITRE hhuangMITRE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 8 files reviewed, 4 unresolved discussions (waiting on @hhuangMITRE and @jrobble)


a discussion (no related file):

Previously, jrobble (Jeff Robble) wrote…

Mention SaT here:

# To hold spaCy, WtP, and other potential sentence detection models in cache

Mention SaT here:

            log.warning(
                "Invalid model setting '%s'. Only `cpu` and `cuda` "
                        "(or `gpu`) WtP model options available at this time. "
                        "Defaulting to `cpu` mode.", model_setting)

Mention SaT in install.sh and LICENSE.

Done.


detection/nlp_text_splitter/nlp_text_splitter/__init__.py line 83 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Generally, 'f' strings are preferred since they keep the variable name inline with the text. It makes things easier to read.

Updated, thanks!


detection/nlp_text_splitter/tests/test_text_splitter.py line 68 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

These asserts as the same as above test_sat_basic_sentence_split test. I would feel better if we can prove that the different splitting behaviors return different results.

I've added in the new test cases. There's also some new differences in translation which I've added to the other PR.


detection/nlp_text_splitter/tests/test_text_splitter.py line 104 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Formatting nitpick: Move split_mode to next line.

Done!


detection/nlp_text_splitter/tests/test_text_splitter.py line 106 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

These asserts as the same as above. I would feel better if we can prove that the different splitting behaviors return different results.

I've tweaked the test, right now SaT seems more sensitive to splitting it seems.

@hhuangMITRE hhuangMITRE requested a review from gonzalezjo March 9, 2026 03:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants