SOLR-18208: Replace abandoned langdetect dependency with maintained fork #4326

Open
janhoy wants to merge 5 commits into apache:main from janhoy:feature/SOLR-18208-replace-langdetect-dependency

Conversation

@janhoy
Contributor

@janhoy janhoy commented Apr 23, 2026

Replace com.cybozu.labs:langdetect (abandoned since 2012) with io.github.azagniotov:language-detection:12.5.2, a maintained fork with an active release history.

The new library bundles its own language profiles, so the 53 profile files previously shipped in the langid module resources are removed. The factory no longer loads profiles at startup; it creates a shared LanguageDetectionOrchestrator instead. The processor converts the field-content Reader to a String and calls orchestrator.detectAll().
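
For readers who want to picture the flow described above, here is a minimal sketch of the "one shared detector, Reader converted to String, then detect" pattern. It is not the code in this PR: `LangIdFlowSketch`, `Candidate`, `readFully`, and `detect` are hypothetical names, and the `Function<String, List<Candidate>>` parameter is only a stand-in for the shared LanguageDetectionOrchestrator, whose real API is not reproduced here.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.List;
import java.util.function.Function;

public class LangIdFlowSketch {

  /** Hypothetical result type: a language code and the detector's certainty. */
  static final class Candidate {
    final String langCode;
    final double certainty;

    Candidate(String langCode, double certainty) {
      this.langCode = langCode;
      this.certainty = certainty;
    }
  }

  /** Drains the Reader into a String, because the detector works on whole strings. */
  static String readFully(Reader reader) throws IOException {
    StringWriter out = new StringWriter();
    reader.transferTo(out); // Reader.transferTo is available since Java 10
    return out.toString();
  }

  /**
   * The per-document step: the detector is created once elsewhere (by the factory)
   * and passed in, so each call only converts the content and delegates to it.
   */
  static List<Candidate> detect(
      Function<String, List<Candidate>> sharedDetector, Reader content) throws IOException {
    return sharedDetector.apply(readFully(content));
  }

  public static void main(String[] args) throws IOException {
    // Stub detector for illustration only; it stands in for orchestrator.detectAll().
    Function<String, List<Candidate>> stub =
        text -> List.of(new Candidate("en", 0.9), new Candidate("nl", 0.1));

    for (Candidate c : detect(stub, new StringReader("This is an English sentence."))) {
      System.out.println(c.langCode + " " + c.certainty);
    }
  }
}
```

Passing the detector in, rather than building it per call, mirrors the idea that the factory owns a single shared instance while processors stay cheap to create.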

commons-io was only used for profile loading and is also removed from the langid module dependencies.

Some tests needed reworking to pass because the two libraries behave differently and because the new library supports more languages, which introduces some ambiguity.

https://issues.apache.org/jira/browse/SOLR-18208

Implemented entirely by Claude Code

Contributor

Copilot AI left a comment

Pull request overview

This PR migrates Solr’s langid module from the abandoned com.cybozu.labs:langdetect to the maintained fork io.github.azagniotov:language-detection (SOLR-18208), removing the legacy bundled profile resources and updating processor wiring, dependencies, tests, and license metadata accordingly.

Changes:

  • Replace com.cybozu.labs:langdetect usage with io.github.azagniotov:language-detection and update the update-processor factory/processor integration.
  • Remove shipped langdetect-profiles/* resources and the related RAT exclusion.
  • Update Gradle dependency catalogs/lockfiles and add the new dependency’s LICENSE/NOTICE/SHA1 files; adjust tests for behavior differences.

Reviewed changes

Copilot reviewed 45 out of 69 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| solr/modules/langid/src/java/org/apache/solr/update/processor/LangDetectLanguageIdentifierUpdateProcessorFactory.java | Builds and supplies a shared LanguageDetectionOrchestrator to processor instances; removes old profile-loading code. |
| solr/modules/langid/src/java/org/apache/solr/update/processor/LangDetectLanguageIdentifierUpdateProcessor.java | Switches detection to orchestrator.detectAll() and maps results to Solr’s DetectedLanguage. |
| solr/modules/langid/src/test/org/apache/solr/update/processor/LangDetectLanguageIdentifierUpdateProcessorFactoryTest.java | Adjusts test samples/expectations for the new detector’s behavior and adds a replacement multivalue test. |
| solr/modules/langid/build.gradle | Drops old deps (commons-io, cybozu langdetect) and adds the new language-detection dependency alias. |
| solr/modules/langid/gradle.lockfile | Removes old locked artifacts and adds io.github.azagniotov:language-detection:12.5.2. |
| gradle/libs.versions.toml | Adds version + catalog entry for io.github.azagniotov:language-detection. |
| gradle/validation/rat-sources.gradle | Removes the langdetect-profiles/* exclusion now that those resources are deleted. |
| solr/licenses/language-detection-LICENSE-ASL.txt | Adds the Apache 2.0 license text for the new dependency. |
| solr/licenses/language-detection-NOTICE.txt | Adds NOTICE metadata for the new dependency. |
| solr/licenses/language-detection-12.5.2.jar.sha1 | Adds checksum for the new dependency jar. |
| solr/licenses/langdetect-NOTICE.txt | Removes NOTICE metadata for the old dependency. |
| solr/licenses/langdetect-LICENSE-ASL.txt | Removes license file for the old dependency. |
| solr/licenses/langdetect-1.1-20120112.jar.sha1 | Removes checksum for the old dependency jar. |
| solr/licenses/jsonic-NOTICE.txt | Removes NOTICE metadata for jsonic (previously pulled in by old langdetect). |
| solr/licenses/jsonic-1.2.7.jar.sha1 | Removes checksum for jsonic. |
| changelog/unreleased/SOLR-18208-replace-langdetect.yml | Adds an unreleased changelog entry for the dependency replacement. |
| solr/modules/langid/src/resources/org/apache/solr/update/processor/langdetect-profiles/af | Removes legacy bundled profile resource. |
| solr/modules/langid/src/resources/org/apache/solr/update/processor/langdetect-profiles/gu | Removes legacy bundled profile resource. |
| solr/modules/langid/src/resources/org/apache/solr/update/processor/langdetect-profiles/id | Removes legacy bundled profile resource. |
| solr/modules/langid/src/resources/org/apache/solr/update/processor/langdetect-profiles/it | Removes legacy bundled profile resource. |
| solr/modules/langid/src/resources/org/apache/solr/update/processor/langdetect-profiles/ko | Removes legacy bundled profile resource. |
| solr/modules/langid/src/resources/org/apache/solr/update/processor/langdetect-profiles/so | Removes legacy bundled profile resource. |
| solr/modules/langid/src/resources/org/apache/solr/update/processor/langdetect-profiles/sq | Removes legacy bundled profile resource. |
| solr/modules/langid/src/resources/org/apache/solr/update/processor/langdetect-profiles/sw | Removes legacy bundled profile resource. |
| solr/modules/langid/src/resources/org/apache/solr/update/processor/langdetect-profiles/tl | Removes legacy bundled profile resource. |
| solr/modules/langid/src/resources/org/apache/solr/update/processor/langdetect-profiles/vi | Removes legacy bundled profile resource. |

github-actions bot added the documentation label Apr 23, 2026

  @Override
  protected LanguageIdentifierUpdateProcessor createLangIdProcessor(ModifiableSolrParams parameters)
      throws Exception {
    return new LangDetectLanguageIdentifierUpdateProcessor(
-       _parser.buildRequestFrom(h.getCore(), parameters, null), resp, null);
+       _parser.buildRequestFrom(h.getCore(), parameters, null),
Contributor

I have always disliked the _ prefix pattern; maybe we can finally eradicate it here? Totally on board if that is "out of scope".

* <ul>
* <li>The "too short" test case is replaced with Japanese (this detector returns Hungarian for
* short ambiguous text rather than "un").
* <li>The Russian text is replaced with a cleaner Cyrillic-only sample. The base class uses a
Contributor

Do we need all this commenting? It's absolutely wonderful while I am reading the PR, but once committed, it will be confusing. So maybe remove before merge?

Someday we can have comments in the code that live just for the branch... and comments that live permanently ;-)

Contributor Author

No, we don’t. It belongs more in a PR comment, I believe.

Contributor

@epugh epugh left a comment

A joy to read the code!

I think my one request is that some of the javadocs make sense in explaining to a reader the changes you are making, but once merged, are confusing. Maybe rework the javadocs to explain the nuances of the library, but without referring to the previous version? And any "previous version/current version" notes should go in the Major Changes? There is good stuff there to educate someone upgrading.

Labels

cat:index, dependencies, documentation, module:langid, tests, tool:build

3 participants