SOLR-18208: Replace abandoned langdetect dependency with maintained fork#4326
janhoy wants to merge 5 commits into apache:main
Replace com.cybozu.labs:langdetect (abandoned since 2012) with io.github.azagniotov:language-detection:12.5.2, a maintained fork with an active release history. The new library bundles its own language profiles, so the 53 profile files previously shipped in the langid module resources are removed. The factory no longer loads profiles at startup; it creates a shared LanguageDetectionOrchestrator instead. The processor converts the field-content Reader to a String and calls orchestrator.detectAll(). commons-io was only used for profile loading and is also removed from the langid module dependencies.
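A minimal sketch of the processor-side flow described above (drain the field-content Reader into a String, then hand it to the shared detector). Everything named here is an assumption for illustration: `Detector` and `Guess` are hypothetical stand-ins for the library's orchestrator and result type, whose real signatures are not shown in this description, and `readFully` is an illustrative helper rather than the code in this PR.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringWriter;
import java.util.List;

final class DetectSketch {
  /** Hypothetical result type: a language code plus a probability. */
  static final class Guess {
    final String langCode;
    final double probability;

    Guess(String langCode, double probability) {
      this.langCode = langCode;
      this.probability = probability;
    }
  }

  /** Hypothetical stand-in for the shared orchestrator; NOT the library's real interface. */
  interface Detector {
    List<Guess> detectAll(String text);
  }

  /** Drains the Reader into a String, because the detector is assumed to take String input. */
  static String readFully(Reader reader) throws IOException {
    StringWriter out = new StringWriter();
    reader.transferTo(out); // java.io.Reader#transferTo, available since Java 10
    return out.toString();
  }

  /** Reader -> String -> detectAll(), returning the detector's results unchanged. */
  static List<Guess> detect(Detector detector, Reader content) throws IOException {
    return detector.detectAll(readFully(content));
  }
}
```

Reading the whole Reader into memory is the simplest option when the detector only accepts a String; a real processor may want to cap how much text it reads, which this sketch does not do.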
Pull request overview
This PR migrates Solr’s langid module from the abandoned com.cybozu.labs:langdetect to the maintained fork io.github.azagniotov:language-detection (SOLR-18208), removing the legacy bundled profile resources and updating processor wiring, dependencies, tests, and license metadata accordingly.
Changes:
- Replace `com.cybozu.labs:langdetect` usage with `io.github.azagniotov:language-detection` and update the update-processor factory/processor integration.
- Remove shipped `langdetect-profiles/*` resources and the related RAT exclusion.
- Update Gradle dependency catalogs/lockfiles and add the new dependency’s LICENSE/NOTICE/SHA1 files; adjust tests for behavior differences.
Reviewed changes
Copilot reviewed 45 out of 69 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| solr/modules/langid/src/java/org/apache/solr/update/processor/LangDetectLanguageIdentifierUpdateProcessorFactory.java | Builds and supplies a shared LanguageDetectionOrchestrator to processor instances; removes old profile-loading code. |
| solr/modules/langid/src/java/org/apache/solr/update/processor/LangDetectLanguageIdentifierUpdateProcessor.java | Switches detection to orchestrator.detectAll() and maps results to Solr’s DetectedLanguage. |
| solr/modules/langid/src/test/org/apache/solr/update/processor/LangDetectLanguageIdentifierUpdateProcessorFactoryTest.java | Adjusts test samples/expectations for the new detector’s behavior and adds a replacement multivalue test. |
| solr/modules/langid/build.gradle | Drops old deps (commons-io, cybozu langdetect) and adds the new language-detection dependency alias. |
| solr/modules/langid/gradle.lockfile | Removes old locked artifacts and adds io.github.azagniotov:language-detection:12.5.2. |
| gradle/libs.versions.toml | Adds version + catalog entry for io.github.azagniotov:language-detection. |
| gradle/validation/rat-sources.gradle | Removes the langdetect-profiles/* exclusion now that those resources are deleted. |
| solr/licenses/language-detection-LICENSE-ASL.txt | Adds the Apache 2.0 license text for the new dependency. |
| solr/licenses/language-detection-NOTICE.txt | Adds NOTICE metadata for the new dependency. |
| solr/licenses/language-detection-12.5.2.jar.sha1 | Adds checksum for the new dependency jar. |
| solr/licenses/langdetect-NOTICE.txt | Removes NOTICE metadata for the old dependency. |
| solr/licenses/langdetect-LICENSE-ASL.txt | Removes license file for the old dependency. |
| solr/licenses/langdetect-1.1-20120112.jar.sha1 | Removes checksum for the old dependency jar. |
| solr/licenses/jsonic-NOTICE.txt | Removes NOTICE metadata for jsonic (previously pulled in by old langdetect). |
| solr/licenses/jsonic-1.2.7.jar.sha1 | Removes checksum for jsonic. |
| changelog/unreleased/SOLR-18208-replace-langdetect.yml | Adds an unreleased changelog entry for the dependency replacement. |
| solr/modules/langid/src/resources/org/apache/solr/update/processor/langdetect-profiles/af | Removes legacy bundled profile resource. |
| solr/modules/langid/src/resources/org/apache/solr/update/processor/langdetect-profiles/gu | Removes legacy bundled profile resource. |
| solr/modules/langid/src/resources/org/apache/solr/update/processor/langdetect-profiles/id | Removes legacy bundled profile resource. |
| solr/modules/langid/src/resources/org/apache/solr/update/processor/langdetect-profiles/it | Removes legacy bundled profile resource. |
| solr/modules/langid/src/resources/org/apache/solr/update/processor/langdetect-profiles/ko | Removes legacy bundled profile resource. |
| solr/modules/langid/src/resources/org/apache/solr/update/processor/langdetect-profiles/so | Removes legacy bundled profile resource. |
| solr/modules/langid/src/resources/org/apache/solr/update/processor/langdetect-profiles/sq | Removes legacy bundled profile resource. |
| solr/modules/langid/src/resources/org/apache/solr/update/processor/langdetect-profiles/sw | Removes legacy bundled profile resource. |
| solr/modules/langid/src/resources/org/apache/solr/update/processor/langdetect-profiles/tl | Removes legacy bundled profile resource. |
| solr/modules/langid/src/resources/org/apache/solr/update/processor/langdetect-profiles/vi | Removes legacy bundled profile resource. |
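
The factory row in the table above describes building one detector and supplying it to every processor instance. A rough sketch of that sharing pattern follows, with all names hypothetical (`Detector`, `Processor`, `Factory` are illustrative and not the classes in this PR). It assumes, without verifying, that the shared detector is safe for concurrent use.

```java
import java.util.function.Supplier;

final class SharedDetectorSketch {
  /** Hypothetical stand-in for the library's orchestrator. */
  interface Detector {
    String detect(String text);
  }

  /** Hypothetical per-request processor that only borrows the shared detector. */
  static final class Processor {
    final Detector detector;

    Processor(Detector detector) {
      this.detector = detector;
    }
  }

  /** Hypothetical factory: builds the detector once and reuses it for every processor. */
  static final class Factory {
    private final Detector shared;

    Factory(Supplier<Detector> builder) {
      this.shared = builder.get(); // the (assumed expensive) build happens exactly once
    }

    Processor getInstance() {
      return new Processor(shared); // cheap: no per-request detector construction
    }
  }
}
```

Building the detector in the factory keeps any one-time setup cost out of the per-request path.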

    @Override
    protected LanguageIdentifierUpdateProcessor createLangIdProcessor(ModifiableSolrParams parameters)
        throws Exception {
      return new LangDetectLanguageIdentifierUpdateProcessor(
          _parser.buildRequestFrom(h.getCore(), parameters, null), resp, null);
          _parser.buildRequestFrom(h.getCore(), parameters, null),

I always disliked the `_` prefix pattern; maybe we can finally eradicate it here? Totally on board if that is "out of scope".

     * <ul>
     *   <li>The "too short" test case is replaced with Japanese (this detector returns Hungarian for
     *       short ambiguous text rather than "un").
     *   <li>The Russian text is replaced with a cleaner Cyrillic-only sample. The base class uses a

Do we need all this commenting? It's absolutely wonderful while I am reading the PR, but once committed, it will be confusing. So maybe remove it before merge?
Someday we can have comments in the code that live just for the branch, and comments that live permanently ;-)
No, we don’t. It belongs more in a PR comment, I believe.
epugh left a comment
A joy to read the code!
I think my one request is that some of the javadocs make sense in explaining to a reader the changes you are making, but once merged, are confusing. Maybe rework the javadocs to explain the nuances of the library, but without referring to the previous version? And any "previous version/current version" notes should go in the Major Changes? There is good stuff there to educate someone upgrading.
Some tests needed reworking to pass because the two libraries behave differently and because the new library supports more languages, which introduces some ambiguity.
https://issues.apache.org/jira/browse/SOLR-18208
Implemented entirely by Claude Code