Implements parseAndNormalize(code) which parses JavaScript using goja's parser and walks the resulting AST to produce a normalized Token stream. Identifiers collapse to KindVar, literals to their typed kinds, and structural constructs emit KindKeyword/KindPunct tokens so the stream still encodes program structure. Unknown node types fall through to KindUnknown to keep the walker deterministic as more JS constructs surface in subsequent tasks. Promotes github.com/dop251/goja from indirect to direct dependency.
…t dual-channel error
Scaffolds the data-layer entities and gormigrate migration for the Phase 2 code similarity detection and integrity review feature:
- Fingerprint / SimilarPair / SuspectSummary
- SimilarityWhitelist / IntegrityWhitelist / IntegrityReview

Registers T20260414 in migrations/init.go.
- Replace errors.Is(err, gorm.ErrRecordNotFound) with db.RecordNotFound(err) to match the canonical CaGo helper used in 64 other repo sites.
- Drop createtime/updatetime mutation from Upsert; service layer owns the clock per existing repo convention (see internal/service/* sites).
- UpdateParseStatus now takes scannedAt int64 explicitly so the repo stays clock-free; regenerate mock to reflect the new signature.
docs/docs.go is regenerated by swag and its legacy strings.Replace calls trigger staticcheck QF1004 — we don't own the template, so exclude the entire directory rather than chase transient hits.
Adds the Phase 4 operations tooling required by §4.5 and §8.5 bootstrap:
- similarity_patrol crontab handler with two modes: daily Patrol() for incremental catch-up of scripts whose latest code is newer than their last fingerprint scan, and RunBackfill() kicked off by the admin endpoint to iterate every active script from a persisted cursor with rate-limiting and resumable state.
- similarity_repo.PatrolQueryRepo with ListStaleScriptIDs / ListScriptIDsFromCursor / CountScripts for the two modes.
- similarity_svc.BackfillState helpers persisted via system_config: TryAcquireBackfillLock, SetBackfillCursor, FinishBackfill, ResetBackfillCursor. State survives restarts and prevents simultaneous admin clicks from double-starting.
- Admin endpoints POST /admin/similarity/backfill (with reset flag for §8.5 step 9), GET /admin/similarity/backfill/status, POST /admin/similarity/scan/:script_id, and POST /admin/similarity/stop-fp/refresh (§8.5 step 8 on-demand refresh).
- RegisterBackfillRunner + RegisterStopFpRefresher function-injection seams wire the crontab handler methods into admin_svc without an import cycle.
- Production code uses function-typed fields / package vars for all Redis, NSQ producer, system_config, and PatrolQuery dependencies so unit tests can substitute fakes (matches existing similarity_stop_fp pattern).
- 31 new unit tests covering backfill state (9), admin_backfill + stop-fp refresh (11), and patrol handler (11), including resume-from-cursor, ctx cancellation during rate-limit sleep, re-entry guards, and publish-failure continuation.
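- Fix the coverage calculation by replacing Σ CommonCount with an ES cardinality aggregation, eliminating double counting of fingerprints across candidates (spec §4.1 Step 5).
- Add PurgeScriptData to cascade-clean ES/fingerprint/pair/summary data, hooked into the hard-delete path through a new consumer driven by the ScriptDeleteMsg.HardDelete field (spec §4.6).
- Switch the backfill running flag to an atomic Redis SETNX compare-and-set, removing the race when two admins click start at the same time; metadata still lives in system_config (spec §2.3/§4.5).
- Add GetBool/GetFloat/GetInt to DBConfigProvider; Similarity() layers dynamic pre_system_config overrides on top of YAML so the admin console can adjust the 14 similarity.* thresholds and switches in real time (spec §1.1/§6.1).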
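To illustrate the double counting that the coverage fix addresses, here is a minimal in-memory sketch. It assumes coverage means "matched query fingerprints divided by total query fingerprints"; that definition, the `matches` type, and both function names are illustrative and not taken from the PR or the spec. The distinct count plays the role of the ES cardinality aggregation, which in Elasticsearch is an approximate count.

```go
package main

import "fmt"

// matches maps a candidate script ID to the query fingerprints it shares.
// The type and the sample data are hypothetical.
type matches map[int64][]string

// coverageBySum adds per-candidate common counts, so a fingerprint shared
// with several candidates is counted once per candidate.
func coverageBySum(total int, m matches) float64 {
	if total == 0 {
		return 0
	}
	sum := 0
	for _, fps := range m {
		sum += len(fps)
	}
	return float64(sum) / float64(total)
}

// coverageByDistinct counts each matched fingerprint once, which is what a
// distinct-value (cardinality-style) count over matched fingerprints yields.
func coverageByDistinct(total int, m matches) float64 {
	if total == 0 {
		return 0
	}
	seen := make(map[string]struct{})
	for _, fps := range m {
		for _, fp := range fps {
			seen[fp] = struct{}{}
		}
	}
	return float64(len(seen)) / float64(total)
}

func main() {
	// Two candidates overlap on fingerprints "b" and "c".
	m := matches{1: {"a", "b", "c"}, 2: {"b", "c", "d"}}
	fmt.Println(coverageBySum(4, m), coverageByDistinct(4, m))
}
```

With overlapping candidates the summed version can exceed 1.0, while the distinct version stays bounded by the number of query fingerprints.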
gosec (G118) flagged the missing defer even though the goroutine fires cancel on the happy path — defer guards the early-return paths.
Introduce a codeFeatures struct computed once per Check so the four Category-A signals share a single rune pass (line count, max line, whitespace, comment bytes) and the two Category-B signals share one collectIdents call. Adds a benchmark covering 1MB obfuscated and 256KB plain samples. On M1: obfuscated 1MB goes 142ms → 80ms (1.78x, allocs halved), plain 256KB 64ms → 50ms (1.29x, allocs -40%).
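The single-pass idea can be sketched roughly as follows. This is not the PR's `codeFeatures` struct: the field names are guesses, line length is measured in runes, and comment-byte counting is left out to keep the example short.

```go
package main

import (
	"fmt"
	"unicode"
)

// features holds values that several signals could share.
// Field names are hypothetical.
type features struct {
	LineCount  int
	MaxLineLen int // longest line, in runes, excluding the newline
	Whitespace int // runes for which unicode.IsSpace is true
}

// computeFeatures walks the source exactly once and fills every field,
// instead of letting each signal rescan the code on its own.
func computeFeatures(code string) features {
	var f features
	if code == "" {
		return f
	}
	f.LineCount = 1
	cur := 0
	for _, r := range code {
		if unicode.IsSpace(r) {
			f.Whitespace++
		}
		if r == '\n' {
			if cur > f.MaxLineLen {
				f.MaxLineLen = cur
			}
			cur = 0
			f.LineCount++
			continue
		}
		cur++
	}
	if cur > f.MaxLineLen {
		f.MaxLineLen = cur
	}
	return f
}

func main() {
	fmt.Printf("%+v\n", computeFeatures("ab c\n\tdefgh\n"))
}
```

Signals such as a whitespace ratio or an average line length can then be derived from the struct without touching the source again.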
ScriptCat wraps background/cron scripts in (async function(){ ... })() at
runtime, making top-level return and await legal. The fingerprint parser
treated the source as a standalone ECMAScript Script, so scripts using
either feature were rejected with "Illegal return statement" and marked
parse_status=failed, falling out of the similarity index.
parseAndNormalize now retries wrapped on parse failure and shifts token
positions back by the wrapper prefix length (clamped into the original
source range) so downstream match segments still point at real bytes.
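The position shift described above might look like the sketch below. The `Token` type, the exact wrapper prefix string, and the choice to clamp into `[0, srcLen-1]` are assumptions made for illustration; the actual parser code is not shown in this PR page.

```go
package main

import "fmt"

// Token is a hypothetical stand-in for the normalized token type; only the
// byte offset matters here.
type Token struct {
	Kind string
	Pos  int // byte offset into the text that was parsed
}

// wrapperPrefix is an assumed form of the async IIFE prefix used on retry.
const wrapperPrefix = "(async function(){\n"

// shiftToOriginal maps offsets from the wrapped text back onto the original
// source and clamps them so they never point outside it.
func shiftToOriginal(tokens []Token, prefixLen, srcLen int) []Token {
	maxPos := srcLen - 1
	if maxPos < 0 {
		maxPos = 0
	}
	out := make([]Token, len(tokens))
	for i, t := range tokens {
		pos := t.Pos - prefixLen
		if pos < 0 {
			pos = 0 // token belonged to the wrapper prefix
		}
		if pos > maxPos {
			pos = maxPos // token belonged to the wrapper suffix
		}
		t.Pos = pos
		out[i] = t
	}
	return out
}

func main() {
	src := "return 1;"
	p := len(wrapperPrefix)
	toks := []Token{{Kind: "keyword", Pos: 0}, {Kind: "keyword", Pos: p}, {Kind: "punct", Pos: p + 40}}
	fmt.Println(shiftToOriginal(toks, p, len(src)))
}
```

Clamping keeps tokens that originate in the wrapper itself from producing out-of-range match segments.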
Drop the 512KB auto-default on MaxCodeSize. scan.go already guards on `MaxCodeSize > 0`, so zero now means unlimited (bounded only by the API-level 10MB cap on script code). Default config example updated to 0 so fresh deployments index all scripts the backend will accept.
Adds GET /admin/similarity/parse-failures so operators can triage scripts that are invisible to similarity comparison. Default filter is parse_status=failed; pass status=2 to see skipped rows. Rescan uses the existing POST /admin/similarity/scan/:script_id, no new action required. Introduces FingerprintRepo.ListByParseStatus with ParseFailureFilter, the adminSvc.ListParseFailures handler composing script + user briefs, and wires the route into the admin middleware group.
Reset=true on /admin/similarity/backfill previously only zeroed the cursor but left the Scan code_hash short-circuit intact, so every rescanned script no-oped with "code unchanged, skipping" and the admin saw no effect. Thread a force flag from TriggerBackfill → RunBackfill → SimilarityScanMsg → consumer → ScanSvc.Scan. When force=true the short-circuit is bypassed so extraction, ES indexing, and pair upsert all run again. Patrol and the publish/update script events keep force=false to stay idempotent.
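A rough sketch of the short-circuit with a force override is shown below. The function names are hypothetical, and SHA-256 hex is assumed as the code_hash format only because `sha256HexString` appears elsewhere in this PR; the real Scan implementation may differ.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashCode returns a hex SHA-256 of the script source (assumed hash format).
func hashCode(code string) string {
	sum := sha256.Sum256([]byte(code))
	return hex.EncodeToString(sum[:])
}

// shouldSkipScan reports whether a scan may stop early because the stored
// hash matches the current code. A forced scan never skips.
func shouldSkipScan(storedHash, code string, force bool) bool {
	if force {
		return false
	}
	return storedHash != "" && storedHash == hashCode(code)
}

func main() {
	code := "console.log('hi')"
	stored := hashCode(code)
	fmt.Println(shouldSkipScan(stored, code, false)) // unchanged code, normal scan
	fmt.Println(shouldSkipScan(stored, code, true))  // unchanged code, forced backfill
}
```

Keeping force=false as the default on patrol and publish/update events preserves the cheap no-op for unchanged code.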
…ments IntegritySvc.Check now logs final score, per-category breakdown, and hit signal names so ops can trace why a given script landed in a specific zone. RecordWarning surfaces marshal/upsert failures with full context. BuildMatchSegments logs each load step (fingerprint row, ES positions) and the final segment count, making the evidence-page build path debuggable without attaching a debugger.
Two related issues on the admin similar-pairs view:
1. Soft-deleted scripts kept showing in the pair list with no indication. Per spec §4.6 we deliberately preserve the underlying fingerprint as evidence, so instead of cascading the delete we surface the state: ScriptBrief now exposes IsDeleted, and ListPairsRequest accepts an ExcludeDeleted toggle that JOINs cdb_tampermonkey_script to filter out pairs where either side is in DELETE status.
2. After a script's code changes such that an old pair drops below the Jaccard threshold, the row in pre_script_similar_pair was never touched again and lingered as a zombie. Scan() now calls DeletePendingByScriptID right after candidate lookup so any pair that's still similar gets re-Upserted by step 11 while obsolete pending rows disappear. Whitelisted / reviewed pairs are preserved because those statuses are explicit admin decisions.
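The delete-then-re-upsert flow for pending pairs can be modeled in memory as below. Everything here (the map-backed store, the status constants, the pair normalization by ordering IDs) is a hypothetical sketch, not the repository code.

```go
package main

import "fmt"

type pairStatus int

const (
	statusPending pairStatus = iota
	statusWhitelisted
	statusReviewed
)

// pairKey stores the two script IDs in ascending order so (a,b) == (b,a).
type pairKey struct{ A, B int64 }

func normalizePair(a, b int64) pairKey {
	if a > b {
		a, b = b, a
	}
	return pairKey{A: a, B: b}
}

type pairStore map[pairKey]pairStatus

// deletePendingByScriptID drops only pending rows touching the script;
// whitelisted and reviewed rows are explicit admin decisions and stay.
func (s pairStore) deletePendingByScriptID(id int64) {
	for k, st := range s {
		if st == statusPending && (k.A == id || k.B == id) {
			delete(s, k)
		}
	}
}

// upsertPending inserts a pending row unless a row already exists.
func (s pairStore) upsertPending(a, b int64) {
	k := normalizePair(a, b)
	if _, exists := s[k]; !exists {
		s[k] = statusPending
	}
}

// rescan clears stale pending pairs, then re-creates the ones that still
// pass the similarity threshold.
func rescan(s pairStore, scriptID int64, stillSimilar []int64) {
	s.deletePendingByScriptID(scriptID)
	for _, other := range stillSimilar {
		s.upsertPending(scriptID, other)
	}
}

func main() {
	s := pairStore{
		normalizePair(1, 2): statusPending,
		normalizePair(1, 3): statusPending,
		normalizePair(1, 4): statusWhitelisted,
	}
	rescan(s, 1, []int64{3})
	fmt.Println(len(s))
}
```

In a real database the delete and the upserts would presumably need to run in a way that tolerates a failed scan in between, which this sketch does not model.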
walkNode only handled an ES5 subset and dropped to a single KindUnknown for any unrecognized AST node, so any modern userscript starting with a top-level `class` (or built around let/const, arrows, async/await, template literals, destructuring, etc.) collapsed to under 14 tokens and tripped the `too_few_fingerprints` skip in scan — leaving stale similar pairs frozen forever. Rewrite walkNode to cover the full goja AST: classes (including private fields, static blocks, methods, getters), lexical declarations, arrow functions, template literals, await/yield, try/catch/throw, switch/case, for-in/for-of, do-while, with, optional chaining, spread/rest, destructuring patterns, sequence + conditional + unary expressions, new, super, this, meta-property, and PropertyKeyed/Short (which the old object-literal walker had been silently turning into KindUnknown). Also plug the scan early-exit cleanup hole: when scan bails out at any of the five guard paths (soft-deleted / oversized / parse-failed / too-few-fingerprints / non-active), still purge pending pairs touching this script. Otherwise scripts that *used to* match leave their old pairs visible forever, since no later scan reaches step 10b for them. Tested with testdata/1.js (ScriptCat OCS helper, 59KB, 1335 lines): fingerprints went from 1 to 866, total tokens from <14 to 4705. walkNode coverage 63.5% -> 85.6%, purgePendingPairs 50% -> 100%.
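The HTTP request now runs only the fast signals (precomputed Cat A + known-packer Cat D); the slow regex-based signals (identifier extraction, comment statistics, string-array detection, etc.) are handled asynchronously by the similarity.scan NSQ consumer, so publishing large scripts no longer times out.
- Add CheckFast(): a known-packer signature match blocks immediately (score=1.0)
- scan.go step 2b: asynchronous integrity check + auto-archive + warning recording
- Remove the deprecated integrity.warning message-queue flow
- Add the integrity_async_auto_archive config option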
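These two encodings are extremely obscure and real malicious scripts almost never use them; their code characteristics are already covered by other signals (single-character identifier ratio, whitespace ratio, etc.), so no dedicated detection is needed.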
Pull request overview
This PR introduces the plumbing for a script similarity detection system plus an integrity (minify/obfuscation) pre-check, including persistence tables, NSQ topics/consumers, cron-driven patrol/backfill jobs, and admin/evidence endpoints.
Changes:
- Add DB migrations + new similarity/integrity entities and repositories (MySQL + Elasticsearch index init).
- Add similarity.scan producer/consumer, hard-delete purge consumer, and cron handlers for patrol/backfill + stop-fingerprint refresh.
- Add integrity fast pre-check into script create/update flows, plus config (YAML + DB overrides) and routing/controller wiring.
Reviewed changes
Copilot reviewed 103 out of 107 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| migrations/init.go | Registers new migration. |
| migrations/20260414.go | Creates/drops similarity tables. |
| internal/task/producer/topic.go | Adds similarity.scan topic. |
| internal/task/producer/similarity.go | Producer + subscribe helpers. |
| internal/task/producer/similarity_test.go | Round-trip msg parsing test. |
| internal/task/producer/script.go | Adds HardDelete flag to delete msg. |
| internal/task/crontab/handler/similarity_stop_fp.go | Stop-fingerprint refresh job. |
| internal/task/crontab/handler/similarity_stop_fp_test.go | Stop-fp handler unit tests. |
| internal/task/crontab/handler/similarity_patrol.go | Patrol + backfill cron handler. |
| internal/task/crontab/handler/similarity_patrol_test.go | Patrol/backfill unit tests. |
| internal/task/crontab/crontab.go | Registers similarity cron handlers. |
| internal/task/consumer/subscribe/similarity_scan.go | Consumer for similarity.scan. |
| internal/task/consumer/subscribe/similarity_scan_test.go | Consumer dispatch/force tests. |
| internal/task/consumer/subscribe/similarity_purge.go | Hard-delete purge consumer. |
| internal/task/consumer/subscribe/similarity_purge_test.go | Purge consumer tests. |
| internal/task/consumer/consumer.go | Registers new subscribers. |
| internal/service/similarity_svc/testdata/reorder_pair/original.js | Similarity fixtures. |
| internal/service/similarity_svc/testdata/reorder_pair/reordered.js | Similarity fixtures. |
| internal/service/similarity_svc/testdata/rename_pair/original.js | Similarity fixtures. |
| internal/service/similarity_svc/testdata/rename_pair/renamed.js | Similarity fixtures. |
| internal/service/similarity_svc/testdata/different_pair/a.js | Similarity fixtures. |
| internal/service/similarity_svc/testdata/different_pair/b.js | Similarity fixtures. |
| internal/service/similarity_svc/testdata/integrity/normal/plain_userscript.js | Integrity fixtures. |
| internal/service/similarity_svc/testdata/integrity/normal/embedded_small_lib.js | Integrity fixtures. |
| internal/service/similarity_svc/testdata/integrity/minified/uglify_output.js | Integrity fixtures. |
| internal/service/similarity_svc/testdata/integrity/minified/terser_output.js | Integrity fixtures. |
| internal/service/similarity_svc/testdata/integrity/packed/dean_edwards_packer.js | Integrity fixtures. |
| internal/service/similarity_svc/testdata/integrity/obfuscated/obfuscator_io_level1.js | Integrity fixtures. |
| internal/service/similarity_svc/testdata/integrity/obfuscated/obfuscator_io_level4.js | Integrity fixtures. |
| internal/service/similarity_svc/testdata/integrity/borderline/has_vendored_json.js | Integrity fixtures. |
| internal/service/similarity_svc/purge.go | Purge cascade implementation. |
| internal/service/similarity_svc/purge_test.go | Purge cascade tests. |
| internal/service/similarity_svc/pending_warning.go | Integrity result types. |
| internal/service/similarity_svc/mock/scan.go | Generated scan mock. |
| internal/service/similarity_svc/mock/integrity.go | Generated integrity mock. |
| internal/service/similarity_svc/match_segments.go | Build UI match segments. |
| internal/service/similarity_svc/match_segments_test.go | Match segment tests. |
| internal/service/similarity_svc/integrity_signals.go | Integrity signals implementation. |
| internal/service/similarity_svc/integrity_signals_test.go | Signal unit tests. |
| internal/service/similarity_svc/integrity.go | Integrity service + messaging. |
| internal/service/similarity_svc/integrity_test.go | Integrity end-to-end tests. |
| internal/service/similarity_svc/integrity_bench_test.go | Integrity benchmarks. |
| internal/service/similarity_svc/doc.go | Package-level docs. |
| internal/service/similarity_svc/backfill_state.go | Backfill state + Redis lock. |
| internal/service/similarity_svc/backfill_state_test.go | Backfill state tests. |
| internal/service/similarity_svc/admin_backfill.go | Admin backfill/manual scan hooks. |
| internal/service/similarity_svc/access.go | Evidence access middleware. |
| internal/service/similarity_svc/access_test.go | Access service smoke test. |
| internal/service/script_svc/script.go | Integrates integrity gate + scan publish. |
| internal/repository/similarity_repo/fingerprint.go | Fingerprint MySQL repo. |
| internal/repository/similarity_repo/fingerprint_test.go | Repo shape test. |
| internal/repository/similarity_repo/fingerprint_es_init.go | ES index create helper. |
| internal/repository/similarity_repo/fingerprint_es_test.go | ES query-body tests. |
| internal/repository/similarity_repo/patrol_query.go | Patrol/backfill SQL repo. |
| internal/repository/similarity_repo/similar_pair.go | Pair repo + normalization. |
| internal/repository/similarity_repo/suspect_summary.go | Suspect summary repo. |
| internal/repository/similarity_repo/similarity_whitelist.go | Pair whitelist repo. |
| internal/repository/similarity_repo/integrity_whitelist.go | Integrity whitelist repo. |
| internal/repository/similarity_repo/integrity_review.go | Integrity review queue repo. |
| internal/repository/similarity_repo/*_test.go | Repo/interface shape tests. |
| internal/repository/similarity_repo/mock/*.go | Generated repo mocks. |
| internal/repository/similarity_repo/doc.go | Repo package docs. |
| internal/repository/script_repo/script_code.go | Adds FindByIDIncludeDeleted. |
| internal/repository/script_repo/mock/script_code.go | Updates mock for new method. |
| internal/pkg/code/code.go | Adds similarity error codes. |
| internal/pkg/code/zh_cn.go | Adds similarity zh-CN messages. |
| internal/model/entity/similarity_entity/*.go | New similarity/integrity entities. |
| internal/controller/similarity_ctr/similarity.go | Similarity controller methods. |
| internal/api/router.go | Wires admin + evidence routes. |
| configs/db_provider.go | Adds typed getters (bool/float/int). |
| configs/db_provider_test.go | Adds DB provider tests. |
| configs/config.go | Adds SimilarityConfig + defaults/overrides + validate hook. |
| configs/config.yaml.example | Adds similarity config examples. |
| cmd/app/main.go | Registers similarity repos/services + ensures ES index. |
| go.mod | Adds deps for similarity/integrity. |
| go.sum | Updates dependency checksums. |
| .golangci.yml | Updates lint exclusions. |
| .gitignore | Ignores .omc directory. |

    // similarity.scan_enabled=true requires an elasticsearch address (cago reads the elasticsearch.address list)
    if cfg.Bool(ctx, "similarity.scan_enabled") {
        var esAddress []string
        _ = cfg.Scan(ctx, "elasticsearch.address", &esAddress)
        if len(esAddress) == 0 {
            return fmt.Errorf("similarity.scan_enabled=true requires elasticsearch.address to be set")
        }
    }

Validate() checks cfg.Bool("similarity.scan_enabled") to decide whether Elasticsearch must be configured, but Similarity() defaults ScanEnabled=true even when the YAML key is absent. This can let the app start without elasticsearch.address while similarity scanning is effectively enabled (and main.go later calls EnsureFingerprintIndex based on Similarity().ScanEnabled). Consider basing this check on Similarity().ScanEnabled (or otherwise applying the same defaulting logic as Similarity()) so startup validation matches runtime behavior.
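One way to keep validation and runtime in agreement is to route both through a single defaulting helper. The sketch below models the YAML as a plain map; `scanEnabled`, `validate`, and the default of true for a missing key (taken from the comment above) are illustrative and do not use the real cago config API.

```go
package main

import (
	"errors"
	"fmt"
)

const scanEnabledKey = "similarity.scan_enabled"

// scanEnabled applies the same default the runtime is said to use:
// a missing key means scanning is on.
func scanEnabled(raw map[string]bool) bool {
	if v, ok := raw[scanEnabledKey]; ok {
		return v
	}
	return true
}

// validate uses the defaulted value, so an absent key still requires
// an Elasticsearch address.
func validate(raw map[string]bool, esAddress []string) error {
	if scanEnabled(raw) && len(esAddress) == 0 {
		return errors.New("similarity scanning is enabled (explicitly or by default) but elasticsearch.address is empty")
	}
	return nil
}

func main() {
	fmt.Println(validate(map[string]bool{}, nil))
	fmt.Println(validate(map[string]bool{scanEnabledKey: false}, nil))
}
```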

    // Integrity pre-check (runs only the fast signals; slow signals are handled asynchronously by the similarity scan consumer)
    if similarity_svc.IntegrityEnabled() && similarity_svc.Integrity() != nil && req.Code != "" {
        latest, _ := script_repo.ScriptCode().FindLatest(ctx, script.ID, 0, true)
        var existingHash string
        if latest != nil {
            existingHash = sha256HexString(latest.Code)
        }
        newHash := sha256HexString(req.Code)
        if newHash != existingHash {
            whitelisted, _ := similarity_svc.Integrity().IsWhitelisted(ctx, script.ID)
            if !whitelisted {

UpdateCode's integrity pre-check ignores errors from ScriptCode().FindLatest and Integrity().IsWhitelisted (both assigned to _). If either call fails transiently, the code may treat the script as not whitelisted / changed and incorrectly block an update (400) instead of surfacing a server error or skipping the integrity gate. Handle these errors explicitly (e.g., return the error, or fail-open with a warning log depending on desired policy) to avoid false rejections.
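A sketch of the "return the error" policy is given below, with the two lookups injected as functions so it runs standalone. `gateDeps` and `shouldRunIntegrityGate` are hypothetical names, and the code compares source strings directly where the real code compares SHA-256 hashes.

```go
package main

import (
	"errors"
	"fmt"
)

// errTransient simulates a temporary database or cache failure.
var errTransient = errors.New("transient failure")

// gateDeps injects the two lookups whose errors were being discarded.
type gateDeps struct {
	FindLatest    func() (code string, found bool, err error)
	IsWhitelisted func() (bool, error)
}

// shouldRunIntegrityGate surfaces lookup errors instead of treating them
// as "changed" or "not whitelisted".
func shouldRunIntegrityGate(d gateDeps, newCode string) (bool, error) {
	latest, found, err := d.FindLatest()
	if err != nil {
		return false, fmt.Errorf("integrity pre-check: find latest code: %w", err)
	}
	if found && latest == newCode {
		return false, nil // unchanged code, nothing to check
	}
	whitelisted, err := d.IsWhitelisted()
	if err != nil {
		return false, fmt.Errorf("integrity pre-check: whitelist lookup: %w", err)
	}
	return !whitelisted, nil
}

func main() {
	d := gateDeps{
		FindLatest:    func() (string, bool, error) { return "", false, errTransient },
		IsWhitelisted: func() (bool, error) { return false, nil },
	}
	run, err := shouldRunIntegrityGate(d, "console.log(1)")
	fmt.Println(run, err)
}
```

A fail-open variant would log the error and return `false, nil` instead; which policy fits depends on how strict the gate is meant to be.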

    ok, release, err := h.acquireBackfillLock(ctx)
    if err != nil || !ok {
        logger.Ctx(ctx).Warn("similarity backfill: redis lock unavailable",
            zap.Bool("ok", ok), zap.Error(err))
        return nil
    }

RunBackfill returns nil when acquireBackfillLock returns an error (it checks if err != nil || !ok { ...; return nil }). This suppresses real Redis failures and makes backfill runs silently no-op, which is hard to detect/alert on. Consider returning the error when err != nil, and only treating !ok (lock held) as a nil/no-op path.
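The suggested split can be sketched like this, with the lock acquisition injected as a function so the example runs without Redis. `acquireFunc` and `runBackfill` are illustrative stand-ins, not the handler's real signatures.

```go
package main

import (
	"errors"
	"fmt"
)

// errRedisDown simulates a real Redis failure.
var errRedisDown = errors.New("redis unavailable")

// acquireFunc is a hypothetical stand-in for h.acquireBackfillLock.
type acquireFunc func() (ok bool, release func(), err error)

// runBackfill returns an error for infrastructure failures and treats only
// "lock already held" as a silent no-op. The bool reports whether work ran.
func runBackfill(acquire acquireFunc, work func() error) (bool, error) {
	ok, release, err := acquire()
	if err != nil {
		return false, fmt.Errorf("similarity backfill: acquire lock: %w", err)
	}
	if !ok {
		return false, nil // another run holds the lock
	}
	defer release()
	return true, work()
}

func main() {
	failing := func() (bool, func(), error) { return false, nil, errRedisDown }
	ran, err := runBackfill(failing, func() error { return nil })
	fmt.Println(ran, err)
}
```

Returning the wrapped error lets the caller log or alert on Redis outages while still keeping the contended-lock path quiet.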
Code review
Found 7 issues (issues 1-3 confirm Copilot's findings, 4-7 are additional):
Fix: use Lines 239 to 246 in 074e812
scriptlist/internal/service/script_svc/script.go Lines 453 to 472 in 074e812
When Fix: split the condition — return scriptlist/internal/task/crontab/handler/similarity_patrol.go Lines 168 to 175 in 074e812
The same Redis key scriptlist/internal/service/similarity_svc/scan.go Lines 76 to 78 in 074e812 scriptlist/internal/task/crontab/handler/similarity_stop_fp.go Lines 16 to 18 in 074e812
Both scriptlist/internal/service/script_svc/script.go Lines 312 to 318 in 074e812 scriptlist/internal/pkg/code/zh_cn.go Lines 151 to 152 in 074e812
scriptlist/internal/service/similarity_svc/scan.go Lines 316 to 321 in 074e812
scriptlist/internal/service/script_svc/script.go Lines 1026 to 1030 in 074e812
🤖 Generated with Claude Code
… constant Export StopFpRedisKey from similarity_svc (the domain owner) and remove the duplicate local definition from the stop-fp crontab handler, so a key change in either place can no longer silently break the other.
…Code integrity check
…ayer Extract Redis lock and system_config cursor primitives from similarity_svc into a new BackfillStateRepo interface in similarity_repo, following the project's service-locator convention. Service layer retains business logic; tests updated to use MockBackfillStateRepo instead of function-var faking.