fix(scrapy)!: serialize requests and HTTP cache as JSON instead of pickle#951

Merged
vdusek merged 12 commits into master from fix/scrapy-json-serialization on Jun 10, 2026

@vdusek (Contributor) commented Jun 9, 2026

Scrapy requests (in the Apify request queue) and cached responses (in the Scrapy HTTP cache) were serialized with pickle, but both storages hold JSON. This switches them to JSON via a shared serializer (apify/scrapy/_serialization.py).

Request serialization:

  • Binary fields (body, headers) are base64-encoded; pydantic models such as Crawlee's UserData are dumped via model_dump.
  • A non-JSON-serializable meta/cb_kwargs is logged and the request skipped, instead of being silently corrupted.
  • A serialized _class is honored only when it is already imported as a scrapy.Request subclass.
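
A minimal sketch of the request-serialization rules above. It uses a plain dict in place of `scrapy.Request` so that it runs with the standard library only. The names `encode_request` and `decode_request` and the dict layout are hypothetical and are not taken from `apify/scrapy/_serialization.py`. Header and pydantic handling are omitted for brevity.

```python
import base64
import json
from typing import Optional


def encode_request(request: dict) -> Optional[str]:
    """Serialize a request-like dict to JSON, or return None to skip it."""
    payload = dict(request)
    # Only the known binary field is base64-encoded (hypothetical layout).
    payload["body"] = base64.b64encode(request.get("body", b"")).decode("ascii")
    try:
        return json.dumps(payload, ensure_ascii=False)
    except (TypeError, ValueError) as exc:
        # Non-JSON-serializable meta/cb_kwargs: report and skip the request
        # instead of storing a corrupted value.
        print(f"Skipping request {request.get('url')!r}: {exc}")
        return None


def decode_request(raw: str) -> dict:
    """Reverse encode_request: parse the JSON and restore the binary body."""
    payload = json.loads(raw)
    payload["body"] = base64.b64decode(payload["body"])
    return payload
```

In this sketch a request whose `meta` holds, say, an open file handle makes `encode_request` return `None`, so the caller can skip it.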

HTTP cache, so existing caches survive the upgrade:

  • A legacy pickle entry fails to load, so reads treat it as a cache miss and re-fetch it, instead of crashing the download.
  • The cleanup sweep now deletes stale, expired, or malformed entries. It previously overwrote them with set_value(..., None), which under apify-client v3 stores JSON null rather than deleting, leaking dead records.
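
The two cache behaviours above could be sketched roughly as follows. A plain `dict` stands in for the key-value store, and the entry fields (`time`, `status`, `url`) and function names are assumptions made for illustration, not the actual layout used by `_httpcache.py`.

```python
import json
import time
from typing import Optional


def read_cache_entry(raw: Optional[bytes], max_age_s: float,
                     now: Optional[float] = None) -> Optional[dict]:
    """Return the entry, or None (a cache miss) if it is absent, unreadable, or expired."""
    if raw is None:
        return None
    now = time.time() if now is None else now
    try:
        entry = json.loads(raw)
        # Field reads stay inside the guard so a missing key is a miss, not a crash.
        if now - entry["time"] > max_age_s:
            return None
        return {"status": entry["status"], "url": entry["url"]}
    except (ValueError, KeyError, TypeError):
        # Legacy pickle bytes and malformed JSON both end up here.
        return None


def sweep(store: dict, max_age_s: float, now: float) -> None:
    """Delete unreadable or expired records instead of overwriting them with None."""
    for key in list(store):
        if read_cache_entry(store[key], max_age_s, now) is None:
            del store[key]
```

Deleting the key, rather than writing `None` back, is what keeps dead records from accumulating in this sketch.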

BREAKING CHANGE: requests and HTTP cache entries are now stored as JSON, not pickle. Entries written by older versions are ignored (re-fetched), not read.

@vdusek added the adhoc (Ad-hoc unplanned task added during the sprint.) and t-tooling (Issues with this label are in the ownership of the tooling team.) labels Jun 9, 2026
@vdusek self-assigned this Jun 9, 2026
@github-actions Bot added this to the 142nd sprint - Tooling team milestone Jun 9, 2026
@github-actions Bot added the tested (Temporary label used only programmatically for some analytics.) label Jun 9, 2026
codecov Bot commented Jun 9, 2026

Codecov Report

❌ Patch coverage is 99.14530% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 89.94%. Comparing base (03f97a3) to head (d404786).
⚠️ Report is 2 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/apify/scrapy/extensions/_httpcache.py | 96.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #951      +/-   ##
==========================================
+ Coverage   87.08%   89.94%   +2.86%     
==========================================
  Files          48       49       +1     
  Lines        2973     3053      +80     
==========================================
+ Hits         2589     2746     +157     
+ Misses        384      307      -77     

| Flag | Coverage Δ |
| --- | --- |
| e2e | 35.66% <0.00%> (-0.96%) ⬇️ |
| integration | 56.86% <0.00%> (-1.54%) ⬇️ |
| unit | 78.64% <99.14%> (+3.16%) ⬆️ |

…ckle

Scrapy requests stored in the Apify request queue and responses stored in the
Scrapy HTTP cache were serialized with pickle. Those storages hold JSON, and
pickle reconstructs a Python object graph from the stored bytes. Serialize them
as JSON instead, via a single shared serializer (`_serialization.py`) used by
both the request converter and the HTTP cache.

Only the known binary fields (`body`, `headers`) are base64-encoded; pydantic
models are dumped via `model_dump`. A non-JSON-serializable `meta`/`cb_kwargs`
is reported and the request skipped rather than silently corrupted. A `_class`
entry is only honored when already imported as a `scrapy.Request` subclass, and
malformed or legacy (pickle-format) cache entries are treated as a cache miss
so reads do not crash after the upgrade.

BREAKING CHANGE: requests and HTTP cache entries are now stored as JSON, not
pickle. Entries written by older versions are ignored (re-fetched), not read.

@vdusek force-pushed the fix/scrapy-json-serialization branch from a24fd73 to d73fcb4 on June 9, 2026 14:30
@vdusek requested a review from Pijukatel June 9, 2026 15:18
@vdusek marked this pull request as ready for review June 9, 2026 15:18

Follow-up to the pickle -> JSON switch, closing the data-loss paths and
quality issues raised in review:

- Resolve a request `_class` by importing it on demand (then validating it is
  a `scrapy.Request` subclass) so custom or lazily-imported subclasses survive
  an Actor migration instead of being dropped.
- Keep the `retrieve_response` field reads inside the crash guard so a cache
  entry missing a key degrades to a miss rather than raising `KeyError`.
- Reconstruct a request before marking it handled in the scheduler (still
  consuming unrecoverable entries so the queue cannot loop on them).
- Encode a `str` request body symmetrically (as UTF-8 bytes) and serialize with
  `ensure_ascii=False` to avoid bloating non-ASCII text.
- Store the serialized request as plain JSON, dropping the redundant outer
  base64 layer.
- Name the offending value when `meta`/`cb_kwargs` is not JSON-serializable and
  document JSON's tuple/dict-key coercions and the non-UTF-8 header trade-off.
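
As an illustration of the first bullet, resolving a stored `_class` by importing it on demand and then validating it might look like the sketch below. `BaseRequest` is a stand-in for `scrapy.Request` so the example runs without Scrapy installed, and `resolve_request_class` is a hypothetical name, not the function used in this PR.

```python
import importlib
from typing import Optional


class BaseRequest:
    """Stand-in for scrapy.Request (assumption, for a self-contained example)."""


def resolve_request_class(dotted_path: str, base: type = BaseRequest) -> Optional[type]:
    """Import `module.ClassName` on demand; accept it only if it subclasses `base`."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        return None
    try:
        candidate = getattr(importlib.import_module(module_name), class_name)
    except (ImportError, AttributeError, ValueError, TypeError):
        # Unknown module, missing attribute, or a malformed path: not resolvable.
        return None
    if isinstance(candidate, type) and issubclass(candidate, base):
        return candidate
    return None
```

A caller would fall back to the base request class when this returns `None`. Note that in this sketch the subclass check runs only after the module has been imported.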

Adds serializer tests, an HTTP-cache cleanup-sweep test, a cross-process
request-reconstruction test, and documents `APIFY_HTTPCACHE_EXPIRATION_MAX_ITEMS`.
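
The JSON coercions and the `ensure_ascii=False` choice mentioned in the list above can be seen with the standard `json` module alone. The sample values here are made up for illustration.

```python
import json

# A meta-like dict with a tuple value and a non-string key.
meta = {"pair": (1, 2), 3: "int key"}
restored = json.loads(json.dumps(meta))
# After a round trip the tuple is a list and the int key is a string:
# restored == {"pair": [1, 2], "3": "int key"}

# By default every non-ASCII character is written as a \uXXXX escape;
# ensure_ascii=False keeps the character itself, which is shorter.
escaped = json.dumps("café")
plain = json.dumps("café", ensure_ascii=False)
print(len(escaped), len(plain))
```

Code that relies on tuple identity or integer keys in `meta`/`cb_kwargs` therefore needs to tolerate the list and string forms after a round trip.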

@vdusek requested a review from szaganek as a code owner June 10, 2026 10:29
Comment thread docs/03_guides/06_scrapy.mdx (outdated)
@vdusek force-pushed the fix/scrapy-json-serialization branch from d0a8c20 to eabdcbd on June 10, 2026 11:19
@vdusek merged commit a87e8d1 into master Jun 10, 2026
28 checks passed
@vdusek deleted the fix/scrapy-json-serialization branch June 10, 2026 12:49
4 participants