48 commits
af8f411
feat: add bulk scrape functionality
dan-and Sep 23, 2025
7796f32
fix: update axios to 1.12.2 for security vulnerability
dan-and Sep 23, 2025
853dfee
feat: improve robots.txt filtering and URL validation
dan-and Sep 24, 2025
9da8569
feat: add scrapeId to document metadata for traceability
dan-and Sep 24, 2025
c4aa67f
feat: add Docker ulimit configuration for better performance
dan-and Sep 24, 2025
fe075a4
fix: respect limit parameter in sitemap processing
dan-and Sep 24, 2025
4480a2e
fix: implement visited_unique tracking for limit enforcement
dan-and Sep 24, 2025
d238fdd
fix: status handling improvements on V0 and V1 api
dan-and Sep 24, 2025
c000568
feat: implement structured logging with Winston
dan-and Sep 24, 2025
fa316b2
refactor: remove unnecessary logs and clean up console statements
dan-and Sep 24, 2025
9381721
fix: increase Docker ulimit -n for better performance
dan-and Sep 24, 2025
66a869c
feat: add URL deduplication and redis connection optimization
dan-and Sep 24, 2025
ef3077f
feat(scraper): detect and skip PDF content in fetch and playwright sc…
dan-and Feb 25, 2026
1a76375
feat: complete all Tier 1 migration tasks from upstream firecrawl-ori…
dan-and Feb 25, 2026
feacb7b
fix(scraper): default to absolute path replacement and fix relative U…
dan-and Feb 25, 2026
aa3be34
T2-D: Switch UUID generation from v4 to v7
dan-and Feb 25, 2026
30e0571
T2-E: Disable markdown conversion for crawler link extraction
dan-and Feb 25, 2026
6080a3d
perf: add separate 15s timeout for plain HTTP fetch (vs 60s for browser)
dan-and Feb 26, 2026
aad227a
fix: adjust queue behavior for better job handling
dan-and Feb 26, 2026
ef29dd6
Implement T1-N and T1-O: Performance optimizations for scraping
dan-and Feb 26, 2026
1626fd8
Implement T1-P (CycleTLS) and T1-Q (html-to-markdown fork)
dan-and Feb 26, 2026
ee8888d
feat(scraper): add Google Docs/Slides/Sheets/Drive URL rewriting
dan-and Feb 26, 2026
86e1c16
fix(docker): add CA certificate bundle to resolve HTTPS errors
dan-and Feb 26, 2026
d9e1333
fix(scraper): add 415 terminal status and sitemap URL limit
dan-and Feb 26, 2026
d02e9e5
feat(controller): add actions guard and markdown size guard for large…
dan-and Feb 26, 2026
23c3b23
feat(engine): add domain-pattern-based engine forcing
dan-and Feb 26, 2026
b8bbfa2
feat(proxy): add engine forcing via proxy parameter and improve error…
dan-and Feb 26, 2026
a5b31dd
feat(hero): upgrade from alpha.31 to alpha.34 for Chrome 139 emulation
dan-and Feb 26, 2026
df74169
feat(scrape): add queueDurationMs to response metadata
dan-and Feb 26, 2026
679b3e9
feat(WebScraper): skip Redis cache for screenshot requests
dan-and Feb 26, 2026
9911f14
feat(security): block private/internal IPs to prevent SSRF
dan-and Feb 26, 2026
d41625c
fix(hero): add configurable startup timeout for Hero service
dan-and Feb 26, 2026
3e15c77
feat(map): add ignoreCache parameter to /v1/map endpoint
dan-and Feb 26, 2026
62cd188
fix(single_url): add clear error message for screenshot format on PDF…
dan-and Feb 26, 2026
625c9df
feat(crawler): add sitemapOnly option to restrict crawling to sitemap…
dan-and Feb 27, 2026
b396064
feat(scrape): add browser-action execution support in Hero service
dan-and Feb 27, 2026
d308870
feat(scrapers): add DOCX and legacy .doc document scraping
dan-and Feb 27, 2026
84418fd
feat(cache): add minAge cache freshness filter
dan-and Feb 27, 2026
dd2a190
feat(worker): add runpod pod id logging to sigterm handler
dan-and Feb 27, 2026
f6c2ad1
feat(scraper): add XLSX/Excel spreadsheet scraping
dan-and Feb 27, 2026
ccfb1d6
chore(puppeteer-service): remove dead first FROM stage from Dockerfile
dan-and Feb 28, 2026
d68be66
chore(renovate): add renovate.json with Node LTS 22 pin
dan-and Feb 28, 2026
9e1faf1
feat(scrape): add html and links as independently requestable formats
dan-and Mar 1, 2026
a55a813
fix(scraper): fix GDPR consent wall producing raw HTML in markdown field
dan-and Mar 1, 2026
264cc07
test(scrape): add asserting format isolation tests
dan-and Mar 1, 2026
62ea34b
chore(deploy): configure Traefik routing and tighten resource limits
dan-and Mar 1, 2026
48bb52e
fix(scraper): avoid bot-blocking 404s by using Chrome UA and retrying…
dan-and Mar 1, 2026
9d8a0ac
chore(deploy): rename Traefik router and service labels from firecraw…
dan-and Mar 1, 2026
28 changes: 25 additions & 3 deletions README.md
@@ -64,6 +64,7 @@ services:
- REDIS_URL=${FIRECRAWL_REDIS_URL:-redis://redis:6379}
- REDIS_RATE_LIMIT_URL=${FIRECRAWL_REDIS_URL:-redis://redis:6379}
- PLAYWRIGHT_MICROSERVICE_URL=${PLAYWRIGHT_MICROSERVICE_URL:-http://playwright-service:3000}
- TLS_CLIENT_ENABLED=${TLS_CLIENT_ENABLED:-true}
- PORT=${PORT:-3002}
- NUM_WORKERS_PER_QUEUE=${NUM_WORKERS_PER_QUEUE}
- BULL_AUTH_KEY=${BULL_AUTH_KEY}
@@ -88,6 +89,7 @@ services:
- REDIS_URL=${FIRECRAWL_REDIS_URL:-redis://redis:6379}
- REDIS_RATE_LIMIT_URL=${FIRECRAWL_REDIS_URL:-redis://redis:6379}
- PLAYWRIGHT_MICROSERVICE_URL=${PLAYWRIGHT_MICROSERVICE_URL:-http://playwright-service:3000}
- TLS_CLIENT_ENABLED=${TLS_CLIENT_ENABLED:-true}
- PORT=${PORT:-3002}
- NUM_WORKERS_PER_QUEUE=${NUM_WORKERS_PER_QUEUE}
- BULL_AUTH_KEY=${BULL_AUTH_KEY}
@@ -127,13 +129,33 @@ Firecrawl simple works as follows:
4. URLs that have not already been scraped and that match the `include` and `exclude` criteria are extracted from the scraped HTML and added to the queue by each worker
5. Steps 2-4 continue until no new links are found or the `limit` specified on the crawl is reached
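
The crawl loop in the steps above can be sketched as a breadth-first queue with a visited set. This is a hypothetical TypeScript sketch for illustration; the function and type names are not the actual Firecrawl Simple code:

```typescript
// Hypothetical sketch of the crawl loop described above.
type ScrapeFn = (url: string) => string[]; // returns links found on the page

function crawl(
  seed: string,
  scrape: ScrapeFn,
  matches: (url: string) => boolean, // include/exclude criteria
  limit: number,
): string[] {
  const visited = new Set<string>();
  const queue: string[] = [seed];
  while (queue.length > 0 && visited.size < limit) {
    const url = queue.shift()!;
    if (visited.has(url)) continue; // unique-URL tracking for limit enforcement
    visited.add(url);
    for (const link of scrape(url)) {
      // only enqueue new URLs that pass the include/exclude filters
      if (!visited.has(link) && matches(link)) queue.push(link);
    }
  }
  return [...visited];
}
```

The loop terminates when the queue drains (no new links) or when `visited.size` reaches `limit`, mirroring step 5.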

### Scraper Technologies

Firecrawl Simple uses a three-tier scraping fallback system to maximize success rate while minimizing resource usage:

| Scraper Name | Technology | Description | When Used |
|---------------|------------|-------------|------------|
| **fetch** | `undici.request()` (Node.js 22+) | Plain HTTP/HTTPS requests using Node's internal HTTP client. 3-4x faster than axios. | Static HTML pages, sites without bot protection. **First attempt** in fallback chain. |
| **tls-client** | **CycleTLS** (Go subprocess) | TLS fingerprinting that makes HTTPS requests appear as Chrome browser. Bypasses ~60-80% of bot protection (Cloudflare, Akamai, DataDome). | Sites that block plain HTTP but don't require full JavaScript execution. **Second attempt** when `TLS_CLIENT_ENABLED=true`. |
| **playwright** | **Hero** (`@ulixee/hero`) | Full headless Chromium browser with JavaScript execution and stealth capabilities. | JS-rendered pages, sites requiring browser automation. **Last resort** in fallback chain. |

**Fallback Order:** fetch → tls-client → playwright

**Important Notes:**
- The term "fetch" refers to undici's HTTP client, not the browser `fetch()` API
- The term "playwright" is a legacy name - we actually use `@ulixee/hero`, not Playwright
- `tls-client` is controlled by the `TLS_CLIENT_ENABLED` environment variable, which defaults to `true`; set `TLS_CLIENT_ENABLED=false` to disable it
- Each scraper is tried sequentially; if one succeeds (≥100 chars content or screenshot), the loop breaks
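
The sequential fallback described in these notes might look roughly like the following. This is a hedged TypeScript sketch; `scrapeWithFallback`, the `Scraper` type, and the success check are illustrative stand-ins, not the project's real API:

```typescript
// Hypothetical sketch of the three-tier fallback chain described above.
type Scraper = (url: string) => Promise<{ content: string; screenshot?: string }>;

async function scrapeWithFallback(
  url: string,
  scrapers: Record<string, Scraper>,
  tlsClientEnabled = true, // mirrors TLS_CLIENT_ENABLED (default: true)
) {
  // fetch -> tls-client (if enabled) -> playwright
  const order = ["fetch", ...(tlsClientEnabled ? ["tls-client"] : []), "playwright"];
  for (const name of order) {
    try {
      const result = await scrapers[name](url);
      // Success criterion from the notes above: >= 100 chars of content or a screenshot.
      if (result.content.length >= 100 || result.screenshot) return { name, ...result };
    } catch {
      // this tier failed; fall through to the next scraper
    }
  }
  throw new Error(`all scrapers failed for ${url}`);
}
```

A cheaper tier is always tried first, so the full Hero browser only spins up for pages the lighter HTTP clients cannot handle.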

### Scaling concerns

Your scaling bottlenecks will be, in order:

-1. `MAX_CONCURRENCY` (number of headless puppeteer browsers) on each of the `playwright-service`
-2. Actual number of `playwright-service`'s you have behind your load-balancer
-3. Number of `firecrawl-worker`'s you have (very rarely the case this is your bottleneck)
+1. `MAX_CONCURRENCY` (number of Hero browser instances) on each `playwright-service` container
+2. Actual number of `playwright-service` containers you have behind your load-balancer
+3. Number of `firecrawl-worker` containers you have (this is very rarely your bottleneck)

Note: While the service is named `playwright-service`, it actually runs `@ulixee/hero` (a full headless browser), not Playwright. The naming is a legacy artifact from before the Hero migration.

### Crawling

1 change: 1 addition & 0 deletions _env
3 changes: 2 additions & 1 deletion apps/api/Dockerfile
@@ -1,4 +1,4 @@
-FROM node:20-slim AS base
+FROM node:22-slim AS base

# Create app directory
WORKDIR /app
@@ -19,6 +19,7 @@ RUN apt-get update -qq && \
ca-certificates \
git \
golang-go \
python3 \
&& update-ca-certificates

# Copy the rest of the application