- Goal: Crawl a website’s homepage, collect all links, and verify that each link loads successfully and matches its link text.
- Link extraction: Uses `page.extract()` with a Pydantic schema to pull all links and their visible text from the homepage.
- Content verification: Opens each link and uses AI to assess whether the page content matches what the link text suggests.
- Social link handling: Detects social media domains and only checks that they load (skipping full content verification).
- Batch processing: Processes links in batches controlled by `MAX_CONCURRENT_LINKS` (sequential by default, can be made concurrent).
- Stagehand (Python v2): Python client that wraps AI-powered browser automation on top of Browserbase.
  Docs → https://docs.stagehand.dev/v3/sdk/python
- extract: Extract structured data from web pages using natural language instructions and Pydantic models.
  Docs → https://docs.stagehand.dev/basics/extract
- concurrent sessions: Run multiple browser sessions at the same time for faster batch processing.
  Docs → https://docs.browserbase.com/guides/concurrency-rate-limits
- cd into the template
cd python/website-link-tester
- Create & activate a virtual environment (optional but recommended)
uv venv venv
source venv/bin/activate (macOS/Linux)
venv\Scripts\activate (Windows)
- Install dependencies with uv
uv pip install stagehand python-dotenv pydantic
- Configure environment
- Ensure your `.env` file (at repo root or in this folder) contains: `BROWSERBASE_API_KEY` and `BROWSERBASE_PROJECT_ID`
- Run the script
python main.py
- Initial setup
- Initializes a Stagehand session with Browserbase.
- Prints a live session link for monitoring the browser in real time (when available).
- Link collection
- Navigates to the configured `URL` (default: https://www.browserbase.com).
- Extracts all links and their link text from the homepage using a Pydantic schema.
- Logs total link count and unique link count after de-duplication.
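The link-collection step can be sketched roughly as follows. The instruction string, the `HomepageLinks` model names, and the `dedupe` helper are illustrative, not the template's exact code; see the extract docs for the precise `page.extract()` signature.

```python
from typing import List
from pydantic import BaseModel

class Link(BaseModel):
    url: str
    text: str

class HomepageLinks(BaseModel):
    links: List[Link]

def dedupe(links: List[Link]) -> List[Link]:
    """De-duplicate by URL, keeping the first occurrence of each."""
    seen: dict = {}
    for link in links:
        seen.setdefault(link.url, link)
    return list(seen.values())

# Inside an active Stagehand session (hypothetical call shape):
# result = await page.extract(
#     "Extract every link on this page with its visible text",
#     schema=HomepageLinks,
# )
# unique_links = dedupe(result.links)
```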
- Verification
- Verifies links in batches using `MAX_CONCURRENT_LINKS`.
- For each link:
  - Confirms the page loads successfully.
  - For non-social links, extracts `page_title`, `content_matches` (boolean), and a short `assessment` (max ~8 words).
  - For social links, confirms load and skips detailed content checks.
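A sketch of the per-link check: the `PageCheck` fields mirror the description above, but the social-domain list and prompt wording are assumptions, not the template's exact values.

```python
from pydantic import BaseModel

class PageCheck(BaseModel):
    page_title: str
    content_matches: bool
    assessment: str  # kept to ~8 words by the extraction instruction

# Illustrative list; the template's actual set of social domains may differ.
SOCIAL_DOMAINS = ("twitter.com", "x.com", "linkedin.com",
                  "facebook.com", "instagram.com", "youtube.com")

def is_social(url: str) -> bool:
    """True if the URL points at a known social media domain."""
    return any(domain in url for domain in SOCIAL_DOMAINS)

# Per link, inside the session (hypothetical call shape):
# await page.goto(link.url, wait_until="domcontentloaded")
# if not is_social(link.url):
#     check = await page.extract(
#         f"Does this page's content match the link text '{link.text}'?",
#         schema=PageCheck,
#     )
```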
- Final report
- Prints a JSON summary including:
  - total links
  - successful vs failed checks
  - per-link details (title, match flag, assessment, and any errors)
- Always closes browser sessions cleanly via context managers.
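The final report might be assembled like this; the field names are illustrative, not necessarily those used in `main.py`.

```python
import json

def build_summary(results: list) -> dict:
    """results: one dict per link, e.g. with keys "url", "ok",
    "page_title", "content_matches", "assessment", "error" (illustrative)."""
    return {
        "total_links": len(results),
        "successful": sum(1 for r in results if r.get("ok")),
        "failed": sum(1 for r in results if not r.get("ok")),
        "details": results,
    }

# print(json.dumps(build_summary(results), indent=2))
```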
- Missing credentials
- Ensure `.env` contains `BROWSERBASE_API_KEY` (and `BROWSERBASE_PROJECT_ID`).
- Concurrency limits
- `MAX_CONCURRENT_LINKS > 1` will open multiple browsers in parallel and requires a Browserbase plan that supports concurrency (Startup, Developer, or higher).
- Slow or failing pages
- Some links may be slow, geo-restricted, or require auth/consent; these can produce timeouts or error messages in the results.
- Dynamic or JS-heavy sites
- Heavily scripted pages might take longer to reach `"domcontentloaded"`; adjust timeouts if needed.
- Social / external redirects
- Social links and complex redirect chains may succeed in loading but not be fully verifiable for content; these are marked as special cases.
- Regression testing: Quickly verify that all key marketing and product links on your homepage still resolve correctly after a deployment.
- Content QA: Detect mismatches between link text and destination page content (e.g., wrong page wired to a CTA).
- SEO and UX audits: Find broken or misdirected links that can harm search rankings or user experience.
- Monitoring: Run this periodically to flag link issues across your marketing site or documentation hub.
- `MAX_CONCURRENT_LINKS` in `main.py`
  - Default: `1` → sequential link verification (works on all plans).
  - Set to `> 1` → more concurrent link verifications per batch (requires higher Browserbase concurrency limits).
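The batching described above can be sketched with a small chunking helper; the `batched` function is illustrative, and `main.py` may structure its loop differently.

```python
from typing import Iterable, List, TypeVar

T = TypeVar("T")

MAX_CONCURRENT_LINKS = 1  # template default: sequential verification

def batched(items: List[T], size: int) -> Iterable[List[T]]:
    """Yield consecutive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# for batch in batched(unique_links, MAX_CONCURRENT_LINKS):
#     verify each link in `batch` (concurrently if size > 1)
```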
- Using Semaphores for advanced control
- For more fine-grained control over concurrency (e.g., rate limiting, prioritization, or per-domain limits), you can wrap link verification in a Semaphore or similar concurrency primitive.
- This lets you:
- Cap how many verifications run at once.
- Smooth out spikes in resource usage.
- Apply different limits for external vs internal links if desired.
- Filter link scopes: Limit verification to specific path prefixes (e.g., only `/docs` or `/blog`) or exclude certain domains.
- Recursive crawling: Start from the homepage, follow internal links to secondary/tertiary pages, and cascade link discovery deeper into the site to build a more complete link map.
- Alerting & monitoring: Integrate with Slack, email, or logging tools to notify when links start failing.
- CI integration: Run this in CI and fail builds when a critical link (e.g., signup, pricing, docs) breaks.
- Richer assessments: Expand the extraction schema to capture additional metadata (e.g., HTTP status code, canonical URL, or key headings).
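Scope filtering can be as simple as a URL predicate applied before verification; the prefixes and excluded domains below are examples, not defaults from the template.

```python
from urllib.parse import urlparse

ALLOWED_PREFIXES = ("/docs", "/blog")        # example path scopes
EXCLUDED_DOMAINS = {"twitter.com", "x.com"}  # example exclusions

def in_scope(url: str) -> bool:
    """True if the URL's domain is allowed and its path matches a prefix."""
    parsed = urlparse(url)
    if parsed.netloc in EXCLUDED_DOMAINS:
        return False
    return parsed.path.startswith(ALLOWED_PREFIXES)

# links_to_check = [l for l in unique_links if in_scope(l.url)]
```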
- 📚 Stagehand Docs: https://docs.stagehand.dev/v2/first-steps/introduction
- 🎮 Browserbase: https://www.browserbase.com
- 💡 Try it out: https://www.browserbase.com/playground
- 🔧 Templates: https://www.browserbase.com/templates
- 📧 Need help?: support@browserbase.com
- 💬 Discord: http://stagehand.dev/discord