- Goal: Crawl a website’s homepage, collect all links, and verify that each link loads successfully and matches its link text.
- Link extraction: Uses `page.extract()` with a Pydantic schema to pull all links and their visible text from the homepage.
- Content verification: Opens each link and uses AI to assess whether the page content matches what the link text suggests.
- Social link handling: Detects social media domains and only checks that they load (skipping full content verification).
- Batch processing: Processes links in batches controlled by `MAX_CONCURRENT_LINKS` (sequential by default, can be made concurrent).
- Stagehand (Python v2): Python client that wraps AI-powered browser automation on top of Browserbase.
  Docs → https://docs.stagehand.dev/v3/sdk/python
- extract: Extract structured data from web pages using natural language instructions and Pydantic models.
  Docs → https://docs.stagehand.dev/basics/extract
- concurrent sessions: Run multiple browser sessions at the same time for faster batch processing.
  Docs → https://docs.browserbase.com/guides/concurrency-rate-limits
- cd into the template
cd python/website-link-tester
- Create & activate a virtual environment (optional but recommended)
uv venv venv
source venv/bin/activate (macOS/Linux)
venv\Scripts\activate (Windows)
- Install dependencies with uv
uv pip install stagehand python-dotenv pydantic
- Configure environment
- Ensure your `.env` file (at repo root or in this folder) contains: `BROWSERBASE_API_KEY` and `BROWSERBASE_PROJECT_ID`
- Run the script
python main.py
- Initial setup
- Initializes a Stagehand session with Browserbase.
- Prints a live session link for monitoring the browser in real time (when available).
- Link collection
- Navigates to the configured `URL` (default: https://www.browserbase.com).
- Extracts all links and their link text from the homepage using a Pydantic schema.
- Logs total link count and unique link count after de-duplication.
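The link-collection step can be sketched roughly as follows. The instruction string, the `HomepageLinks` model names, and the `dedupe` helper are illustrative, not the template's exact code; see the extract docs for the precise `page.extract()` signature.

```python
from typing import List
from pydantic import BaseModel

class Link(BaseModel):
    url: str
    text: str

class HomepageLinks(BaseModel):
    links: List[Link]

def dedupe(links: List[Link]) -> List[Link]:
    """De-duplicate by URL, keeping the first occurrence of each."""
    seen: dict = {}
    for link in links:
        seen.setdefault(link.url, link)
    return list(seen.values())

# Inside an active Stagehand session (hypothetical call shape):
# result = await page.extract(
#     "Extract every link on this page with its visible text",
#     schema=HomepageLinks,
# )
# unique_links = dedupe(result.links)
```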
- Verification
- Verifies links in batches using `MAX_CONCURRENT_LINKS`.
- For each link:
  - Confirms the page loads successfully.
  - For non-social links, extracts `page_title`, `content_matches` (boolean), and a short `assessment` (max ~8 words).
  - For social links, confirms load and skips detailed content checks.
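A sketch of the per-link check: the `PageCheck` fields mirror the description above, but the social-domain list and prompt wording are assumptions, not the template's exact values.

```python
from pydantic import BaseModel

class PageCheck(BaseModel):
    page_title: str
    content_matches: bool
    assessment: str  # kept to ~8 words by the extraction instruction

# Illustrative list; the template's actual set of social domains may differ.
SOCIAL_DOMAINS = ("twitter.com", "x.com", "linkedin.com",
                  "facebook.com", "instagram.com", "youtube.com")

def is_social(url: str) -> bool:
    """True if the URL points at a known social media domain."""
    return any(domain in url for domain in SOCIAL_DOMAINS)

# Per link, inside the session (hypothetical call shape):
# await page.goto(link.url, wait_until="domcontentloaded")
# if not is_social(link.url):
#     check = await page.extract(
#         f"Does this page's content match the link text '{link.text}'?",
#         schema=PageCheck,
#     )
```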
- Final report
- Prints a JSON summary including:
  - total links
  - successful vs failed checks
  - per-link details (title, match flag, assessment, and any errors)
- Always closes browser sessions cleanly via context managers.
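The final report might be assembled like this; the field names are illustrative, not necessarily those used in `main.py`.

```python
import json

def build_summary(results: list) -> dict:
    """results: one dict per link, e.g. with keys "url", "ok",
    "page_title", "content_matches", "assessment", "error" (illustrative)."""
    return {
        "total_links": len(results),
        "successful": sum(1 for r in results if r.get("ok")),
        "failed": sum(1 for r in results if not r.get("ok")),
        "details": results,
    }

# print(json.dumps(build_summary(results), indent=2))
```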
- Missing credentials
- Ensure `.env` contains `BROWSERBASE_API_KEY` (and `BROWSERBASE_PROJECT_ID`).
- Concurrency limits
- `MAX_CONCURRENT_LINKS > 1` will open multiple browsers in parallel and requires a Browserbase plan that supports concurrency (Startup, Developer, or higher).
- Slow or failing pages
- Some links may be slow, geo-restricted, or require auth/consent; these can produce timeouts or error messages in the results.
- Dynamic or JS-heavy sites
- Heavily scripted pages might take longer to reach `"domcontentloaded"`; adjust timeouts if needed.
- Social / external redirects
- Social links and complex redirect chains may succeed in loading but not be fully verifiable for content; these are marked as special cases.
- Regression testing: Quickly verify that all key marketing and product links on your homepage still resolve correctly after a deployment.
- Content QA: Detect mismatches between link text and destination page content (e.g., wrong page wired to a CTA).
- SEO and UX audits: Find broken or misdirected links that can harm search rankings or user experience.
- Monitoring: Run this periodically to flag link issues across your marketing site or documentation hub.
- `MAX_CONCURRENT_LINKS` in `main.py`
  - Default: `1` → sequential link verification (works on all plans).
  - Set to `> 1` → more concurrent link verifications per batch (requires higher Browserbase concurrency limits).
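The batching described above can be sketched with a small chunking helper; the `batched` function is illustrative, and `main.py` may structure its loop differently.

```python
from typing import Iterable, List, TypeVar

T = TypeVar("T")

MAX_CONCURRENT_LINKS = 1  # template default: sequential verification

def batched(items: List[T], size: int) -> Iterable[List[T]]:
    """Yield consecutive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# for batch in batched(unique_links, MAX_CONCURRENT_LINKS):
#     verify each link in `batch` (concurrently if size > 1)
```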
- Using Semaphores for advanced control
- For more fine-grained control over concurrency (e.g., rate limiting, prioritization, or per-domain limits), you can wrap link verification in a Semaphore or similar concurrency primitive.
- This lets you:
- Cap how many verifications run at once.
- Smooth out spikes in resource usage.
- Apply different limits for external vs internal links if desired.
- Filter link scopes: Limit verification to specific path prefixes (e.g., only `/docs` or `/blog`) or exclude certain domains.
- Recursive crawling: Start from the homepage, follow internal links to secondary/tertiary pages, and cascade link discovery deeper into the site to build a more complete link map.
- Alerting & monitoring: Integrate with Slack, email, or logging tools to notify when links start failing.
- CI integration: Run this in CI and fail builds when a critical link (e.g., signup, pricing, docs) breaks.
- Richer assessments: Expand the extraction schema to capture additional metadata (e.g., HTTP status code, canonical URL, or key headings).
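Scope filtering can be as simple as a URL predicate applied before verification; the prefixes and excluded domains below are examples, not defaults from the template.

```python
from urllib.parse import urlparse

ALLOWED_PREFIXES = ("/docs", "/blog")        # example path scopes
EXCLUDED_DOMAINS = {"twitter.com", "x.com"}  # example exclusions

def in_scope(url: str) -> bool:
    """True if the URL's domain is allowed and its path matches a prefix."""
    parsed = urlparse(url)
    if parsed.netloc in EXCLUDED_DOMAINS:
        return False
    return parsed.path.startswith(ALLOWED_PREFIXES)

# links_to_check = [l for l in unique_links if in_scope(l.url)]
```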
- 📚 Stagehand Docs: https://docs.stagehand.dev/v2/first-steps/introduction
- 🎮 Browserbase: https://www.browserbase.com
- 💡 Try it out: https://www.browserbase.com/playground
- 🔧 Templates: https://www.browserbase.com/templates
- 📧 Need help?: support@browserbase.com
- 💬 Discord: http://stagehand.dev/discord