Overview

Browser automation is programmatic control of a web browser to navigate pages, fill forms, click buttons, take screenshots, and extract data without human input. Modern frameworks (Playwright, Puppeteer, Selenium) drive real browser engines (Chromium, Firefox, WebKit), so pages render and behave as they would for a human visitor, although bot-detection systems can still identify automated sessions.

AI agents use browser automation as a primary tool for interacting with services that have no API: filing insurance claims, checking auction bids, submitting web forms, scraping prices, and navigating legacy portals.

Related: AI agents, Browser automation

Core frameworks

Playwright (Microsoft)

Generally regarded as the best current choice for AI agent use.

Puppeteer (Google)

Historically Chromium-only (newer releases also support Firefox); the dominant tool before Playwright and still widely used. Playwright is generally preferred for new projects.

Selenium

The original browser automation framework; cross-browser, but slower and more verbose than modern alternatives. Still used in enterprise test suites.

AI agent patterns

Discovering internal APIs

Agents can use Playwright's network interception to observe the requests a page makes and find the internal JSON APIs the site uses to render its content. Once an endpoint is discovered, calling it directly is usually faster and more stable than scraping the DOM.

In a couple of cases, it found the internal (unpublished) API that the site used to render the page and did some scripting to leverage it directly. — PracticlySpeaking, r/hermesagent, 2026-06-18
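
A minimal sketch of this discovery step, assuming Playwright's Python sync API. The helper names looks_like_json_api and attach_api_recorder are hypothetical, and the filter (XHR/fetch traffic that returns JSON with a 2xx status) is only a heuristic, not a definitive way to identify a site's internal API.

```python
from urllib.parse import urlsplit


def looks_like_json_api(resource_type: str, content_type: str, status: int) -> bool:
    """Heuristic: XHR/fetch traffic that successfully returned JSON."""
    return (
        resource_type in {"xhr", "fetch"}
        and "json" in content_type.lower()
        and 200 <= status < 300
    )


def attach_api_recorder(page, found: list) -> None:
    """Record candidate internal API endpoints seen while `page` loads.

    `page` is expected to behave like a Playwright Page: only page.on(),
    response.url, response.status, response.headers and
    response.request.resource_type / .method are used.
    """

    def on_response(response):
        content_type = response.headers.get("content-type", "")
        request = response.request
        if looks_like_json_api(request.resource_type, content_type, response.status):
            parts = urlsplit(response.url)
            found.append(
                {
                    "method": request.method,
                    "endpoint": f"{parts.scheme}://{parts.netloc}{parts.path}",
                    "query": parts.query,
                }
            )

    # Playwright emits a "response" event for every response the page receives.
    page.on("response", on_response)
```

With Playwright installed, the recorder would be attached before page.goto(...) so that requests made during the initial render are captured; the collected endpoints still need manual review before an agent relies on them.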

Human-in-the-loop claim submission (medical claims example)

  1. Agent parses bill (PDF/image → vision LLM → structured JSON)
  2. Playwright fills the insurance portal form
  3. Agent takes a screenshot of the pre-filled form
  4. Human reviews the screenshot and approves
  5. Agent clicks submit only after explicit confirmation
  6. Agent archives the bill and creates a follow-up reminder

State machine: draft → dry_run → submitted (the --confirm flag is required for live submission)
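
The flow above can be sketched as a small state machine. This is an illustration under assumptions, not the original implementation: the class name ClaimSubmission and the callables passed in (fill_form, take_screenshot, click_submit) are hypothetical stand-ins for the Playwright steps, and the confirm argument plays the role of the confirmation flag.

```python
from enum import Enum


class ClaimState(Enum):
    DRAFT = "draft"
    DRY_RUN = "dry_run"
    SUBMITTED = "submitted"


class ClaimSubmission:
    """Tracks one claim through draft -> dry_run -> submitted."""

    def __init__(self):
        self.state = ClaimState.DRAFT

    def dry_run(self, fill_form, take_screenshot):
        """Fill the form and return a screenshot for human review."""
        if self.state is not ClaimState.DRAFT:
            raise RuntimeError(f"cannot dry-run from state {self.state.value}")
        fill_form()
        screenshot = take_screenshot()
        self.state = ClaimState.DRY_RUN
        return screenshot

    def submit(self, click_submit, confirm=False):
        """Click submit only after a dry run and explicit confirmation."""
        if self.state is not ClaimState.DRY_RUN:
            raise RuntimeError(f"cannot submit from state {self.state.value}")
        if not confirm:
            raise PermissionError("live submission requires explicit confirmation")
        click_submit()
        self.state = ClaimState.SUBMITTED
```

Keeping the confirmation check inside submit, rather than in the calling code, means an agent cannot reach the submitted state by skipping the review step.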

Secret management

Never hardcode credentials in Playwright scripts used by agents:
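
One common way to follow this rule is to read credentials from environment variables at run time. The sketch below assumes hypothetical variable names PORTAL_USERNAME and PORTAL_PASSWORD and a hypothetical helper load_portal_credentials; a dedicated secret manager would be an equally valid choice.

```python
import os


def load_portal_credentials(env=None):
    """Return (username, password) from the environment, or fail loudly.

    PORTAL_USERNAME and PORTAL_PASSWORD are illustrative names only.
    """
    env = os.environ if env is None else env
    required = ("PORTAL_USERNAME", "PORTAL_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Report which variables are absent without echoing any secret values.
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return env["PORTAL_USERNAME"], env["PORTAL_PASSWORD"]
```

The returned values can then be passed to the form-fill step, so the script itself never contains a secret and can be committed or shared with an agent safely.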

Handling fragile portals

Medical and government portals are particularly problematic:

OCR pipeline for browser-submitted documents

For submitting scanned documents via browser:

  1. Input: image or PDF
  2. PDF handling: PyMuPDF renders pages to PNG
  3. OCR: vision LLM receives base64 PNG with a strict system prompt specifying the output JSON schema (provider, date, line items, total, patient identifier)
  4. Validation: sum of line items vs declared total as a confidence heuristic
  5. Output: structured JSON ready for Playwright form-fill
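
The validation step (4) can be sketched as follows. The field names line_items, amount, and total are assumptions about the JSON schema, and the one-cent tolerance is an arbitrary illustrative choice.

```python
from decimal import Decimal


def totals_match(claim: dict, tolerance: Decimal = Decimal("0.01")) -> bool:
    """Check that the line items add up to the declared total.

    A mismatch suggests the OCR step misread an amount, so the claim
    should be routed to a human instead of being filled automatically.
    """
    line_sum = sum(
        (Decimal(str(item["amount"])) for item in claim["line_items"]),
        Decimal("0"),
    )
    return abs(line_sum - Decimal(str(claim["total"]))) <= tolerance


claim = {
    "line_items": [{"amount": "12.50"}, {"amount": "7.25"}],
    "total": "19.75",
}
print(totals_match(claim))  # True: 12.50 + 7.25 equals the declared 19.75
```

Decimal is used instead of float so that currency amounts are compared exactly rather than with binary rounding error.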

For non-English documents: PaddleOCR handles Chinese, Japanese, Korean, and other scripts well; the structured JSON intermediate step is language-agnostic and can be passed to the same Playwright form-fill code.
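
Because the form-fill step only sees the structured JSON, it can be written once and reused regardless of which OCR path produced the data. A minimal sketch, assuming Playwright's page.fill(selector, value) and a hypothetical mapping from JSON keys to CSS selectors; the helper name fill_claim_form and any real portal's selectors are assumptions.

```python
def fill_claim_form(page, claim: dict, selectors: dict) -> list:
    """Fill each mapped form field from the claim JSON.

    `page` is expected to behave like a Playwright Page; only page.fill()
    is used. Returns the claim keys that had no value, so the caller can
    flag them for human review instead of submitting an incomplete form.
    """
    skipped = []
    for key, selector in selectors.items():
        value = claim.get(key)
        if value is None:
            skipped.append(key)
            continue
        page.fill(selector, str(value))
    return skipped
```

The selector map is the only portal-specific part, so supporting another portal would mean supplying a different mapping rather than new fill logic.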

Resources