Overview
Browser automation is programmatic control of a web browser: navigating pages, filling forms, clicking buttons, taking screenshots, and extracting data without human input. Modern frameworks (Playwright, Puppeteer, Selenium) drive real browser engines (Chromium, Firefox, WebKit), so pages render and behave as they would for a human user, although sites with bot detection can still identify automated sessions.
AI agents use browser automation as a primary tool for interacting with services that have no API: filing insurance claims, checking auction bids, submitting web forms, scraping prices, and navigating legacy portals.
Related: AI agents, Browser automation
Core frameworks
Playwright (Microsoft)
Widely regarded as the best current choice for AI agent use:
- Supports Chromium, Firefox, and WebKit with a single API
- Built-in codegen command records browser interactions and outputs selectors; fastest way to bootstrap form-submission scripts
- Screenshot API included (essential for human-in-the-loop approval flows)
- Good handling of SPAs and dynamic content via auto-waiting
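As a minimal sketch of the single-API point, the Python function below runs the same launch, navigate, and screenshot calls on whichever engine is named. The function name capture_screenshot, the SUPPORTED_BROWSERS tuple, and the default file path are illustrative assumptions, not part of Playwright.

```python
SUPPORTED_BROWSERS = ("chromium", "firefox", "webkit")


def capture_screenshot(url: str, browser_name: str = "chromium",
                       path: str = "page.png") -> str:
    """Open `url` in the chosen engine and save a full-page screenshot."""
    if browser_name not in SUPPORTED_BROWSERS:
        raise ValueError(f"unsupported browser: {browser_name!r}")

    # Imported lazily so this module loads even where Playwright is absent.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # The same calls work for all three engines; only the attribute differs.
        browser = getattr(p, browser_name).launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url)  # waits for the page's load event by default
            page.screenshot(path=path, full_page=True)
        finally:
            browser.close()
    return path
```

Calling capture_screenshot("https://example.com", "firefox") would exercise the same code path on Firefox, provided the browser binaries have been installed.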
Puppeteer (Google)
Primarily Chromium-focused (Firefox support was added later); the dominant tool before Playwright and still widely used. Playwright is generally preferred for new projects.
Selenium
The original; cross-browser; slower and more verbose than modern alternatives; still used in enterprise test suites.
AI agent patterns
Discovering internal APIs
Agents (and Playwright itself) can intercept network requests to find the internal JSON APIs a site uses to render its pages. Once discovered, calling the API directly is faster and more stable than DOM scraping.
In a couple of cases, it found the internal (unpublished) API that the site used to render the page and did some scripting to leverage it directly. — PracticlySpeaking, r/hermesagent, 2026-06-18
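One way to do this kind of discovery, sketched in Python: listen to Playwright's response event and keep the URLs of JSON responses fetched by page scripts. The helper names looks_like_json_api and discover_json_endpoints are hypothetical, and the filter is a heuristic that will miss APIs served with unusual content types.

```python
from urllib.parse import urlparse


def looks_like_json_api(content_type: str, resource_type: str) -> bool:
    """Heuristic: a JSON response requested by page scripts (XHR or fetch)."""
    return resource_type in ("xhr", "fetch") and "json" in content_type.lower()


def discover_json_endpoints(url: str) -> list[str]:
    """Load `url` and return the distinct JSON endpoints the page called."""
    from playwright.sync_api import sync_playwright  # lazy import

    found: set[str] = set()

    def on_response(response) -> None:
        content_type = response.headers.get("content-type", "")
        if looks_like_json_api(content_type, response.request.resource_type):
            parsed = urlparse(response.url)
            # Drop the query string so repeated calls collapse to one entry.
            found.add(f"{parsed.scheme}://{parsed.netloc}{parsed.path}")

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            page.on("response", on_response)
            # Wait until network activity settles so late XHR calls are seen.
            page.goto(url, wait_until="networkidle")
        finally:
            browser.close()
    return sorted(found)
```

The returned list is only a starting point: each endpoint still needs to be inspected for required headers, cookies, and parameters before it can be called directly.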
Human-in-the-loop claim submission (medical claims example)
- Agent parses bill (PDF/image → vision LLM → structured JSON)
- Playwright fills the insurance portal form
- Agent takes a screenshot of the pre-filled form
- Human reviews the screenshot and approves
- Agent clicks submit only after explicit confirmation
- Agent archives the bill and creates a follow-up reminder
State machine: draft → dry_run → submitted (with the --confirm flag required for live submit)
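The state machine can be sketched in a few lines of Python. The class and attribute names here (ClaimState, Claim, advance) are hypothetical; the confirm argument stands in for the confirmation flag, and the real pipeline described in the source may be structured differently.

```python
from enum import Enum


class ClaimState(Enum):
    DRAFT = "draft"
    DRY_RUN = "dry_run"
    SUBMITTED = "submitted"


# Each state may only advance to the next one in the chain.
NEXT_STATE = {
    ClaimState.DRAFT: ClaimState.DRY_RUN,
    ClaimState.DRY_RUN: ClaimState.SUBMITTED,
}


class Claim:
    def __init__(self) -> None:
        self.state = ClaimState.DRAFT

    def advance(self, confirm: bool = False) -> ClaimState:
        """Move to the next state; live submission needs confirm=True."""
        nxt = NEXT_STATE.get(self.state)
        if nxt is None:
            raise RuntimeError("claim already submitted")
        if nxt is ClaimState.SUBMITTED and not confirm:
            # The human has not approved the screenshot yet: refuse to submit.
            raise PermissionError("live submit requires explicit confirmation")
        self.state = nxt
        return self.state
```

Because the refusal leaves the state unchanged, an agent that forgets the confirmation stays in dry_run and can retry once the human has approved.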
Secret management
Never hardcode credentials in Playwright scripts used by agents:
- Use a password manager (1Password, Bitwarden) via CLI or env vars
- Time-box sensitive access: grant write permissions for one hour, revoke after the task completes
- For personal-data portals that use PII (DOB, name) instead of passwords: store PII in a vault, not in plain-text config files
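A minimal sketch of the env-var approach, assuming the password manager (or a wrapper around its CLI) has already placed the secrets in the process environment. The variable names PORTAL_USERNAME and PORTAL_PASSWORD and the helper load_portal_credentials are illustrative, not a convention of any particular tool.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required credential is absent from the environment."""


def load_portal_credentials(prefix: str = "PORTAL") -> dict[str, str]:
    """Read credentials from env vars and fail loudly if any are missing."""
    names = {"username": f"{prefix}_USERNAME", "password": f"{prefix}_PASSWORD"}
    missing = [env for env in names.values() if not os.environ.get(env)]
    if missing:
        # Report only the variable names, never the secret values.
        raise MissingSecretError("missing env vars: " + ", ".join(missing))
    return {key: os.environ[env] for key, env in names.items()}
```

Failing before the browser launches keeps a half-configured agent from reaching the login form with empty fields.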
Handling fragile portals
Medical and government portals are particularly problematic:
- Slow page loads: use Playwright’s auto-wait rather than fixed sleeps
- Session timeouts: build reconnect logic
- CAPTCHAs: alert the human rather than attempting to solve
- Rate limits: add random delays and respect robots.txt where appropriate
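The reconnect, CAPTCHA, and random-delay points can be combined in one small retry wrapper. This is a sketch under assumptions: run_with_retries and CaptchaDetected are hypothetical names, and the caller's action is expected to raise CaptchaDetected itself when it recognizes a CAPTCHA page.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CaptchaDetected(Exception):
    """Raised by the caller's action when a CAPTCHA page is detected."""


def run_with_retries(action: Callable[[], T], attempts: int = 3,
                     base_delay: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry `action` with randomized, growing delays between attempts."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except CaptchaDetected:
            raise  # never retry: hand off to the human instead
        except Exception:
            if attempt == attempts:
                raise
            # Exponential backoff with jitter avoids a fixed request rhythm.
            sleep(base_delay * 2 ** (attempt - 1) * random.uniform(0.5, 1.5))
    raise AssertionError("unreachable")
```

The sleep parameter is injectable so the wrapper can be tested without real waiting; in production the default time.sleep applies.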
OCR pipeline for browser-submitted documents
For submitting scanned documents via browser:
- Input: image or PDF
- PDF handling: PyMuPDF renders pages to PNG
- OCR: vision LLM receives base64 PNG with a strict system prompt specifying the output JSON schema (provider, date, line items, total, patient identifier)
- Validation: sum of line items vs declared total as a confidence heuristic
- Output: structured JSON ready for Playwright form-fill
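The rendering and validation steps might look like the following Python sketch. The JSON field names (line_items, amount, total) and both function names are assumptions for illustration; the PDF helper assumes a recent PyMuPDF release that accepts the dpi argument.

```python
import base64
from decimal import Decimal


def pdf_pages_to_base64_png(pdf_path: str, dpi: int = 150) -> list[str]:
    """Render each PDF page to PNG and return base64 strings for the LLM."""
    import fitz  # PyMuPDF, imported lazily

    with fitz.open(pdf_path) as doc:
        return [
            base64.b64encode(page.get_pixmap(dpi=dpi).tobytes("png")).decode("ascii")
            for page in doc
        ]


def totals_match(bill: dict, tolerance: Decimal = Decimal("0.01")) -> bool:
    """Return True when the line items sum to the declared total."""
    # Decimal avoids binary floating-point drift when adding currency amounts.
    items_sum = sum(
        (Decimal(str(item["amount"])) for item in bill["line_items"]),
        Decimal("0"),
    )
    return abs(items_sum - Decimal(str(bill["total"]))) <= tolerance
```

A mismatch does not prove the OCR was wrong (bills can include adjustments not listed as line items), so it is best treated as a signal to route the bill to human review rather than as a hard failure.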
For non-English documents: PaddleOCR handles Chinese, Japanese, Korean, and other scripts well; the structured JSON intermediate step is language-agnostic and can be passed to the same Playwright form-fill code.
Resources
- 2026-06-18 ◦ Am I missing the point of AI agents? (Reddit r/hermesagent) — dontyasay describes a complete medical claims automation pipeline using Playwright + vision LLM OCR + dry-run mode + HSA ledger; PracticlySpeaking describes using Playwright’s codegen and internal API discovery for email pipeline automation
- Playwright documentation — official docs; codegen, screenshot API, browser contexts