The world's only agentic web crawler.
Built with the brain of a human who knows distributed crawling architectures.
Endpoints · Ghost Protocol · Live Stream · MCP Tools · Quick Start · Architecture
Grub Crawler gets dirty so you don't have to. It cuts through every layer of protection — Cloudflare, CAPTCHAs, JavaScript walls — and digs into the DOM until it finds what it came for. When the front door is locked, Ghost Protocol goes around back: it screenshots the rendered page and lets a vision model read the pixels directly, no DOM required. Multi-provider? It rotates across OpenAI, Anthropic, and Ollama in the same session, falling back automatically when one fails. The result is raw, unfiltered content extraction that turns any page into clean markdown.
| Capability | Traditional Crawlers | Grub Crawler |
|---|---|---|
| Anti-bot bypass | ❌ | ✅ Ghost Protocol (vision AI) |
| Autonomous browsing | ❌ | ✅ Agent loop with planning |
| Multi-page reasoning | ❌ | ✅ Bounded state machine |
| LLM fallback rotation | ❌ | ✅ OpenAI / Anthropic / Ollama |
| Policy enforcement | ❌ | ✅ Domain gates, secret redaction |
| Live browser stream | ❌ | ✅ CDP screencast over WebSocket/MJPEG |
| Replayable traces | ❌ | ✅ Full JSON trace per run |
| Method | Path | Description | Status |
|---|---|---|---|
| POST | `/api/crawl` | Single URL crawl (HTML + markdown) | Live |
| POST | `/api/markdown` | Single or multi-URL markdown extraction | Live |
| POST | `/api/batch` | Batch crawl with job tracking | Live |
| POST | `/api/raw` | Raw HTML extraction (no markdown) | Live |
| GET | `/view` | Browser-rendered HTML viewer | Live |
| GET | `/download` | File download (PDFs, etc.) through crawler | Live |
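For example, a single crawl from Python could look like the sketch below, assuming a local instance on `:8080`; the `url` request field is a guess at the body shape, not a documented schema:

```python
import requests

BASE = "http://localhost:8080"  # local uvicorn setup from the Quick Start

# Hypothetical body: a single-URL crawl plausibly takes just the target URL
resp = requests.post(f"{BASE}/api/crawl", json={"url": "https://example.com"}, timeout=60)
resp.raise_for_status()
print(sorted(resp.json().keys()))  # inspect what the endpoint actually returns
```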
| Method | Path | Description | Status |
|---|---|---|---|
| POST | `/api/agent/run` | Submit task to autonomous agent loop | Live |
| GET | `/api/agent/status/{run_id}` | Check agent run status / load trace | Live |
| POST | `/api/agent/ghost` | Ghost Protocol: screenshot + vision extract | Live |
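Submit-then-poll from Python might look like this; the request fields come from the Quick Start, while the `run_id` and `state` response field names are assumptions:

```python
import time
import requests

BASE = "http://localhost:8080"

# Request fields taken from the Quick Start curl example
run = requests.post(f"{BASE}/api/agent/run", json={
    "task": "Find the pricing page on example.com and extract plan details",
    "max_steps": 10,
    "allowed_domains": ["example.com"],
}).json()

run_id = run["run_id"]  # assumed response field name

# Poll until the run leaves the active states of the bounded loop
while True:
    status = requests.get(f"{BASE}/api/agent/status/{run_id}").json()
    if status.get("state") not in ("INIT", "PLAN", "EXECUTE_TOOL", "OBSERVE"):
        break
    time.sleep(2)

print(status)
```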
| Method | Path | Description | Status |
|---|---|---|---|
| POST | `/api/jobs/create` | Generic job submission | Live |
| POST | `/api/jobs/crawl` | Submit single URL crawl job | Live |
| POST | `/api/jobs/batch-crawl` | Submit batch crawl job | Live |
| POST | `/api/jobs/markdown` | Submit markdown-only job | Live |
| POST | `/api/jobs/process-job` | Cloud Tasks worker endpoint | Live |
| POST | `/api/wraith` | AI-driven crawl workflow | Placeholder |
| Method | Path | Description | Status |
|---|---|---|---|
| POST | `/api/cache/search` | Fuzzy search cached content | Live |
| GET | `/api/cache/list` | List cached document metadata | Live |
| GET | `/api/cache/doc/{doc_id}` | Fetch one cached document | Live |
| POST | `/api/cache/upsert` | Upsert cache entries | Live |
| POST | `/api/cache/prune` | Prune cache entries by TTL/domain | Live |
| Method | Path | Description | Status |
|---|---|---|---|
| GET | `/api/sessions/{session_id}/files` | List session files | Live |
| GET | `/api/sessions/{session_id}/file` | Get specific file | Live |
| GET | `/api/sessions/{session_id}/status` | Session progress status | Live |
| GET | `/api/sessions/{session_id}/results` | All crawl results | Live |
| GET | `/api/sessions/{session_id}/screenshots` | List screenshots | Live |
| Method | Path | Description | Status |
|---|---|---|---|
| WS | `/stream/{session_id}` | WebSocket viewport stream | Live |
| GET | `/stream/{session_id}/mjpeg` | MJPEG fallback stream | Live |
| GET | `/stream/{session_id}/status` | Stream session status | Live |
| GET | `/stream/pool/status` | Browser pool status | Live |
| Method | Path | Description | Status |
|---|---|---|---|
| GET | `/health` | Health check + tool count | Live |
| GET | `/tools` | List registered AHP tools | Live |
| GET | `/{tool_name}` | Execute AHP tool (catch-all) | Live |
The MCP bridge exposes all capabilities to any MCP-compatible host:
| Tool | Description | Status |
|---|---|---|
| `crawl_url` | Single URL markdown extraction with JS injection | Live |
| `crawl_batch` | Batch processing up to 50 URLs with collation | Live |
| `raw_html` | Raw HTML fetch without conversion | Live |
| `download_file` | Download files (PDFs, etc.) through crawler | Live |
| `crawl_validate` | Content quality assessment | Live |
| `crawl_search` | Fuzzy search local crawl cache | Live |
| `crawl_cache_list` | List local cached files | Live |
| `crawl_remote_search` | Search remote crawler cache | Live |
| `crawl_remote_cache_list` | List remote cache entries | Live |
| `crawl_remote_cache_doc` | Fetch remote cached document | Live |
| `agent_run` | Submit task to autonomous agent (Mode B) | Live |
| `agent_status` | Check agent run status | Live |
| `ghost_extract` | Ghost Protocol: screenshot + vision AI extraction | Live |
| `set_auth_token` | Save auth token to .wraithenv | Live |
| `crawl_status` | Report configuration and connection | Live |
| File | Purpose | Status |
|---|---|---|
| `types.py` | RunState enum, StopReason, ToolCall, ToolResult, AssistantAction, RunConfig, RunContext, StepTrace, RunResult | Done |
| `errors.py` | Typed errors: validation_error, policy_denied, tool_timeout, tool_unavailable, execution_error, provider_error, stop_condition | Done |
| `dispatcher.py` | Tool validation, timeout enforcement (30s), retry (1x), typed error normalization | Done |
| `engine.py` | Bounded loop: plan -> execute -> observe -> stop. EventBus integration. Returns (RunResult, RunSummary) | Done |
| `ghost.py` | Ghost Protocol: block detection, screenshot capture, vision extraction, auto-trigger | Done |
| File | Purpose | Status |
|---|---|---|
| `base.py` | LLMAdapter ABC, FallbackAdapter (rotate on failure), factory functions | Done |
| `openai_adapter.py` | OpenAI tool_calls mapping, GPT-4o vision | Done |
| `anthropic_adapter.py` | Anthropic tool_use/tool_result blocks, Claude Sonnet vision | Done |
| `ollama_adapter.py` | Ollama HTTP /api/chat, llava vision | Done |
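The rotate-on-failure behavior of `FallbackAdapter` reduces to a simple loop. A minimal sketch, where the class and method names are illustrative rather than the actual `base.py` interface:

```python
from abc import ABC, abstractmethod

class LLMAdapter(ABC):
    """Minimal adapter interface; the real ABC in base.py is richer."""
    @abstractmethod
    def complete(self, messages: list[dict]) -> dict: ...

class FallbackAdapter(LLMAdapter):
    """Try each configured provider in order; rotate to the next on failure."""
    def __init__(self, adapters: list[LLMAdapter]):
        self.adapters = adapters

    def complete(self, messages: list[dict]) -> dict:
        last_error = None
        for adapter in self.adapters:
            try:
                return adapter.complete(messages)
            except Exception as exc:  # the real code normalizes this to provider_error
                last_error = exc
        raise RuntimeError("all providers failed") from last_error
```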
| File | Purpose | Status |
|---|---|---|
| `domain.py` | Domain allowlist, RFC-1918/loopback/link-local deny | Done |
| `gate.py` | Pre-tool and pre-fetch policy checks with PolicyVerdict | Done |
| `redaction.py` | Secret pattern redaction (API keys, JWTs, private keys) | Done |
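The RFC-1918/loopback/link-local deny boils down to an address check. A minimal sketch using the standard `ipaddress` module; the function name is illustrative, not the actual `domain.py` API:

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_private_target(url: str) -> bool:
    """Resolve the host and reject private, loopback, and link-local ranges."""
    host = urlparse(url).hostname or ""
    try:
        addr = ipaddress.ip_address(socket.gethostbyname(host))
    except (socket.gaierror, ValueError):
        return True  # fail closed on unresolvable hosts
    return addr.is_private or addr.is_loopback or addr.is_link_local

assert is_private_target("http://127.0.0.1:8080/")    # loopback: deny
assert is_private_target("http://192.168.1.10/admin") # RFC-1918: deny
```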
| File | Purpose | Status |
|---|---|---|
| `events.py` | EventBus + 7 typed events: run_start, step_start, tool_dispatch, tool_result, policy_denied, step_end, run_end | Done |
| `trace.py` | TraceCollector, RunSummary JSON serialization, persist_trace() / load_trace() via storage | Done |
| File | Purpose | Status |
|---|---|---|
| `agent_routes.py` | POST /api/agent/run, GET /api/agent/status/{run_id}; returns 503 when disabled | Done |
| `routes.py` | Core crawl/markdown/batch/cache REST endpoints | Done |
| `job_routes.py` | Job CRUD, session status, Cloud Tasks worker | Done |
| `jobs.py` | JobType enum (incl. AGENT_RUN), JobManager, JobProcessor | Done |
| `models.py` | All Pydantic models incl. AgentRunRequest/Response | Done |
| File | Purpose | Status |
|---|---|---|
| `config.py` | All env vars incl. agent + provider + ghost config | Done |
| `storage.py` | User-partitioned storage (local filesystem / GCS) | Done |
| `crawler.py` | Playwright crawling engine | Done |
| `markdown.py` | HTML to markdown conversion | Done |
| `browser.py` | Browser automation utilities | Done |
| `browser_pool.py` | Persistent Chromium pool with lease/return pattern | Done |
| `stream.py` | CDP screencast → WebSocket/MJPEG relay + interactive commands | Done |
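The screencast relay in `stream.py` rides CDP's `Page.startScreencast`. A minimal sketch of that mechanism through Playwright's CDP session; the real code relays frames over WebSocket/MJPEG instead of printing them:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    cdp = page.context.new_cdp_session(page)

    def on_frame(params):
        # params["data"] is a base64-encoded JPEG frame; ack so Chromium keeps streaming
        cdp.send("Page.screencastFrameAck", {"sessionId": params["sessionId"]})
        print("frame:", len(params["data"]), "base64 chars")

    cdp.on("Page.screencastFrame", on_frame)
    # quality/maxWidth mirror the BROWSER_STREAM_QUALITY / BROWSER_STREAM_MAX_WIDTH defaults
    cdp.send("Page.startScreencast", {"format": "jpeg", "quality": 25, "maxWidth": 854})
    page.goto("https://example.com")
    page.wait_for_timeout(3000)  # stream for a few seconds
    cdp.send("Page.stopScreencast")
    browser.close()
```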
```
INIT -> PLAN -> EXECUTE_TOOL -> OBSERVE -> PLAN -> ... -> RESPOND -> STOP
         |
         +-- policy_denied ------------------------------> STOP
         +-- max_steps / max_wall_time / max_failures ----> STOP
         +-- no_op_loop (3x empty) -----------------------> STOP
         +-- blocked (ghost trigger) ---------------------> GHOST -> OBSERVE
```
Stop conditions enforced every iteration:
- `max_steps` (default: 12)
- `max_wall_time` (default: 90s)
- `max_failures` (default: 3)
- `no_op_loop` (3 consecutive empty responses)
- `policy_denied` (blocked tool/domain)
- `completed` (agent responds with text)
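In shape, the engine re-checks those conditions on every iteration. An illustrative sketch, not `engine.py` itself; the `plan`/`execute`/`observe` callables and the action/result fields are stand-ins:

```python
import time

def run_agent(task, config, plan, execute, observe):
    """Bounded loop sketch: plan -> execute -> observe, re-checking stop conditions each step."""
    start = time.monotonic()
    failures = empty_responses = 0

    for _step in range(config.max_steps):                      # max_steps bound
        if (time.monotonic() - start) * 1000 > config.max_wall_time_ms:
            return "max_wall_time"

        action = plan(task)
        if action.is_final_response:                           # agent answered in text
            return "completed"

        if not action.tool_calls:
            empty_responses += 1
            if empty_responses >= 3:                           # three empty turns in a row
                return "no_op_loop"
            continue
        empty_responses = 0

        if not config.policy_allows(action):                   # blocked tool/domain
            return "policy_denied"

        result = execute(action)
        if result.failed:
            failures += 1
            if failures >= config.max_failures:
                return "max_failures"
        observe(result)

    return "max_steps"
```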
When a crawl result signals an anti-bot block (Cloudflare challenge, CAPTCHA, empty SPA shell), the agent can switch to cloak mode:
- Take a full-page screenshot via Playwright
- Send the image to a vision-capable LLM (Claude Sonnet or GPT-4o)
- Extract content from the rendered pixels
- Return the extracted text with `render_mode: "ghost"` in the trace
This bypasses DOM-based anti-bot detection entirely.
Requires `AGENT_GHOST_ENABLED=true`. Auto-triggers on detected blocks when `AGENT_GHOST_AUTO_TRIGGER=true`.
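Under the hood the flow is screenshot-then-vision. A sketch using Playwright's sync API and the Anthropic SDK; the prompt is illustrative, and the model name matches the `ANTHROPIC_MODEL` default from the configuration section below:

```python
import base64
from anthropic import Anthropic
from playwright.sync_api import sync_playwright

def ghost_extract(url: str) -> str:
    """Ghost Protocol sketch: render the page, screenshot it, let a vision model read the pixels."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        png = page.screenshot(full_page=True)  # full-page capture; the DOM is never parsed
        browser.close()

    client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    msg = client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=2048,
        messages=[{"role": "user", "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png",
                                         "data": base64.b64encode(png).decode()}},
            {"type": "text", "text": "Extract the readable text content from this page."},
        ]}],
    )
    return msg.content[0].text
```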
Watch the crawler work in real-time. A persistent pool of warm Chromium instances streams viewport frames over WebSocket or MJPEG.
**WebSocket** — connect and send interactive commands:

```js
const ws = new WebSocket("ws://localhost:8080/stream/my-session?url=https://example.com");
ws.onmessage = (e) => {
  const msg = JSON.parse(e.data);
  if (msg.type === "frame") document.getElementById("viewport").src = "data:image/jpeg;base64," + msg.data;
};

// Navigate, click, scroll, type — all over the same socket
ws.send(JSON.stringify({ action: "navigate", url: "https://example.com/pricing" }));
ws.send(JSON.stringify({ action: "click", selector: "#signup-btn" }));
ws.send(JSON.stringify({ action: "scroll", direction: "down" }));
```

**MJPEG** — drop it in an `<img>` tag, instant video:

```html
<img src="http://localhost:8080/stream/my-session/mjpeg?url=https://example.com" />
```

Requires `BROWSER_STREAM_ENABLED=true`. Each Chromium instance uses roughly 150-300 MB of RAM.
```bash
git clone <repo>
cd grub-crawl
cp .env.example .env
pip install -r requirements.txt
uvicorn app.main:app --reload --host 0.0.0.0 --port 8080
```

Enable the agent:

```bash
# Add to .env
AGENT_ENABLED=true
OPENAI_API_KEY=sk-...
# or
ANTHROPIC_API_KEY=sk-ant-...
AGENT_PROVIDER=anthropic
```

Run a task:

```bash
curl -X POST http://localhost:8080/api/agent/run \
  -H "Content-Type: application/json" \
  -d '{
    "task": "Find the pricing page on example.com and extract plan details",
    "max_steps": 10,
    "allowed_domains": ["example.com"]
  }'
```

Enable Ghost Protocol:

```bash
# Add to .env
AGENT_GHOST_ENABLED=true
```

```bash
curl -X POST http://localhost:8080/api/agent/ghost \
  -H "Content-Type: application/json" \
  -d '{"url": "https://blocked-site.com"}'
```

Enable the live stream:

```bash
# Add to .env
BROWSER_STREAM_ENABLED=true
BROWSER_POOL_SIZE=2
```

```bash
# MJPEG (open in browser)
open "http://localhost:8080/stream/demo/mjpeg?url=https://example.com"
```

Configuration is via environment variables:

- `HOST` (default: `0.0.0.0`)
- `PORT` (default: `8080`)
- `DEBUG` (default: `false`)
- `STORAGE_PATH` (default: `./storage`)
- `RUNNING_IN_CLOUD` (default: `false`)
- `GCS_BUCKET_NAME`
- `GOOGLE_CLOUD_PROJECT`

- `DISABLE_AUTH` (default: `false`)
- `GNOSIS_AUTH_URL` (default: `http://gnosis-auth:5000`)

- `MAX_CONCURRENT_CRAWLS` (default: `5`)
- `CRAWL_TIMEOUT` (default: `30`)
- `ENABLE_JAVASCRIPT` (default: `true`)
- `ENABLE_SCREENSHOTS` (default: `false`)

- `AGENT_ENABLED` (default: `false`)
- `AGENT_MAX_STEPS` (default: `12`)
- `AGENT_MAX_WALL_TIME_MS` (default: `90000`)
- `AGENT_MAX_FAILURES` (default: `3`)
- `AGENT_ALLOWED_TOOLS` — comma-separated allowlist
- `AGENT_ALLOWED_DOMAINS` — comma-separated allowlist
- `AGENT_BLOCK_PRIVATE_RANGES` (default: `true`)
- `AGENT_REDACT_SECRETS` (default: `true`)

- `AGENT_PROVIDER` — openai | anthropic | ollama (default: `openai`)
- `OPENAI_API_KEY`
- `OPENAI_MODEL` (default: `gpt-4.1-mini`)
- `ANTHROPIC_API_KEY`
- `ANTHROPIC_MODEL` (default: `claude-3-5-sonnet-latest`)
- `OLLAMA_BASE_URL` (default: `http://localhost:11434`)
- `OLLAMA_MODEL` (default: `llama3.1:8b-instruct`)

- `AGENT_GHOST_ENABLED` (default: `false`)
- `AGENT_GHOST_AUTO_TRIGGER` (default: `true`)
- `AGENT_GHOST_VISION_PROVIDER` — inherits from `AGENT_PROVIDER`
- `AGENT_GHOST_MAX_IMAGE_WIDTH` (default: `1280`)

- `BROWSER_POOL_SIZE` (default: `1`)
- `BROWSER_STREAM_ENABLED` (default: `false`)
- `BROWSER_STREAM_QUALITY` (default: `25`) — JPEG quality 1-100
- `BROWSER_STREAM_MAX_WIDTH` (default: `854`)
- `BROWSER_STREAM_MAX_LEASE_SECONDS` (default: `300`)
`POST /api/markdown` returns:

`success`, `url`, `final_url`, `status_code`, `markdown`, `markdown_plain`, `content`, `render_mode`, `wait_strategy`, `timings_ms`, `blocked`, `block_reason`, `captcha_detected`, `http_error_family`, `body_char_count`, `body_word_count`, `content_quality`, `extractor_version`, `normalized_url`, `content_hash`
`content_quality` is one of:

- `blocked` — anti-bot/captcha/challenge
- `empty` — very low signal
- `minimal` — thin/error pages
- `sufficient` — usable for summarization

Do not summarize unless `content_quality == "sufficient"` (see the sketch below).
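A caller honoring that rule might gate on the field before handing content to an LLM. A sketch against a local instance; the `url` request field and the `summarize` helper are stand-ins, not the documented schema:

```python
import requests

BASE = "http://localhost:8080"

def summarize(markdown: str) -> None:
    print(markdown[:200])  # placeholder for a real summarization step

doc = requests.post(f"{BASE}/api/markdown", json={"url": "https://example.com"}).json()

if doc.get("content_quality") == "sufficient":
    summarize(doc["markdown"])  # real content came back
elif doc.get("blocked"):
    # Anti-bot block detected: fall back to Ghost Protocol instead of summarizing noise
    doc = requests.post(f"{BASE}/api/agent/ghost", json={"url": "https://example.com"}).json()
else:
    pass  # "empty" or "minimal": skip rather than summarize thin or error pages
```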
{"error": "http_error|validation_error|internal_error", "status": 400, "details": {}}- Agent core — state machine, types, errors (W1)
- Unified tool contract — dispatcher with timeout/retry (W2)
- Policy gates — domain allowlist, private-range deny, redaction (W3)
- Observability — EventBus, TraceCollector, RunSummary persistence (W4)
- API wiring —
/api/agent/run,/api/agent/status, JobType.AGENT_RUN (W5) - Provider adapters — OpenAI, Anthropic, Ollama with fallback (W6)
- Config flags — agent, provider, ghost, stream settings (W7)
- Cloak-mode trigger detection (W8)
- Screenshot capture pipeline (W8)
- Vision extraction via Claude/GPT-4o (W8)
- Fallback chain in engine (W8)
- Ghost tool for external callers (W8)
- Ghost MCP tool + REST endpoint (W8)
- Persistent browser pool with lease/return (W9)
- CDP screencast relay (W9)
- WebSocket endpoint with interactive commands (W9)
- MJPEG fallback stream (W9)
- Stream status + pool status endpoints (W9)
- Comprehensive test suite
- Error handling improvements
- Monitoring and alerting
- Performance optimization
See MASTER_PLAN.md for the full architecture plan.
Grub Crawler Project License