Parse API
Transform natural language requests into structured web data using AI-powered parsing workflows that automatically crawl or scrape websites.
Overview
The Parse API revolutionizes web data extraction by understanding natural language prompts and intelligently orchestrating the extraction process:
- Prompt-Driven: Natural language instructions like "Crawl blog for recent posts"
- Intelligent Workflows: AI decides whether to scrape single pages or crawl multiple pages
- Flexible Schemas: Support for any JSON schema from simple to deeply nested structures
- Multiple Formats: JSON, CSV, or Markdown output
- Streaming Support: Real-time results for large crawling operations
- LLM-Powered: Uses Gemini, OpenAI, or Claude models for intelligent extraction
Quick Start
Simple Product Extraction
curl -X POST "https://api.supacrawler.com/v1/parse" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"prompt": "Extract product information from https://shop.example.com/iphone",
"schema": {
"type": "object",
"properties": {
"name": { "type": "string" },
"price": { "type": "number" },
"in_stock": { "type": "boolean" }
},
"required": ["name", "price"]
},
"output_format": "json"
}'
Response
{
"success": true,
"data": {
"name": "iPhone 15 Pro",
"price": 999,
"in_stock": true
},
"workflow_status": "completed",
"pages_processed": 1,
"execution_time": 2400
}
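The same call can be made from any HTTP client. Below is a minimal Python sketch using the third-party requests library (our own choice, not part of the API), assuming the endpoint, headers, and payload shown in the curl example above:

import requests  # pip install requests

API_KEY = "YOUR_API_KEY"

payload = {
    "prompt": "Extract product information from https://shop.example.com/iphone",
    "schema": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "price": {"type": "number"},
            "in_stock": {"type": "boolean"},
        },
        "required": ["name", "price"],
    },
    "output_format": "json",
}

response = requests.post(
    "https://api.supacrawler.com/v1/parse",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,  # sends the body as JSON and sets Content-Type: application/json
    timeout=60,
)
result = response.json()
print(result["data"])  # e.g. {"name": "iPhone 15 Pro", "price": 999, "in_stock": True}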
Request Parameters
Parameter | Type | Description |
---|---|---|
prompt | string | Natural language instruction that may include URLs and extraction requirements. Examples: "Crawl example.com for blog posts", "Extract contact info from company page". |
schema | object | Optional JSON schema defining the expected output structure. Supports any valid JSON schema, including nested objects, arrays, and complex validation rules. |
output_format | string | Preferred output format: "json" (default), "csv", or "markdown". |
stream | boolean | Enable streaming responses for real-time results during crawling operations (default: false). |
max_depth | integer | Maximum crawl depth for link following (1-3, default: 1). Only applies when the AI decides to crawl. |
max_pages | integer | Maximum pages to process (1-100, default: 10). Prevents runaway crawling operations. |
How It Works: Intelligent Workflow
1. Prompt Analysis
The AI analyzes your natural language prompt to understand:
- Action Type: Whether to scrape a single page or crawl multiple pages
- URLs: Extracts mentioned URLs automatically
- Output Format: Detects preferences like "CSV" or "JSON" from context
- Extraction Goal: Understands what data you want to extract
2. Smart Data Collection
Based on the analysis, the system:
- Scrapes: Single page extraction for specific URLs
- Crawls: Multi-page discovery and extraction for comprehensive requests
- Streams: Provides real-time updates for large operations
3. AI Data Extraction
- Uses advanced LLMs to extract structured data from each page
- Follows your JSON schema precisely if provided
- Validates output format and structure automatically
4. Response Formatting
- Aggregates data from all processed pages
- Formats according to your preferences (JSON/CSV/Markdown)
- Provides execution metadata and status tracking
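Because the workflow is chosen by the AI, the response metadata is the quickest way to see which path was taken. A small Python sketch (same hypothetical requests-based pattern as the Quick Start example) that sends a crawl-style prompt with explicit limits and inspects the documented status fields:

import requests

result = requests.post(
    "https://api.supacrawler.com/v1/parse",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "prompt": "Crawl https://example.com/blog for recent posts",
        "max_depth": 2,   # only used if the AI decides to crawl
        "max_pages": 10,  # hard cap on processed pages
    },
    timeout=120,
).json()

# pages_processed greater than 1 indicates the AI crawled rather than scraped a single page
print(result["workflow_status"], result["pages_processed"], result["execution_time"])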
Example Use Cases
Blog Post Crawling
Extract recent blog posts with metadata:
curl -X POST "https://api.supacrawler.com/v1/parse" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"prompt": "Crawl https://example.com/blog and give me the 5 most recent posts in CSV format",
"schema": {
"type": "object",
"properties": {
"posts": {
"type": "array",
"items": {
"type": "object",
"properties": {
"title": { "type": "string" },
"date": { "type": "string" },
"author": { "type": "string" },
"url": { "type": "string" }
},
"required": ["title", "date", "url"]
}
}
}
},
"output_format": "csv",
"max_pages": 5
}'
Response:
title,date,author,url
"Latest AI Trends","2024-01-15","John Doe","https://example.com/blog/ai-trends"
"Web Development Tips","2024-01-12","Jane Smith","https://example.com/blog/web-dev"
E-commerce Product Catalog
Extract product information from shopping sites:
curl -X POST "https://api.supacrawler.com/v1/parse" \
-d '{
"prompt": "Extract all product details from https://shop.example.com/category/electronics",
"schema": {
"type": "object",
"properties": {
"products": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": { "type": "string" },
"price": { "type": "number" },
"rating": { "type": "number" },
"features": {
"type": "array",
"items": { "type": "string" }
},
"availability": { "type": "boolean" }
}
}
}
}
},
"max_pages": 20
}'
Contact Information Extraction
Extract contact details from company pages:
curl -X POST "https://api.supacrawler.com/v1/parse" \
-d '{
"prompt": "Find contact information including email, phone, and social links from https://company.example.com/contact",
"schema": {
"type": "object",
"properties": {
"email": { "type": "string" },
"phone": { "type": "string" },
"address": { "type": "string" },
"social_links": {
"type": "object",
"properties": {
"twitter": { "type": "string" },
"linkedin": { "type": "string" },
"facebook": { "type": "string" }
}
}
}
}
}'
Job Listings Aggregation
Collect job postings with detailed requirements:
curl -X POST "https://api.supacrawler.com/v1/parse" \
-d '{
"prompt": "Crawl job board for software engineering positions and extract details",
"schema": {
"type": "object",
"properties": {
"jobs": {
"type": "array",
"items": {
"type": "object",
"properties": {
"title": { "type": "string" },
"company": { "type": "string" },
"location": { "type": "string" },
"salary_range": { "type": "string" },
"remote": { "type": "boolean" },
"requirements": {
"type": "array",
"items": { "type": "string" }
},
"posted_date": { "type": "string" }
}
}
}
}
},
"output_format": "json",
"max_depth": 2
}'
Advanced Features
Streaming Responses
For large crawling operations, enable streaming to receive results in real-time:
curl -X POST "https://api.supacrawler.com/v1/parse" \
-d '{
"prompt": "Crawl entire news site for article headlines",
"stream": true,
"max_pages": 50
}'
Streaming Response Format:
{
"success": true,
"workflow_status": "crawling",
"pages_processed": 15,
"total_pages": 50,
"partial_results": [
{"title": "Breaking News", "url": "..."},
{"title": "Sports Update", "url": "..."}
]
}
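The exact wire format of the stream is not specified beyond the per-chunk JSON object shown above. The Python sketch below assumes each chunk arrives as one JSON object per line (newline-delimited JSON); treat that framing as an assumption, not a documented guarantee:

import json
import requests

with requests.post(
    "https://api.supacrawler.com/v1/parse",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "prompt": "Crawl entire news site for article headlines",
        "stream": True,
        "max_pages": 50,
    },
    stream=True,   # tell requests not to buffer the whole response body
    timeout=300,
) as response:
    for line in response.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)  # assumption: one JSON object per line
        print(chunk["workflow_status"], chunk.get("pages_processed"))
        for item in chunk.get("partial_results", []):
            print(item)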
Complex Nested Schemas
The API supports arbitrarily complex JSON schemas:
{
"schema": {
"type": "object",
"properties": {
"company": {
"type": "object",
"properties": {
"name": { "type": "string" },
"departments": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": { "type": "string" },
"employees": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": { "type": "string" },
"role": { "type": "string" },
"skills": {
"type": "array",
"items": { "type": "string" }
}
}
}
}
}
}
}
}
}
}
}
}
Intelligent Action Detection
The AI automatically determines the best approach:
Prompt Keywords | Action | Behavior |
---|---|---|
"crawl", "all pages", "entire site" | Crawl | Discovers and processes multiple pages |
"extract", "get", "parse" + single URL | Scrape | Processes only the specified page |
"CSV", "spreadsheet", "table" | CSV Output | Formats data as comma-separated values |
"JSON", "structured data" | JSON Output | Returns structured JSON objects |
Response Format
Field | Type | Description |
---|---|---|
success | boolean | Whether the parsing operation succeeded. |
data | object or string | Extracted data as a JSON object, CSV string, or Markdown text. |
workflow_status | string | Current workflow stage: analyzing, scraping, crawling, extracting, formatting, completed, or failed. |
pages_processed | integer | Number of pages successfully processed. |
total_pages | integer | Total pages discovered (if known). Only available during crawling operations. |
partial_results | array | Incremental results for streaming responses. Each item contains extracted data from an individual page. |
execution_time | integer | Total execution time in milliseconds. |
error | string | Error message if parsing failed. |
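Because data is a JSON object for "json" output but a plain string for "csv" or "markdown", clients should branch on the requested format. A minimal Python sketch (handle_parse_response is a hypothetical helper, not part of the API):

import csv
import io

def handle_parse_response(result, output_format="json"):
    """Unpack a Parse API response dict using the fields documented above."""
    if not result["success"]:
        raise RuntimeError(result.get("error", "parse failed"))
    data = result["data"]
    if output_format == "csv" and isinstance(data, str):
        return list(csv.DictReader(io.StringIO(data)))  # CSV string -> list of dicts
    return data  # JSON object, or raw Markdown/CSV string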
Available Templates & Examples
Get information about supported templates and example schemas:
Templates Endpoint
curl -X GET "https://api.supacrawler.com/v1/parse/templates" \
-H "Authorization: Bearer YOUR_API_KEY"
Response:
{
"success": true,
"templates": {
"workflow_prompt": "Intelligent prompt-based parsing with automatic crawl/scrape detection",
"streaming": "Real-time streaming results as content is processed",
"schema_based": "Structured extraction using user-provided JSON schemas"
},
"content_types": ["any"],
"output_formats": ["json", "csv", "markdown"]
}
Examples Endpoint
curl -X GET "https://api.supacrawler.com/v1/parse/examples" \
-H "Authorization: Bearer YOUR_API_KEY"
Response:
{
"success": true,
"examples": {
"blog_crawl_example": {
"prompt": "Crawl https://example.com/blog and give me the 5 most recent posts in CSV.",
"schema": {
"type": "object",
"properties": {
"posts": {
"type": "array",
"items": {
"type": "object",
"properties": {
"title": { "type": "string" },
"date": { "type": "string" },
"url": { "type": "string" }
},
"required": ["title", "date", "url"]
}
}
},
"required": ["posts"]
}
},
"product_scrape_example": {
"prompt": "Extract product information from https://shop.example.com/product/123",
"schema": {
"type": "object",
"properties": {
"name": { "type": "string" },
"price": { "type": "number" },
"description": { "type": "string" },
"in_stock": { "type": "boolean" }
}
}
}
}
}
Error Handling
Common Errors
Error | Description |
---|---|
400 - Invalid Request | Missing required "prompt" field or malformed JSON schema. |
422 - Unprocessable Entity | The AI could not understand the prompt or extract meaningful data. |
500 - Internal Server Error | LLM provider API failure, rate limit exceeded, or system error. |
Error Response Format
{
"success": false,
"error": "Failed to extract data: No content found at specified URLs",
"workflow_status": "failed",
"pages_processed": 0,
"execution_time": 1200
}
Graceful Degradation
The system automatically handles failures:
- Crawl fails → Falls back to single page scraping
- Schema validation fails → Returns raw extracted text with warning
- URL inaccessible → Skips problematic URLs, continues with others
- LLM timeout → Retries with exponential backoff
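Those fallbacks happen server-side. On the client, the documented error shape (success, error, workflow_status) plus the HTTP status code is enough to decide whether to retry. A hedged Python sketch with simple exponential backoff for 500-level failures (parse_with_retry is a hypothetical helper):

import time
import requests

def parse_with_retry(payload, api_key, attempts=3):
    """POST to /v1/parse, retrying transient server errors with exponential backoff."""
    for attempt in range(attempts):
        resp = requests.post(
            "https://api.supacrawler.com/v1/parse",
            headers={"Authorization": f"Bearer {api_key}"},
            json=payload,
            timeout=120,
        )
        if resp.status_code < 500:            # 400/422 are client errors; do not retry
            body = resp.json()
            if not body.get("success"):
                raise RuntimeError(body.get("error", "parse failed"))
            return body
        time.sleep(2 ** attempt)               # back off 1s, 2s, 4s, ...
    raise RuntimeError("parse failed after retries")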
LLM Model Configuration
Supported Models
Provider | Models | Default | Configuration |
---|---|---|---|
Gemini | gemini-1.5-flash, gemini-1.5-pro | ✅ gemini-1.5-flash | GEMINI_API_KEY |
OpenAI | gpt-4, gpt-4-turbo, gpt-3.5-turbo | gpt-4 | OPENAI_API_KEY |
Claude | claude-3-sonnet, claude-3-haiku | claude-3-sonnet | CLAUDE_API_KEY |
Model Selection Strategy
The system automatically selects models based on task complexity:
- Simple extraction → Faster, cheaper models (Gemini Flash, GPT-3.5)
- Complex schemas → More capable models (GPT-4, Claude Sonnet)
- Large content → Models with larger context windows
Best Practices
Prompt Design
- Be Specific: "Extract product names and prices from the electronics category" vs "Get product info"
- Include URLs: Always mention the target URLs in your prompt
- Specify Format: Add "in CSV format" or "as JSON" to guide output formatting
- Set Limits: Use max_pages to prevent excessive crawling
Schema Design
- Use Proper Types: Specify "number", "boolean", "array" instead of just "string"
- Mark Required Fields: Include "required" arrays for essential data
- Nested Structures: Support complex data with nested objects and arrays
- Validation Rules: Add "pattern", "minimum", "maximum" for data validation (see the sketch below)
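For example, a hypothetical product schema combining these practices (field names are illustrative; the validation keywords are standard JSON Schema):

# Hypothetical schema, written as a Python dict, combining typed fields,
# required fields, nested arrays, and validation rules
product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "sku": {"type": "string", "pattern": "^[A-Z0-9-]+$"},     # pattern validation
        "price": {"type": "number", "minimum": 0},                 # numeric lower bound
        "rating": {"type": "number", "minimum": 0, "maximum": 5},  # numeric range
        "tags": {"type": "array", "items": {"type": "string"}},    # nested array
    },
    "required": ["name", "price"],
}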
Performance Optimization
- Streaming for Large Operations: Enable stream: true for 10+ pages
- Reasonable Limits: Don't set max_pages higher than needed
- Cache-Friendly: Similar prompts benefit from internal caching
- Schema Reuse: Consistent schemas improve extraction accuracy
Rate Limits & Pricing
Plan | Requests/minute | Requests/day | Pages/request | Streaming |
---|---|---|---|---|
Free | 10 | 100 | 5 | ❌ |
Pro | 100 | 10,000 | 50 | ✅ |
Enterprise | 1,000 | 100,000 | 100 | ✅ |
Rate Limit Headers:
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1640995200
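Clients can read these headers to pace themselves instead of waiting for a rejected request. A small Python sketch that sleeps until the reset time once the remaining quota hits zero:

import time
import requests

resp = requests.post(
    "https://api.supacrawler.com/v1/parse",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"prompt": "Extract the title from https://example.com"},
    timeout=60,
)

remaining = int(resp.headers.get("X-RateLimit-Remaining", 1))
reset_at = int(resp.headers.get("X-RateLimit-Reset", 0))  # Unix timestamp
if remaining == 0:
    time.sleep(max(0, reset_at - time.time()))  # wait for the rate-limit window to reset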
Migration from Legacy Parse API
Before (Legacy HTML Parsing)
{
"html_content": "<div>Product content...</div>",
"user_prompt": "Extract product details",
"output_spec": { "name": "string", "price": "number" },
"output_format": "json"
}
After (Intelligent Workflow)
{
"prompt": "Extract product details from https://shop.example.com/product",
"schema": {
"type": "object",
"properties": {
"name": { "type": "string" },
"price": { "type": "number" }
}
},
"output_format": "json"
}
Key Differences
- Prompt-driven: Natural language instructions instead of separate content + prompt
- URL-based: URLs in prompts instead of pre-scraped content
- Schema-based: Full JSON schema instead of simple type specifications
- Workflow-aware: Automatic scrape vs crawl decisions based on prompt analysis
Local Development & Testing
Test with Example Prompts
# Simple extraction
curl -X POST http://localhost:8081/v1/parse \
-H "Content-Type: application/json" \
-d '{
"prompt": "Extract the title and description from https://example.com"
}'
# Bulk extraction
curl -X POST http://localhost:8081/v1/parse \
-d '{
"prompt": "Crawl https://news.ycombinator.com and get the top 10 story titles",
"output_format": "csv",
"max_pages": 10
}'
# Complex schema
curl -X POST http://localhost:8081/v1/parse \
-d '{
"prompt": "Extract all team member info from https://company.example.com/team",
"schema": {
"type": "object",
"properties": {
"team": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": { "type": "string" },
"role": { "type": "string" },
"bio": { "type": "string" }
}
}
}
}
}
}'
Environment Variables
# Required: LLM Provider API Keys
GEMINI_API_KEY=your_gemini_key
OPENAI_API_KEY=your_openai_key # Optional
CLAUDE_API_KEY=your_claude_key # Optional
# Optional: Service Configuration
PARSE_MAX_RETRIES=3
PARSE_TIMEOUT=30s
PARSE_DEFAULT_MODEL=gemini-1.5-flash
The Parse API represents a significant evolution in web data extraction, moving from static HTML parsing to intelligent, AI-driven workflows that understand natural language and automatically orchestrate complex data collection processes.