Parse API

Transform natural language requests into structured web data using AI-powered parsing workflows that automatically crawl or scrape websites.

Overview

The Parse API revolutionizes web data extraction by understanding natural language prompts and intelligently orchestrating the extraction process:

  • Prompt-Driven: Natural language instructions like "Crawl blog for recent posts"
  • Intelligent Workflows: AI decides whether to scrape single pages or crawl multiple pages
  • Flexible Schemas: Support for any JSON schema from simple to deeply nested structures
  • Multiple Formats: JSON, CSV, or Markdown output
  • Streaming Support: Real-time results for large crawling operations
  • LLM-Powered: Uses Gemini, OpenAI, or Claude models for intelligent extraction

Quick Start

Simple Product Extraction

curl -X POST "https://api.supacrawler.com/v1/parse" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "prompt": "Extract product information from https://shop.example.com/iphone",
    "schema": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "price": { "type": "number" },
        "in_stock": { "type": "boolean" }
      },
      "required": ["name", "price"]
    },
    "output_format": "json"
  }'

Response

{
  "success": true,
  "data": {
    "name": "iPhone 15 Pro",
    "price": 999,
    "in_stock": true
  },
  "workflow_status": "completed",
  "pages_processed": 1,
  "execution_time": 2400
}

Request Parameters

  • prompt (string, required): Natural language instruction that may include URLs and extraction requirements. Examples: "Crawl example.com for blog posts", "Extract contact info from company page".

  • schema (object): Optional JSON schema defining the expected output structure. Supports any valid JSON schema, including nested objects, arrays, and complex validation rules.

  • output_format (string): Preferred output format: "json" (default), "csv", or "markdown".

  • stream (boolean): Enable streaming responses for real-time results during crawling operations (default: false).

  • max_depth (integer): Maximum crawl depth for link following (1-3, default: 1). Only applies when the AI decides to crawl.

  • max_pages (integer): Maximum pages to process (1-100, default: 10). Prevents runaway crawling operations.
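
Putting these together, a minimal Python sketch of a full request using the requests library (the endpoint and headers match the curl examples above; the prompt and schema are illustrative):

import requests

# Endpoint and auth header as shown in the curl examples above.
API_URL = "https://api.supacrawler.com/v1/parse"
HEADERS = {
    "Content-Type": "application/json",
    "Authorization": "Bearer YOUR_API_KEY",
}

payload = {
    "prompt": "Crawl https://example.com/blog for recent posts",
    "schema": {
        "type": "object",
        "properties": {
            "posts": {"type": "array", "items": {"type": "object"}},
        },
    },
    "output_format": "json",
    "stream": False,   # set True for incremental results on large crawls
    "max_depth": 2,    # only consulted if the AI decides to crawl
    "max_pages": 10,   # hard cap against runaway crawls
}

response = requests.post(API_URL, json=payload, headers=HEADERS, timeout=120)
response.raise_for_status()
print(response.json())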


How It Works: Intelligent Workflow

1. Prompt Analysis

The AI analyzes your natural language prompt to understand:

  • Action Type: Whether to scrape a single page or crawl multiple pages
  • URLs: Extracts mentioned URLs automatically
  • Output Format: Detects preferences like "CSV" or "JSON" from context
  • Extraction Goal: Understands what data you want to extract

2. Smart Data Collection

Based on the analysis, the system:

  • Scrapes: Single page extraction for specific URLs
  • Crawls: Multi-page discovery and extraction for comprehensive requests
  • Streams: Provides real-time updates for large operations

3. AI Data Extraction

  • Uses advanced LLMs to extract structured data from each page
  • Follows your JSON schema precisely if provided
  • Validates output format and structure automatically

4. Response Formatting

  • Aggregates data from all processed pages
  • Formats according to your preferences (JSON/CSV/Markdown)
  • Provides execution metadata and status tracking

Example Use Cases

Blog Post Crawling

Extract recent blog posts with metadata:

curl -X POST "https://api.supacrawler.com/v1/parse" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "prompt": "Crawl https://example.com/blog and give me the 5 most recent posts in CSV format",
    "schema": {
      "type": "object",
      "properties": {
        "posts": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "title": { "type": "string" },
              "date": { "type": "string" },
              "author": { "type": "string" },
              "url": { "type": "string" }
            },
            "required": ["title", "date", "url"]
          }
        }
      }
    },
    "output_format": "csv",
    "max_pages": 5
  }'

Response:

title,date,author,url
"Latest AI Trends","2024-01-15","John Doe","https://example.com/blog/ai-trends"
"Web Development Tips","2024-01-12","Jane Smith","https://example.com/blog/web-dev"

E-commerce Product Catalog

Extract product information from shopping sites:

curl -X POST "https://api.supacrawler.com/v1/parse" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "prompt": "Extract all product details from https://shop.example.com/category/electronics",
    "schema": {
      "type": "object",
      "properties": {
        "products": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": { "type": "string" },
              "price": { "type": "number" },
              "rating": { "type": "number" },
              "features": { 
                "type": "array", 
                "items": { "type": "string" }
              },
              "availability": { "type": "boolean" }
            }
          }
        }
      }
    },
    "max_pages": 20
  }'

Contact Information Extraction

Extract contact details from company pages:

curl -X POST "https://api.supacrawler.com/v1/parse" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "prompt": "Find contact information including email, phone, and social links from https://company.example.com/contact",
    "schema": {
      "type": "object",
      "properties": {
        "email": { "type": "string" },
        "phone": { "type": "string" },
        "address": { "type": "string" },
        "social_links": {
          "type": "object",
          "properties": {
            "twitter": { "type": "string" },
            "linkedin": { "type": "string" },
            "facebook": { "type": "string" }
          }
        }
      }
    }
  }'

Job Listings Aggregation

Collect job postings with detailed requirements:

curl -X POST "https://api.supacrawler.com/v1/parse" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "prompt": "Crawl job board for software engineering positions and extract details",
    "schema": {
      "type": "object",
      "properties": {
        "jobs": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "title": { "type": "string" },
              "company": { "type": "string" },
              "location": { "type": "string" },
              "salary_range": { "type": "string" },
              "remote": { "type": "boolean" },
              "requirements": {
                "type": "array",
                "items": { "type": "string" }
              },
              "posted_date": { "type": "string" }
            }
          }
        }
      }
    },
    "output_format": "json",
    "max_depth": 2
  }'

Advanced Features

Streaming Responses

For large crawling operations, enable streaming to receive results in real-time:

curl -X POST "https://api.supacrawler.com/v1/parse" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "prompt": "Crawl entire news site for article headlines",
    "stream": true,
    "max_pages": 50
  }'

Streaming Response Format:

{
  "success": true,
  "workflow_status": "crawling",
  "pages_processed": 15,
  "total_pages": 50,
  "partial_results": [
    {"title": "Breaking News", "url": "..."},
    {"title": "Sports Update", "url": "..."}
  ]
}
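
A sketch of consuming such a stream with Python's requests library. It assumes each update is delivered as one JSON document per line (newline-delimited JSON); adjust the framing if your client receives something different:

import json
import requests

resp = requests.post(
    "https://api.supacrawler.com/v1/parse",
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer YOUR_API_KEY",
    },
    json={
        "prompt": "Crawl entire news site for article headlines",
        "stream": True,
        "max_pages": 50,
    },
    stream=True,   # let requests yield the body incrementally
    timeout=300,
)

for line in resp.iter_lines():
    if not line:
        continue
    # Assumption: each update arrives as one JSON document per line.
    update = json.loads(line)
    print(update["workflow_status"], update.get("pages_processed"), "/",
          update.get("total_pages"))
    for item in update.get("partial_results", []):
        print("  -", item.get("title"))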

Complex Nested Schemas

The API supports arbitrarily complex JSON schemas:

{
  "schema": {
    "type": "object",
    "properties": {
      "company": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "departments": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "name": { "type": "string" },
                "employees": {
                  "type": "array",
                  "items": {
                    "type": "object",
                    "properties": {
                      "name": { "type": "string" },
                      "role": { "type": "string" },
                      "skills": {
                        "type": "array",
                        "items": { "type": "string" }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
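
Deeply nested schemas are easy to get wrong, so it can help to validate them locally before sending. A sketch using the third-party jsonschema package (assuming the API accepts standard JSON Schema, as the examples suggest):

from jsonschema import Draft7Validator  # pip install jsonschema

schema = {
    "type": "object",
    "properties": {
        "company": {
            "type": "object",
            "properties": {"name": {"type": "string"}},
        },
    },
}

# Raises jsonschema.exceptions.SchemaError if the schema is malformed.
Draft7Validator.check_schema(schema)

# Optionally check a hand-written sample against it before calling the API.
Draft7Validator(schema).validate({"company": {"name": "Acme"}})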

Intelligent Action Detection

The AI automatically determines the best approach:

| Prompt Keywords | Action | Behavior |
| --- | --- | --- |
| "crawl", "all pages", "entire site" | Crawl | Discovers and processes multiple pages |
| "extract", "get", "parse" + single URL | Scrape | Processes only the specified page |
| "CSV", "spreadsheet", "table" | CSV Output | Formats data as comma-separated values |
| "JSON", "structured data" | JSON Output | Returns structured JSON objects |

Response Format

  • success (boolean): Whether the parsing operation succeeded.

  • data (object | string): Extracted data as a JSON object, CSV string, or Markdown text.

  • workflow_status (string): Current workflow stage: analyzing, scraping, crawling, extracting, formatting, completed, or failed.

  • pages_processed (integer): Number of pages successfully processed.

  • total_pages (integer): Total pages discovered (if known). Only available during crawling operations.

  • partial_results (array): Incremental results for streaming responses. Each item contains extracted data from individual pages.

  • execution_time (integer): Total execution time in milliseconds.

  • error (string): Error message if parsing failed.
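
For typed client code, these fields can be mirrored in a small structure. An illustrative Python sketch (the class and function names are ours; the fields come from the list above):

from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class ParseResponse:
    success: bool
    workflow_status: str
    pages_processed: int
    execution_time: int                    # milliseconds
    data: Any = None                       # JSON object, CSV string, or Markdown
    total_pages: Optional[int] = None      # only present during crawling
    partial_results: list = field(default_factory=list)
    error: Optional[str] = None

def from_body(body: dict) -> ParseResponse:
    known = ParseResponse.__dataclass_fields__
    return ParseResponse(**{k: v for k, v in body.items() if k in known})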


Available Templates & Examples

Get information about supported templates and example schemas:

Templates Endpoint

curl -X GET "https://api.supacrawler.com/v1/parse/templates" \
  -H "Authorization: Bearer YOUR_API_KEY"

Response:

{
  "success": true,
  "templates": {
    "workflow_prompt": "Intelligent prompt-based parsing with automatic crawl/scrape detection",
    "streaming": "Real-time streaming results as content is processed",
    "schema_based": "Structured extraction using user-provided JSON schemas"
  },
  "content_types": ["any"],
  "output_formats": ["json", "csv", "markdown"]
}

Examples Endpoint

curl -X GET "https://api.supacrawler.com/v1/parse/examples" \
  -H "Authorization: Bearer YOUR_API_KEY"

Response:

{
  "success": true,
  "examples": {
    "blog_crawl_example": {
      "prompt": "Crawl https://example.com/blog and give me the 5 most recent posts in CSV.",
      "schema": {
        "type": "object",
        "properties": {
          "posts": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "title": { "type": "string" },
                "date": { "type": "string" },
                "url": { "type": "string" }
              },
              "required": ["title", "date", "url"]
            }
          }
        },
        "required": ["posts"]
      }
    },
    "product_scrape_example": {
      "prompt": "Extract product information from https://shop.example.com/product/123",
      "schema": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "price": { "type": "number" },
          "description": { "type": "string" },
          "in_stock": { "type": "boolean" }
        }
      }
    }
  }
}

Error Handling

Common Errors

  • 400 - Invalid Request: Missing required prompt field or malformed JSON schema.

  • 422 - Unprocessable Entity: The AI couldn't understand the prompt or extract meaningful data.

  • 500 - Internal Server Error: LLM provider API failure, rate limit exceeded, or system error.

Error Response Format

{
  "success": false,
  "error": "Failed to extract data: No content found at specified URLs",
  "workflow_status": "failed",
  "pages_processed": 0,
  "execution_time": 1200
}
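
A sketch of client-side handling that branches on the HTTP status codes listed above and the success flag in the body:

import requests

resp = requests.post(
    "https://api.supacrawler.com/v1/parse",
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer YOUR_API_KEY",
    },
    json={"prompt": "Extract the title from https://example.com"},
)

if resp.status_code == 400:
    print("Invalid request: check the prompt field and schema syntax")
elif resp.status_code == 422:
    print("The AI could not extract meaningful data for this prompt")
elif resp.status_code >= 500:
    print("Server-side failure (LLM provider error or system error); retry later")
else:
    body = resp.json()
    if not body.get("success"):
        print("Parse failed:", body.get("error"))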

Graceful Degradation

The system automatically handles failures:

  • Crawl fails → Falls back to single page scraping
  • Schema validation fails → Returns raw extracted text with warning
  • URL inaccessible → Skips problematic URLs, continues with others
  • LLM timeout → Retries with exponential backoff (a client-side analogue is sketched below)
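
That retry behaviour is server-side; a client can apply the same idea to transient 5xx failures. A minimal sketch of exponential backoff (the delays and attempt count are illustrative):

import time
import requests

def parse_with_retry(payload: dict, attempts: int = 3) -> dict:
    """POST to /v1/parse, retrying server-side failures with exponential backoff."""
    delay = 1.0
    for _ in range(attempts):
        resp = requests.post(
            "https://api.supacrawler.com/v1/parse",
            headers={
                "Content-Type": "application/json",
                "Authorization": "Bearer YOUR_API_KEY",
            },
            json=payload,
            timeout=120,
        )
        if resp.status_code < 500:   # only retry 5xx responses
            return resp.json()
        time.sleep(delay)
        delay *= 2                   # 1s, 2s, 4s, ...
    resp.raise_for_status()          # surface the final failure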

LLM Model Configuration

Supported Models

| Provider | Models | Default | Configuration |
| --- | --- | --- | --- |
| Gemini | gemini-1.5-flash, gemini-1.5-pro | ✅ gemini-1.5-flash | GEMINI_API_KEY |
| OpenAI | gpt-4, gpt-4-turbo, gpt-3.5-turbo | gpt-4 | OPENAI_API_KEY |
| Claude | claude-3-sonnet, claude-3-haiku | claude-3-sonnet | CLAUDE_API_KEY |

Model Selection Strategy

The system automatically selects models based on task complexity:

  • Simple extraction → Faster, cheaper models (Gemini Flash, GPT-3.5)
  • Complex schemas → More capable models (GPT-4, Claude Sonnet)
  • Large content → Models with larger context windows

Best Practices

Prompt Design

  • Be Specific: "Extract product names and prices from the electronics category" vs "Get product info"
  • Include URLs: Always mention the target URLs in your prompt
  • Specify Format: Add "in CSV format" or "as JSON" to guide output formatting
  • Set Limits: Use max_pages to prevent excessive crawling

Schema Design

  • Use Proper Types: Specify "number", "boolean", "array" instead of just "string"
  • Mark Required Fields: Include "required" arrays for essential data
  • Nested Structures: Support complex data with nested objects and arrays
  • Validation Rules: Add "pattern", "minimum", "maximum" for data validation (see the example below)
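
For example, a schema combining these validation rules (the field names are illustrative):

{
  "type": "object",
  "properties": {
    "sku": { "type": "string", "pattern": "^[A-Z]{3}-[0-9]{4}$" },
    "price": { "type": "number", "minimum": 0 },
    "rating": { "type": "number", "minimum": 0, "maximum": 5 }
  },
  "required": ["sku", "price"]
}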

Performance Optimization

  • Streaming for Large Operations: Enable stream: true for 10+ pages
  • Reasonable Limits: Don't set max_pages higher than needed
  • Cache-Friendly: Similar prompts benefit from internal caching
  • Schema Reuse: Consistent schemas improve extraction accuracy

Rate Limits & Pricing

| Plan | Requests/minute | Requests/day | Pages/request |
| --- | --- | --- | --- |
| Free | 10 | 100 | 5 |
| Pro | 100 | 10,000 | 50 |
| Enterprise | 1,000 | 100,000 | 100 |

Rate Limit Headers:

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1640995200
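
A small helper that respects these headers by pausing when the remaining quota hits zero (X-RateLimit-Reset is a Unix timestamp, per the example above):

import time

def respect_rate_limit(resp) -> None:
    """Pause until the rate-limit window resets when the quota is exhausted."""
    remaining = int(resp.headers.get("X-RateLimit-Remaining", "1"))
    if remaining == 0:
        reset_at = int(resp.headers["X-RateLimit-Reset"])  # Unix timestamp
        time.sleep(max(0.0, reset_at - time.time()))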

Migration from Legacy Parse API

Before (Legacy HTML Parsing)

{
  "html_content": "<div>Product content...</div>",
  "user_prompt": "Extract product details",
  "output_spec": { "name": "string", "price": "number" },
  "output_format": "json"
}

After (Intelligent Workflow)

{
  "prompt": "Extract product details from https://shop.example.com/product",
  "schema": {
    "type": "object",
    "properties": {
      "name": { "type": "string" },
      "price": { "type": "number" }
    }
  },
  "output_format": "json"
}

Key Differences

  1. Prompt-driven: Natural language instructions instead of separate content + prompt
  2. URL-based: URLs in prompts instead of pre-scraped content
  3. Schema-based: Full JSON schema instead of simple type specifications (a conversion sketch follows below)
  4. Workflow-aware: Automatic scrape vs crawl decisions based on prompt analysis
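
The schema half of this migration can be mechanized. A sketch that maps a legacy output_spec onto the new schema parameter (it assumes output_spec values are plain JSON type names, as in the legacy example above):

def output_spec_to_schema(output_spec: dict) -> dict:
    """Convert a legacy output_spec like {"name": "string", "price": "number"}
    into a JSON schema for the new `schema` parameter."""
    return {
        "type": "object",
        "properties": {field: {"type": type_name}
                       for field, type_name in output_spec.items()},
    }

# {"type": "object", "properties": {"name": {"type": "string"},
#                                   "price": {"type": "number"}}}
print(output_spec_to_schema({"name": "string", "price": "number"}))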

Local Development & Testing

Test with Example Prompts

# Simple extraction
curl -X POST http://localhost:8081/v1/parse \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Extract the title and description from https://example.com"
  }'

# Bulk extraction  
curl -X POST http://localhost:8081/v1/parse \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Crawl https://news.ycombinator.com and get the top 10 story titles",
    "output_format": "csv",
    "max_pages": 10
  }'

# Complex schema
curl -X POST http://localhost:8081/v1/parse \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Extract all team member info from https://company.example.com/team",
    "schema": {
      "type": "object",
      "properties": {
        "team": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": { "type": "string" },
              "role": { "type": "string" },
              "bio": { "type": "string" }
            }
          }
        }
      }
    }
  }'

Environment Variables

# Required: LLM Provider API Keys
GEMINI_API_KEY=your_gemini_key
OPENAI_API_KEY=your_openai_key  # Optional
CLAUDE_API_KEY=your_claude_key  # Optional

# Optional: Service Configuration
PARSE_MAX_RETRIES=3
PARSE_TIMEOUT=30s
PARSE_DEFAULT_MODEL=gemini-1.5-flash

The Parse API represents a significant evolution in web data extraction, moving from static HTML parsing to intelligent, AI-driven workflows that understand natural language and automatically orchestrate complex data collection processes.
