Parse API

Transform natural language requests into structured web data using AI-powered parsing workflows that automatically crawl or scrape websites.

Overview

The Parse API revolutionizes web data extraction by understanding natural language prompts and intelligently orchestrating the extraction process:

  • Prompt-Driven: Natural language instructions like "Crawl blog for recent posts"
  • Intelligent Workflows: AI decides whether to scrape single pages or crawl multiple pages
  • Flexible Schemas: Support for any JSON schema from simple to deeply nested structures
  • Multiple Formats: JSON, CSV, or Markdown output
  • Streaming Support: Real-time results for large crawling operations
  • LLM-Powered: Uses Gemini, OpenAI, or Claude models for intelligent extraction

Quick Start

Simple Product Extraction

curl -X POST "https://api.supacrawler.com/v1/parse" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "prompt": "Extract product information from https://shop.example.com/iphone",
    "schema": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "price": { "type": "number" },
        "in_stock": { "type": "boolean" }
      },
      "required": ["name", "price"]
    },
    "output_format": "json"
  }'

Response

{
  "success": true,
  "data": {
    "name": "iPhone 15 Pro",
    "price": 999,
    "in_stock": true
  },
  "workflow_status": "completed",
  "pages_processed": 1,
  "execution_time": 2400
}

Request Parameters

  • prompt (string, required): Natural language instruction that may include URLs and extraction requirements. Examples: "Crawl example.com for blog posts", "Extract contact info from company page".

  • schema (object): Optional JSON schema defining the expected output structure. Supports any valid JSON schema, including nested objects, arrays, and complex validation rules.

  • output_format (string): Preferred output format: "json" (default), "csv", or "markdown".

  • stream (boolean): Enable streaming responses for real-time results during crawling operations (default: false).

  • max_depth (integer): Maximum crawl depth for link following (1-3, default: 1). Only applies when the AI decides to crawl.

  • max_pages (integer): Maximum pages to process (1-100, default: 10). Prevents runaway crawling operations.
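
Putting these together, a minimal Python sketch of a full request using the requests library (the endpoint and headers match the curl examples above; the prompt and schema are illustrative):

import requests

# Endpoint and auth header as shown in the curl examples above.
API_URL = "https://api.supacrawler.com/v1/parse"
HEADERS = {
    "Content-Type": "application/json",
    "Authorization": "Bearer YOUR_API_KEY",
}

payload = {
    "prompt": "Crawl https://example.com/blog for recent posts",
    "schema": {
        "type": "object",
        "properties": {
            "posts": {"type": "array", "items": {"type": "object"}},
        },
    },
    "output_format": "json",
    "stream": False,   # set True for incremental results on large crawls
    "max_depth": 2,    # only consulted if the AI decides to crawl
    "max_pages": 10,   # hard cap against runaway crawls
}

response = requests.post(API_URL, json=payload, headers=HEADERS, timeout=120)
response.raise_for_status()
print(response.json())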


How It Works: Intelligent Workflow

1. Prompt Analysis

The AI analyzes your natural language prompt to understand:

  • Action Type: Whether to scrape a single page or crawl multiple pages
  • URLs: Extracts mentioned URLs automatically
  • Output Format: Detects preferences like "CSV" or "JSON" from context
  • Extraction Goal: Understands what data you want to extract

2. Smart Data Collection

Based on the analysis, the system:

  • Scrapes: Single page extraction for specific URLs
  • Crawls: Multi-page discovery and extraction for comprehensive requests
  • Streams: Provides real-time updates for large operations

3. AI Data Extraction

  • Uses advanced LLMs to extract structured data from each page
  • Follows your JSON schema precisely if provided
  • Validates output format and structure automatically

4. Response Formatting

  • Aggregates data from all processed pages
  • Formats according to your preferences (JSON/CSV/Markdown)
  • Provides execution metadata and status tracking

Example Use Cases

Blog Post Crawling

Extract recent blog posts with metadata:

curl -X POST "https://api.supacrawler.com/v1/parse" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "prompt": "Crawl https://example.com/blog and give me the 5 most recent posts in CSV format",
    "schema": {
      "type": "object",
      "properties": {
        "posts": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "title": { "type": "string" },
              "date": { "type": "string" },
              "author": { "type": "string" },
              "url": { "type": "string" }
            },
            "required": ["title", "date", "url"]
          }
        }
      }
    },
    "output_format": "csv",
    "max_pages": 5
  }'

Response:

title,date,author,url
"Latest AI Trends","2024-01-15","John Doe","https://example.com/blog/ai-trends"
"Web Development Tips","2024-01-12","Jane Smith","https://example.com/blog/web-dev"

E-commerce Product Catalog

Extract product information from shopping sites:

curl -X POST "https://api.supacrawler.com/v1/parse" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "prompt": "Extract all product details from https://shop.example.com/category/electronics",
    "schema": {
      "type": "object",
      "properties": {
        "products": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": { "type": "string" },
              "price": { "type": "number" },
              "rating": { "type": "number" },
              "features": { 
                "type": "array", 
                "items": { "type": "string" }
              },
              "availability": { "type": "boolean" }
            }
          }
        }
      }
    },
    "max_pages": 20
  }'

Contact Information Extraction

Extract contact details from company pages:

curl -X POST "https://api.supacrawler.com/v1/parse" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "prompt": "Find contact information including email, phone, and social links from https://company.example.com/contact",
    "schema": {
      "type": "object",
      "properties": {
        "email": { "type": "string" },
        "phone": { "type": "string" },
        "address": { "type": "string" },
        "social_links": {
          "type": "object",
          "properties": {
            "twitter": { "type": "string" },
            "linkedin": { "type": "string" },
            "facebook": { "type": "string" }
          }
        }
      }
    }
  }'

Job Listings Aggregation

Collect job postings with detailed requirements:

curl -X POST "https://api.supacrawler.com/v1/parse" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "prompt": "Crawl job board for software engineering positions and extract details",
    "schema": {
      "type": "object",
      "properties": {
        "jobs": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "title": { "type": "string" },
              "company": { "type": "string" },
              "location": { "type": "string" },
              "salary_range": { "type": "string" },
              "remote": { "type": "boolean" },
              "requirements": {
                "type": "array",
                "items": { "type": "string" }
              },
              "posted_date": { "type": "string" }
            }
          }
        }
      }
    },
    "output_format": "json",
    "max_depth": 2
  }'

Advanced Features

Streaming Responses

For large crawling operations, enable streaming to receive results in real-time:

curl -X POST "https://api.supacrawler.com/v1/parse" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "prompt": "Crawl entire news site for article headlines",
    "stream": true,
    "max_pages": 50
  }'

Streaming Response Format:

{
  "success": true,
  "workflow_status": "crawling",
  "pages_processed": 15,
  "total_pages": 50,
  "partial_results": [
    {"title": "Breaking News", "url": "..."},
    {"title": "Sports Update", "url": "..."}
  ]
}
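
A sketch of consuming such a stream with Python's requests library. It assumes each update is delivered as one JSON document per line (newline-delimited JSON); adjust the framing if your client receives something different:

import json
import requests

resp = requests.post(
    "https://api.supacrawler.com/v1/parse",
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer YOUR_API_KEY",
    },
    json={
        "prompt": "Crawl entire news site for article headlines",
        "stream": True,
        "max_pages": 50,
    },
    stream=True,   # let requests yield the body incrementally
    timeout=300,
)

for line in resp.iter_lines():
    if not line:
        continue
    # Assumption: each update arrives as one JSON document per line.
    update = json.loads(line)
    print(update["workflow_status"], update.get("pages_processed"), "/",
          update.get("total_pages"))
    for item in update.get("partial_results", []):
        print("  -", item.get("title"))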

Complex Nested Schemas

The API supports arbitrarily complex JSON schemas:

{
  "schema": {
    "type": "object",
    "properties": {
      "company": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "departments": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "name": { "type": "string" },
                "employees": {
                  "type": "array",
                  "items": {
                    "type": "object",
                    "properties": {
                      "name": { "type": "string" },
                      "role": { "type": "string" },
                      "skills": {
                        "type": "array",
                        "items": { "type": "string" }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
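
Deeply nested schemas are easy to get wrong, so it can help to validate them locally before sending. A sketch using the third-party jsonschema package (assuming the API accepts standard JSON Schema, as the examples suggest):

from jsonschema import Draft7Validator  # pip install jsonschema

schema = {
    "type": "object",
    "properties": {
        "company": {
            "type": "object",
            "properties": {"name": {"type": "string"}},
        },
    },
}

# Raises jsonschema.exceptions.SchemaError if the schema is malformed.
Draft7Validator.check_schema(schema)

# Optionally check a hand-written sample against it before calling the API.
Draft7Validator(schema).validate({"company": {"name": "Acme"}})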

Intelligent Action Detection

The AI automatically determines the best approach:

| Prompt Keywords | Action | Behavior |
| --- | --- | --- |
| "crawl", "all pages", "entire site" | Crawl | Discovers and processes multiple pages |
| "extract", "get", "parse" + single URL | Scrape | Processes only the specified page |
| "CSV", "spreadsheet", "table" | CSV Output | Formats data as comma-separated values |
| "JSON", "structured data" | JSON Output | Returns structured JSON objects |

Response Format

  • success (boolean): Whether the parsing operation succeeded.

  • data (object | string): Extracted data as a JSON object, CSV string, or Markdown text.

  • workflow_status (string): Current workflow stage: analyzing, scraping, crawling, extracting, formatting, completed, or failed.

  • pages_processed (integer): Number of pages successfully processed.

  • total_pages (integer): Total pages discovered (if known). Only available during crawling operations.

  • partial_results (array): Incremental results for streaming responses. Each item contains extracted data from individual pages.

  • execution_time (integer): Total execution time in milliseconds.

  • error (string): Error message if parsing failed.
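
For typed client code, these fields can be mirrored in a small structure. An illustrative Python sketch (the class and function names are ours; the fields come from the list above):

from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class ParseResponse:
    success: bool
    workflow_status: str
    pages_processed: int
    execution_time: int                    # milliseconds
    data: Any = None                       # JSON object, CSV string, or Markdown
    total_pages: Optional[int] = None      # only present during crawling
    partial_results: list = field(default_factory=list)
    error: Optional[str] = None

def from_body(body: dict) -> ParseResponse:
    known = ParseResponse.__dataclass_fields__
    return ParseResponse(**{k: v for k, v in body.items() if k in known})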


Available Templates & Examples

Get information about supported templates and example schemas:

Templates Endpoint

curl -X GET "https://api.supacrawler.com/v1/parse/templates" \
  -H "Authorization: Bearer YOUR_API_KEY"

Response:

{
  "success": true,
  "templates": {
    "workflow_prompt": "Intelligent prompt-based parsing with automatic crawl/scrape detection",
    "streaming": "Real-time streaming results as content is processed",
    "schema_based": "Structured extraction using user-provided JSON schemas"
  },
  "content_types": ["any"],
  "output_formats": ["json", "csv", "markdown"]
}

Examples Endpoint

curl -X GET "https://api.supacrawler.com/v1/parse/examples" \
  -H "Authorization: Bearer YOUR_API_KEY"

Response:

{
  "success": true,
  "examples": {
    "blog_crawl_example": {
      "prompt": "Crawl https://example.com/blog and give me the 5 most recent posts in CSV.",
      "schema": {
        "type": "object",
        "properties": {
          "posts": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "title": { "type": "string" },
                "date": { "type": "string" },
                "url": { "type": "string" }
              },
              "required": ["title", "date", "url"]
            }
          }
        },
        "required": ["posts"]
      }
    },
    "product_scrape_example": {
      "prompt": "Extract product information from https://shop.example.com/product/123",
      "schema": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "price": { "type": "number" },
          "description": { "type": "string" },
          "in_stock": { "type": "boolean" }
        }
      }
    }
  }
}

Error Handling

Common Errors

  • 400 - Invalid Request: Missing required prompt field or malformed JSON schema.

  • 422 - Unprocessable Entity: The AI couldn't understand the prompt or extract meaningful data.

  • 500 - Internal Server Error: LLM provider API failure, rate limit exceeded, or system error.

Error Response Format

{
  "success": false,
  "error": "Failed to extract data: No content found at specified URLs",
  "workflow_status": "failed",
  "pages_processed": 0,
  "execution_time": 1200
}
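
A sketch of client-side handling that branches on the HTTP status codes listed above and the success flag in the body:

import requests

resp = requests.post(
    "https://api.supacrawler.com/v1/parse",
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer YOUR_API_KEY",
    },
    json={"prompt": "Extract the title from https://example.com"},
)

if resp.status_code == 400:
    print("Invalid request: check the prompt field and schema syntax")
elif resp.status_code == 422:
    print("The AI could not extract meaningful data for this prompt")
elif resp.status_code >= 500:
    print("Server-side failure (LLM provider error or system error); retry later")
else:
    body = resp.json()
    if not body.get("success"):
        print("Parse failed:", body.get("error"))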

Graceful Degradation

The system automatically handles failures:

  • Crawl fails → Falls back to single page scraping
  • Schema validation fails → Returns raw extracted text with warning
  • URL inaccessible → Skips problematic URLs, continues with others
  • LLM timeout → Retries with exponential backoff (a client-side analogue is sketched below)
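
That retry behaviour is server-side; a client can apply the same idea to transient 5xx failures. A minimal sketch of exponential backoff (the delays and attempt count are illustrative):

import time
import requests

def parse_with_retry(payload: dict, attempts: int = 3) -> dict:
    """POST to /v1/parse, retrying server-side failures with exponential backoff."""
    delay = 1.0
    for _ in range(attempts):
        resp = requests.post(
            "https://api.supacrawler.com/v1/parse",
            headers={
                "Content-Type": "application/json",
                "Authorization": "Bearer YOUR_API_KEY",
            },
            json=payload,
            timeout=120,
        )
        if resp.status_code < 500:   # only retry 5xx responses
            return resp.json()
        time.sleep(delay)
        delay *= 2                   # 1s, 2s, 4s, ...
    resp.raise_for_status()          # surface the final failure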

LLM Model Configuration

Supported Models

| Provider | Models | Default | Configuration |
| --- | --- | --- | --- |
| Gemini | gemini-1.5-flash, gemini-1.5-pro | ✅ gemini-1.5-flash | GEMINI_API_KEY |
| OpenAI | gpt-4, gpt-4-turbo, gpt-3.5-turbo | gpt-4 | OPENAI_API_KEY |
| Claude | claude-3-sonnet, claude-3-haiku | claude-3-sonnet | CLAUDE_API_KEY |

Model Selection Strategy

The system automatically selects models based on task complexity:

  • Simple extraction → Faster, cheaper models (Gemini Flash, GPT-3.5)
  • Complex schemas → More capable models (GPT-4, Claude Sonnet)
  • Large content → Models with larger context windows

Best Practices

Prompt Design

  • Be Specific: "Extract product names and prices from the electronics category" vs "Get product info"
  • Include URLs: Always mention the target URLs in your prompt
  • Specify Format: Add "in CSV format" or "as JSON" to guide output formatting
  • Set Limits: Use max_pages to prevent excessive crawling

Schema Design

  • Use Proper Types: Specify "number", "boolean", "array" instead of just "string"
  • Mark Required Fields: Include "required" arrays for essential data
  • Nested Structures: Support complex data with nested objects and arrays
  • Validation Rules: Add "pattern", "minimum", "maximum" for data validation (see the example below)
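
For example, a schema combining these validation rules (the field names are illustrative):

{
  "type": "object",
  "properties": {
    "sku": { "type": "string", "pattern": "^[A-Z]{3}-[0-9]{4}$" },
    "price": { "type": "number", "minimum": 0 },
    "rating": { "type": "number", "minimum": 0, "maximum": 5 }
  },
  "required": ["sku", "price"]
}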

Performance Optimization

  • Streaming for Large Operations: Enable stream: true for 10+ pages
  • Reasonable Limits: Don't set max_pages higher than needed
  • Cache-Friendly: Similar prompts benefit from internal caching
  • Schema Reuse: Consistent schemas improve extraction accuracy

Rate Limits & Pricing

| Plan | Requests/minute | Requests/day | Pages/request |
| --- | --- | --- | --- |
| Free | 10 | 100 | 5 |
| Pro | 100 | 10,000 | 50 |
| Enterprise | 1,000 | 100,000 | 100 |

Rate Limit Headers:

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1640995200
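
A small helper that respects these headers by pausing when the remaining quota hits zero (X-RateLimit-Reset is a Unix timestamp, per the example above):

import time

def respect_rate_limit(resp) -> None:
    """Pause until the rate-limit window resets when the quota is exhausted."""
    remaining = int(resp.headers.get("X-RateLimit-Remaining", "1"))
    if remaining == 0:
        reset_at = int(resp.headers["X-RateLimit-Reset"])  # Unix timestamp
        time.sleep(max(0.0, reset_at - time.time()))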

Migration from Legacy Parse API

Before (Legacy HTML Parsing)

{
  "html_content": "<div>Product content...</div>",
  "user_prompt": "Extract product details",
  "output_spec": { "name": "string", "price": "number" },
  "output_format": "json"
}

After (Intelligent Workflow)

{
  "prompt": "Extract product details from https://shop.example.com/product",
  "schema": {
    "type": "object",
    "properties": {
      "name": { "type": "string" },
      "price": { "type": "number" }
    }
  },
  "output_format": "json"
}

Key Differences

  1. Prompt-driven: Natural language instructions instead of separate content + prompt
  2. URL-based: URLs in prompts instead of pre-scraped content
  3. Schema-based: Full JSON schema instead of simple type specifications (a conversion sketch follows below)
  4. Workflow-aware: Automatic scrape vs crawl decisions based on prompt analysis
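
The schema half of this migration can be mechanized. A sketch that maps a legacy output_spec onto the new schema parameter (it assumes output_spec values are plain JSON type names, as in the legacy example above):

def output_spec_to_schema(output_spec: dict) -> dict:
    """Convert a legacy output_spec like {"name": "string", "price": "number"}
    into a JSON schema for the new `schema` parameter."""
    return {
        "type": "object",
        "properties": {field: {"type": type_name}
                       for field, type_name in output_spec.items()},
    }

# {"type": "object", "properties": {"name": {"type": "string"},
#                                   "price": {"type": "number"}}}
print(output_spec_to_schema({"name": "string", "price": "number"}))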

Local Development & Testing

Test with Example Prompts

# Simple extraction
curl -X POST http://localhost:8081/v1/parse \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Extract the title and description from https://example.com"
  }'

# Bulk extraction  
curl -X POST http://localhost:8081/v1/parse \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Crawl https://news.ycombinator.com and get the top 10 story titles",
    "output_format": "csv",
    "max_pages": 10
  }'

# Complex schema
curl -X POST http://localhost:8081/v1/parse \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Extract all team member info from https://company.example.com/team",
    "schema": {
      "type": "object",
      "properties": {
        "team": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": { "type": "string" },
              "role": { "type": "string" },
              "bio": { "type": "string" }
            }
          }
        }
      }
    }
  }'

Environment Variables

# Required: LLM Provider API Keys
GEMINI_API_KEY=your_gemini_key
OPENAI_API_KEY=your_openai_key  # Optional
CLAUDE_API_KEY=your_claude_key  # Optional

# Optional: Service Configuration
PARSE_MAX_RETRIES=3
PARSE_TIMEOUT=30s
PARSE_DEFAULT_MODEL=gemini-1.5-flash

The Parse API represents a significant evolution in web data extraction, moving from static HTML parsing to intelligent, AI-driven workflows that understand natural language and automatically orchestrate complex data collection processes.
