Crawl

The Crawl API allows you to create and manage asynchronous crawling jobs. It is ideal for extracting content from multiple pages or entire websites, and for processing large amounts of data without waiting for an immediate response.

Quick example

Create a crawling job to extract content from multiple pages:

Start a basic crawl job for Supabase Docs

curl https://api.supacrawler.com/api/v1/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://supabase.com/docs",
    "type": "crawl",
    "format": "markdown",
    "depth": 2,
    "link_limit": 10,
    "include_subdomains": false
  }'

Job created

{
  "success": true,
  "job_id": "550e8400-e29b-41d4-a716-446655440000"
}

Check the job status

curl https://api.supacrawler.com/api/v1/crawl/550e8400-e29b-41d4-a716-446655440000 \
  -H "Authorization: Bearer YOUR_API_KEY"

Results when completed

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "type": "crawl",
  "status": "completed",
  "data": {
    "url": "https://supabase.com/docs",
    "crawl_data": {
      "https://supabase.com/docs": {
        "markdown": "# Example Domain\n\nThis domain is for use...",
        "html": "<html><head><title>Example Domain</title></head>...",
        "metadata": { "title": "Example Domain", "status_code": 200 }
      },
      "https://supabase.com/docs/about": {
        "markdown": "# About Us\n\nWe are a company...",
        "html": "<html><head><title>About Us</title></head>...",
        "metadata": { "title": "About Us", "status_code": 200 }
      }
    },
    "error_data": {
      "https://supabase.com/docs/broken": "404 Not Found"
    },
    "statistics": {
      "total_pages": 3,
      "successful_pages": 2,
      "failed_pages": 1
    },
    "render_js": false
  }
}

The job response model

Job responses contain information about the job status and results when completed.

Properties

  • job_id (string): Unique identifier for the job.
  • type (string): The type of job. Currently only crawl is supported.
  • status (string): Current status of the job: processing, completed, or failed.
  • data (object): Job results (only present when status is completed).
    • url (string): The original URL that was crawled.
    • crawl_data (object): Scraped data for each discovered page, keyed by URL. Each page contains:
      • markdown (string): Page content converted to markdown format.
      • html (string): Raw HTML content (only when the format includes HTML).
      • metadata (object): Page metadata including title and status code.
    • error_data (object): Error messages for failed URLs, keyed by URL.
    • statistics (object): Crawl statistics.
      • total_pages (integer): Total number of pages attempted.
      • successful_pages (integer): Number of pages successfully scraped.
      • failed_pages (integer): Number of pages that failed to scrape.
    • render_js (boolean): Whether JavaScript rendering was used for this crawl.
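
Since crawl_data is keyed by URL, iterating over the results is straightforward with a JSON tool such as jq. A minimal sketch, assuming the completed-job response shown above has been saved to response.json (a hypothetical filename):

# List every URL that was successfully crawled
jq -r '.data.crawl_data | keys[]' response.json

# Extract the markdown for a single page (key taken from the example response above)
jq -r '.data.crawl_data["https://supabase.com/docs"].markdown' response.json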


POST /v1/crawl

Create a crawl job

This endpoint creates a new crawling job that will discover and scrape multiple pages from a website. The job runs asynchronously, and you can check its status using the returned job ID.

Required attributes

  • url (string): The starting URL for the crawl. Must be a valid HTTP or HTTPS URL.
  • type (string): The type of job to create. Currently only crawl is supported.

Optional attributes

  • format (string): Content format for scraped pages: markdown (default), html, or text.
  • link_limit (integer): Maximum number of pages to crawl (default: 10). Controls how many total pages will be scraped.
  • depth (integer): How deep to crawl from the starting URL (default: 2). Depth 1 = only the starting page, depth 2 = starting page + linked pages, etc.
  • include_subdomains (boolean): Whether to include subdomains when crawling (default: false). If true, links to subdomains of the starting URL are also crawled.
  • patterns (array): Optional URL patterns to include during crawling (e.g., ["/blog/", "/docs/"]). Only URLs matching these patterns will be crawled.
  • render_js (boolean): Enable JavaScript rendering for dynamic content (default: false). When true, pages are rendered with a browser to capture JavaScript-generated content.

Request

POST
/v1/crawl
curl https://api.supacrawler.com/api/v1/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://supabase.com/docs",
    "type": "crawl",
    "format": "markdown",
    "link_limit": 50,
    "depth": 2,
    "include_subdomains": false,
    "render_js": false,
    "patterns": ["/blog/*", "/docs/*"]
  }'

Response

{
  "success": true,
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "type": "crawl",
  "status": "processing",
  "status_url": "/v1/crawl/550e8400-e29b-41d4-a716-446655440000"
}

GET /v1/crawl/{id}

Get job status

This endpoint allows you to check the status of a job and retrieve results when the job is completed. Jobs typically take a few seconds to several minutes depending on the number of pages being crawled.

Path parameters

  • id (string): The unique job ID returned when creating the job.

Request

GET
/v1/crawl/{id}
curl https://api.supacrawler.com/api/v1/crawl/550e8400-e29b-41d4-a716-446655440000 \
  -H "Authorization: Bearer YOUR_API_KEY"

Response (Processing)

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "type": "crawl",
  "status": "processing"
}

Response (Completed)

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "type": "crawl",
  "status": "completed",
  "data": {
    "url": "https://supabase.com/docs",
    "crawl_data": {
      "https://supabase.com/docs": {
        "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
        "html": "<html><head><title>Example Domain</title></head>...",
        "metadata": {
          "title": "Example Domain",
          "status_code": 200
        }
      },
      "https://supabase.com/docs/about": {
        "markdown": "# About Us\n\nWe are a company that...",
        "html": "<html><head><title>About Us</title></head>...",
        "metadata": {
          "title": "About Us",
          "status_code": 200
        }
      }
    },
    "error_data": {
      "https://supabase.com/docs/broken": "404 Not Found"
    },
    "statistics": {
      "total_pages": 3,
      "successful_pages": 2,
      "failed_pages": 1
    },
    "render_js": false
  }
}

Job lifecycle

Understanding the job lifecycle helps you integrate crawling jobs effectively:

1. Job Creation

When you create a job, it's immediately queued for processing and returns a processing status.

2. Processing

The job discovers links on the starting page, then crawls each discovered page up to the specified depth and page limits.

3. Completion

Once all pages are crawled, the job status changes to completed and results become available.

4. Failure

If the job encounters critical errors (like the starting URL being unreachable), it will have a failed status.
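
Putting these steps together, a minimal create-and-check sketch with curl and jq might look like the following (YOUR_API_KEY is a placeholder; the polling loop under Best practices below shows how to repeat the status check until the job finishes):

# Create a job and capture its ID from the creation response
job_id=$(curl -s https://api.supacrawler.com/api/v1/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://supabase.com/docs", "type": "crawl", "depth": 2, "link_limit": 10}' \
  | jq -r '.job_id')

# Check the current status; repeat until it is completed or failed
curl -s https://api.supacrawler.com/api/v1/crawl/"$job_id" \
  -H "Authorization: Bearer YOUR_API_KEY" | jq -r '.status'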


Job retention

  • Active jobs: Jobs in processing status are kept in memory for fast status checks
  • Completed jobs: Results are stored for 24 hours after completion
  • Failed jobs: Error information is kept for 1 hour for debugging

After these periods, you'll need to create a new job to re-crawl the content.
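
Because results expire, it is worth persisting them locally as soon as a job completes. A minimal sketch with jq, again assuming the completed-job response has been saved to response.json (a hypothetical filename):

# Save the per-page content and the error map before the retention window ends
jq '.data.crawl_data' response.json > crawl_data.json
jq '.data.error_data' response.json > error_data.json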


Best practices

Polling for results

Check job status every 5-10 seconds rather than continuously polling:

Polling example

# Check status periodically
while true; do
  response=$(curl -s https://api.supacrawler.com/api/v1/crawl/YOUR_JOB_ID \
    -H "Authorization: Bearer YOUR_API_KEY")

  status=$(echo "$response" | jq -r '.status')

  if [ "$status" = "completed" ] || [ "$status" = "failed" ]; then
    echo "$response" | jq .
    break
  fi

  echo "Job still processing..."
  sleep 10
done

Optimizing crawl settings

  • Start with smaller depth and link_limit values to test (see the example below)
  • Use depth: 1 for single-page extraction
  • Use depth: 2-3 for section crawling
  • Use depth: 4 only for comprehensive site crawling
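
For example, a conservative test crawl might use depth 1 and a small link_limit so the job finishes quickly while you verify the output (same endpoint and parameters documented above):

curl https://api.supacrawler.com/api/v1/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://supabase.com/docs",
    "type": "crawl",
    "format": "markdown",
    "depth": 1,
    "link_limit": 5
  }'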

Using URL patterns

Filter crawled pages using the patterns parameter:

curl https://api.supacrawler.com/api/v1/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://supabase.com/docs",
    "type": "crawl",
    "depth": 3,
    "link_limit": 100,
    "patterns": ["/blog/*", "/docs/*", "/help/*"]
  }'

Error handling

  • 400 Bad Request: Invalid job parameters or missing required fields.
  • 401 Unauthorized: Invalid or missing API key.
  • 404 Not Found: Job ID not found or expired.
  • 429 Too Many Requests: Rate limit exceeded or too many concurrent jobs.
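
One way to branch on these status codes in a script is to have curl print the HTTP code alongside the response body. A minimal sketch (YOUR_JOB_ID and YOUR_API_KEY are placeholders; the -w flag is standard curl):

# Capture the response body and HTTP status code in one request
response=$(curl -s -w "\n%{http_code}" \
  https://api.supacrawler.com/api/v1/crawl/YOUR_JOB_ID \
  -H "Authorization: Bearer YOUR_API_KEY")
http_code=$(echo "$response" | tail -n1)
body=$(echo "$response" | sed '$d')

case "$http_code" in
  200) echo "$body" | jq . ;;
  404) echo "Job not found or expired - create a new job" ;;
  429) echo "Rate limited - wait before retrying" ;;
  *)   echo "Request failed with status $http_code" ;;
esac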

Error response example

If a job fails during processing, polling it returns a failed status:

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "type": "crawl",
  "status": "failed"
}
