Crawl

The Crawl API allows you to create and manage asynchronous crawling jobs. It is ideal for extracting content from multiple pages or entire websites, and for processing large amounts of data without waiting for an immediate response.

Quick example

Create a crawling job to extract content from multiple pages:

Start a basic crawl job for Supabase Docs

curl https://api.supacrawler.com/api/v1/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://supabase.com/docs",
    "type": "crawl",
    "format": "markdown",
    "depth": 2,
    "link_limit": 10,
    "include_subdomains": false
  }'

Job created

{
  "success": true,
  "job_id": "550e8400-e29b-41d4-a716-446655440000"
}

Check the job status

curl https://api.supacrawler.com/api/v1/crawl/550e8400-e29b-41d4-a716-446655440000 \
  -H "Authorization: Bearer YOUR_API_KEY"

Results when completed

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "type": "crawl",
  "status": "completed",
  "data": {
    "url": "https://supabase.com/docs",
    "crawl_data": {
      "https://supabase.com/docs": {
        "markdown": "# Example Domain\n\nThis domain is for use...",
        "html": "<html><head><title>Example Domain</title></head>...",
        "metadata": { "title": "Example Domain", "status_code": 200 }
      },
      "https://supabase.com/docs/about": {
        "markdown": "# About Us\n\nWe are a company...",
        "html": "<html><head><title>About Us</title></head>...",
        "metadata": { "title": "About Us", "status_code": 200 }
      }
    },
    "error_data": {
      "https://supabase.com/docs/broken": "404 Not Found"
    },
    "statistics": {
      "total_pages": 3,
      "successful_pages": 2,
      "failed_pages": 1
    },
    "render_js": false
  }
}

The job response model

Job responses contain information about the job status and results when completed.

Properties

  • job_id (string): Unique identifier for the job.
  • type (string): The type of job. Currently only crawl is supported.
  • status (string): Current status of the job: processing, completed, or failed.
  • data (object): Job results (only present when status is completed).
    • url (string): The original URL that was crawled.
    • crawl_data (object): Scraped data for each discovered page, keyed by URL. Each page contains:
      • markdown (string): Page content converted to markdown format.
      • html (string): Raw HTML content (only when the format includes HTML).
      • metadata (object): Page metadata including title and status code.
    • error_data (object): Error messages for failed URLs, keyed by URL.
    • statistics (object): Crawl statistics.
      • total_pages (integer): Total number of pages attempted.
      • successful_pages (integer): Number of pages successfully scraped.
      • failed_pages (integer): Number of pages that failed to scrape.
    • render_js (boolean): Whether JavaScript rendering was used for this crawl.
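
Since crawl_data is keyed by URL, iterating over the results is straightforward with a JSON tool such as jq. A minimal sketch, assuming the completed-job response shown above has been saved to response.json (a hypothetical filename):

# List every URL that was successfully crawled
jq -r '.data.crawl_data | keys[]' response.json

# Extract the markdown for a single page (key taken from the example response above)
jq -r '.data.crawl_data["https://supabase.com/docs"].markdown' response.json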


POST /v1/crawl

Create a crawl job

This endpoint creates a new crawling job that will discover and scrape multiple pages from a website. The job runs asynchronously, and you can check its status using the returned job ID.

Required attributes

  • url (string): The starting URL for the crawl. Must be a valid HTTP or HTTPS URL.
  • type (string): The type of job to create. Currently only crawl is supported.

Optional attributes

  • format (string): Content format for scraped pages: markdown (default), html, or text.
  • link_limit (integer): Maximum number of pages to crawl (default: 10). Controls how many total pages will be scraped.
  • depth (integer): How deep to crawl from the starting URL (default: 2). Depth 1 = only the starting page, depth 2 = starting page + linked pages, etc.
  • include_subdomains (boolean): Whether to include subdomains when crawling (default: false). If true, links to subdomains of the starting URL are also crawled.
  • patterns (array): Optional URL patterns to include during crawling (e.g., ["/blog/", "/docs/"]). Only URLs matching these patterns will be crawled.
  • render_js (boolean): Enable JavaScript rendering for dynamic content (default: false). When true, pages are rendered with a browser to capture JavaScript-generated content.

Request

POST
/v1/crawl
curl https://api.supacrawler.com/api/v1/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://supabase.com/docs",
    "type": "crawl",
    "format": "markdown",
    "link_limit": 50,
    "depth": 2,
    "include_subdomains": false,
    "render_js": false,
    "patterns": ["/blog/*", "/docs/*"]
  }'

Response

{
  "success": true,
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "type": "crawl",
  "status": "processing",
  "status_url": "/v1/crawl/550e8400-e29b-41d4-a716-446655440000"
}

GET /v1/crawl/{id}

Get job status

This endpoint allows you to check the status of a job and retrieve results when the job is completed. Jobs typically take a few seconds to several minutes depending on the number of pages being crawled.

Path parameters

  • id (string): The unique job ID returned when creating the job.

Request

GET
/v1/crawl/{id}
curl https://api.supacrawler.com/api/v1/crawl/550e8400-e29b-41d4-a716-446655440000 \
  -H "Authorization: Bearer YOUR_API_KEY"

Response (Processing)

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "type": "crawl",
  "status": "processing"
}

Response (Completed)

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "type": "crawl",
  "status": "completed",
  "data": {
    "url": "https://supabase.com/docs",
    "crawl_data": {
      "https://supabase.com/docs": {
        "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
        "html": "<html><head><title>Example Domain</title></head>...",
        "metadata": {
          "title": "Example Domain",
          "status_code": 200
        }
      },
      "https://supabase.com/docs/about": {
        "markdown": "# About Us\n\nWe are a company that...",
        "html": "<html><head><title>About Us</title></head>...",
        "metadata": {
          "title": "About Us",
          "status_code": 200
        }
      }
    },
    "error_data": {
      "https://supabase.com/docs/broken": "404 Not Found"
    },
    "statistics": {
      "total_pages": 3,
      "successful_pages": 2,
      "failed_pages": 1
    },
    "render_js": false
  }
}

Job lifecycle

Understanding the job lifecycle helps you integrate crawling jobs effectively:

1. Job Creation

When you create a job, it's immediately queued for processing and returns a processing status.

2. Processing

The job discovers links on the starting page, then crawls each discovered page up to the specified depth and page limits.

3. Completion

Once all pages are crawled, the job status changes to completed and results become available.

4. Failure

If the job encounters critical errors (like the starting URL being unreachable), it will have a failed status.
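
Putting these steps together, a minimal create-and-check sketch with curl and jq might look like the following (YOUR_API_KEY is a placeholder; the polling loop under Best practices below shows how to repeat the status check until the job finishes):

# Create a job and capture its ID from the creation response
job_id=$(curl -s https://api.supacrawler.com/api/v1/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://supabase.com/docs", "type": "crawl", "depth": 2, "link_limit": 10}' \
  | jq -r '.job_id')

# Check the current status; repeat until it is completed or failed
curl -s https://api.supacrawler.com/api/v1/crawl/"$job_id" \
  -H "Authorization: Bearer YOUR_API_KEY" | jq -r '.status'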


Job retention

  • Active jobs: Jobs in processing status are kept in memory for fast status checks
  • Completed jobs: Results are stored for 24 hours after completion
  • Failed jobs: Error information is kept for 1 hour for debugging

After these periods, you'll need to create a new job to re-crawl the content.
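
Because results expire, it is worth persisting them locally as soon as a job completes. A minimal sketch with jq, again assuming the completed-job response has been saved to response.json (a hypothetical filename):

# Save the per-page content and the error map before the retention window ends
jq '.data.crawl_data' response.json > crawl_data.json
jq '.data.error_data' response.json > error_data.json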


Best practices

Polling for results

Check job status every 5-10 seconds rather than continuously polling:

Polling example

# Check status periodically
while true; do
  response=$(curl -s https://api.supacrawler.com/api/v1/crawl/YOUR_JOB_ID \
    -H "Authorization: Bearer YOUR_API_KEY")

  status=$(echo "$response" | jq -r '.status')

  if [ "$status" = "completed" ] || [ "$status" = "failed" ]; then
    echo "$response" | jq .
    break
  fi

  echo "Job still processing..."
  sleep 10
done

Optimizing crawl settings

  • Start with smaller depth and link_limit values to test (see the example below)
  • Use depth: 1 for single-page extraction
  • Use depth: 2-3 for section crawling
  • Use depth: 4 only for comprehensive site crawling
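
For example, a conservative test crawl might use depth 1 and a small link_limit so the job finishes quickly while you verify the output (same endpoint and parameters documented above):

curl https://api.supacrawler.com/api/v1/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://supabase.com/docs",
    "type": "crawl",
    "format": "markdown",
    "depth": 1,
    "link_limit": 5
  }'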

Using URL patterns

Filter crawled pages using the patterns parameter:

curl https://api.supacrawler.com/api/v1/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://supabase.com/docs",
    "type": "crawl",
    "depth": 3,
    "link_limit": 100,
    "patterns": ["/blog/*", "/docs/*", "/help/*"]
  }'

Error handling

  • 400 Bad Request: Invalid job parameters or missing required fields.
  • 401 Unauthorized: Invalid or missing API key.
  • 404 Not Found: Job ID not found or expired.
  • 429 Too Many Requests: Rate limit exceeded or too many concurrent jobs.
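
One way to branch on these status codes in a script is to have curl print the HTTP code alongside the response body. A minimal sketch (YOUR_JOB_ID and YOUR_API_KEY are placeholders; the -w flag is standard curl):

# Capture the response body and HTTP status code in one request
response=$(curl -s -w "\n%{http_code}" \
  https://api.supacrawler.com/api/v1/crawl/YOUR_JOB_ID \
  -H "Authorization: Bearer YOUR_API_KEY")
http_code=$(echo "$response" | tail -n1)
body=$(echo "$response" | sed '$d')

case "$http_code" in
  200) echo "$body" | jq . ;;
  404) echo "Job not found or expired - create a new job" ;;
  429) echo "Rate limited - wait before retrying" ;;
  *)   echo "Request failed with status $http_code" ;;
esac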

Error response example

If a job fails during processing, polling it returns a failed status:

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "type": "crawl",
  "status": "failed"
}
