Crawl
The Crawl API allows you to create and manage asynchronous crawling jobs. It is ideal for extracting content from multiple pages or entire websites, and for processing large amounts of data without waiting for an immediate response.
Quick example
Create a crawling job to extract content from multiple pages:
Start a basic crawl job for Supabase Docs
curl https://api.supacrawler.com/api/v1/crawl \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://supabase.com/docs",
"format": "markdown",
"depth": 2,
"link_limit": 10,
"include_subdomains": false
}'
Job created
{
"success": true,
"job_id": "550e8400-e29b-41d4-a716-446655440000",
}
Check the job status and retrieve the results:
curl https://api.supacrawler.com/api/v1/crawl/550e8400-e29b-41d4-a716-446655440000 \
-H "Authorization: Bearer YOUR_API_KEY"
Results when completed
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"type": "crawl",
"status": "completed",
"data": {
"url": "https://supabase.com/docs",
"crawl_data": {
"https://supabase.com/docs": {
"markdown": "# Example Domain\n\nThis domain is for use...",
"html": "<html><head><title>Example Domain</title></head>...",
"metadata": { "title": "Example Domain", "status_code": 200 }
},
"https://supabase.com/docs/about": {
"markdown": "# About Us\n\nWe are a company...",
"html": "<html><head><title>About Us</title></head>...",
"metadata": { "title": "About Us", "status_code": 200 }
}
},
"error_data": {
"https://supabase.com/docs/broken": "404 Not Found"
},
"statistics": {
"total_pages": 3,
"successful_pages": 2,
"failed_pages": 1
},
"render_js": false
}
}
The job response model
Job responses contain information about the job status and results when completed.
Properties
- job_id (string): Unique identifier for the job.
- type (string): The type of job. Currently only supports crawl.
- status (string): Current status of the job: processing, completed, or failed.
- data (object): Job results (only present when status is completed).
  - url (string): The original URL that was crawled.
  - crawl_data (object): Scraped data for each discovered page, keyed by URL. Each page contains:
    - markdown (string): Page content converted to markdown format.
    - html (string): Raw HTML content (only when the format includes HTML).
    - metadata (object): Page metadata including title and status code.
  - error_data (object): Error messages for failed URLs, keyed by URL.
  - statistics (object): Crawl statistics.
    - total_pages (integer): Total number of pages attempted.
    - successful_pages (integer): Number of pages successfully scraped.
    - failed_pages (integer): Number of pages that failed to scrape.
  - render_js (boolean): Whether JavaScript rendering was used for this crawl.
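To work with this model in practice, you can walk the URL-keyed crawl_data object once a job has completed. A minimal sketch, assuming jq is installed and YOUR_JOB_ID refers to a completed job:
List crawled pages with their titles and status codes
curl -s https://api.supacrawler.com/api/v1/crawl/YOUR_JOB_ID \
  -H "Authorization: Bearer YOUR_API_KEY" \
  | jq -r '.data.crawl_data
      | to_entries[]
      | "\(.key)  \(.value.metadata.title)  (\(.value.metadata.status_code))"'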
Create a crawl job
This endpoint creates a new crawling job that will discover and scrape multiple pages from a website. The job runs asynchronously, and you can check its status using the returned job ID.
Required attributes
- url (string): The starting URL for the crawl. Must be a valid HTTP or HTTPS URL.
- type (string): The type of job to create. Currently only crawl is supported.
Optional attributes
- format (string): Content format for scraped pages: markdown (default), html, or text.
- link_limit (integer): Maximum number of pages to crawl (default: 10). Controls how many total pages will be scraped.
- depth (integer): How deep to crawl from the starting URL (default: 2). Depth 1 = only the starting page, depth 2 = starting page + linked pages, etc.
- include_subdomains (boolean): Whether to include subdomains when crawling (default: false). If true, links to subdomains of the starting URL will also be crawled.
- patterns (array): Optional URL patterns to include during crawling (e.g., ["/blog/", "/docs/"]). Only URLs matching these patterns will be crawled.
- render_js (boolean): Enable JavaScript rendering for dynamic content (default: false). When true, pages will be rendered with a browser to capture JavaScript-generated content.
Request
curl https://api.supacrawler.com/api/v1/crawl \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://supabase.com/docs",
"type": "crawl",
"format": "markdown",
"link_limit": 50,
"depth": 2,
"include_subdomains": false,
"render_js": false,
"patterns": ["/blog/*", "/docs/*"]
}'
Response
{
"success": true,
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"type": "crawl",
"status": "processing",
"status_url": "/v1/crawl/550e8400-e29b-41d4-a716-446655440000"
}
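In a script you will typically capture the job_id from this response and reuse it for the status check. A minimal sketch, assuming jq is installed:
Create a job and capture its ID
job_id=$(curl -s https://api.supacrawler.com/api/v1/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://supabase.com/docs", "type": "crawl", "format": "markdown", "depth": 2, "link_limit": 50}' \
  | jq -r '.job_id')

# Check the job using the captured ID
curl -s "https://api.supacrawler.com/api/v1/crawl/$job_id" \
  -H "Authorization: Bearer YOUR_API_KEY"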
Get job status
This endpoint allows you to check the status of a job and retrieve results when the job is completed. Jobs typically take a few seconds to several minutes depending on the number of pages being crawled.
Path parameters
- id (string): The unique job ID returned when creating the job.
Request
curl https://api.supacrawler.com/api/v1/crawl/550e8400-e29b-41d4-a716-446655440000 \
-H "Authorization: Bearer YOUR_API_KEY"
Response (Processing)
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"type": "crawl",
"status": "processing"
}
Response (Completed)
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"type": "crawl",
"status": "completed",
"data": {
"url": "https://supabase.com/docs",
"crawl_data": {
"https://supabase.com/docs": {
"markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
"html": "<html><head><title>Example Domain</title></head>...",
"metadata": {
"title": "Example Domain",
"status_code": 200
}
},
"https://supabase.com/docs/about": {
"markdown": "# About Us\n\nWe are a company that...",
"html": "<html><head><title>About Us</title></head>...",
"metadata": {
"title": "About Us",
"status_code": 200
}
}
},
"error_data": {
"https://supabase.com/docs/broken": "404 Not Found"
},
"statistics": {
"total_pages": 3,
"successful_pages": 2,
"failed_pages": 1
},
"render_js": false
}
}
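One common way to consume a completed response is to write each page's markdown to a local file, using the URL keys of crawl_data. A rough sketch, assuming jq is installed and the job has completed (the filename scheme is illustrative only):
Save each crawled page's markdown to disk
response=$(curl -s https://api.supacrawler.com/api/v1/crawl/YOUR_JOB_ID \
  -H "Authorization: Bearer YOUR_API_KEY")

# Iterate over the URL keys in crawl_data and write one .md file per page
echo "$response" | jq -r '.data.crawl_data | keys[]' | while read -r url; do
  file="$(echo "$url" | sed 's|^https://||; s|^http://||; s|[/:]|_|g').md"
  echo "$response" | jq -r --arg u "$url" '.data.crawl_data[$u].markdown' > "$file"
  echo "Wrote $file"
done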
Job lifecycle
Understanding the job lifecycle helps you integrate crawling jobs effectively:
1. Job Creation
When you create a job, it's immediately queued for processing and returns a processing status.
2. Processing
The job discovers links on the starting page, then crawls each discovered page up to the specified depth and page limits.
3. Completion
Once all pages are crawled, the job status changes to completed and results become available.
4. Failure
If the job encounters critical errors (like the starting URL being unreachable), it will have a failed status.
Job retention
- Active jobs: Jobs in processing status are kept in memory for fast status checks
- Completed jobs: Results are stored for 24 hours after completion
- Failed jobs: Error information is kept for 1 hour for debugging
After these periods, you'll need to create a new job to re-crawl the content.
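For example, if a stored job ID has expired, the status endpoint responds with 404 Not Found and you can simply submit a new job. A rough sketch (OLD_JOB_ID stands in for a previously saved job ID):
Re-crawl when a stored job ID has expired
code=$(curl -s -o /dev/null -w '%{http_code}' \
  https://api.supacrawler.com/api/v1/crawl/OLD_JOB_ID \
  -H "Authorization: Bearer YOUR_API_KEY")

if [ "$code" = "404" ]; then
  # The old job is gone; start a fresh crawl of the same URL
  curl https://api.supacrawler.com/api/v1/crawl \
    -H "Authorization: Bearer YOUR_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"url": "https://supabase.com/docs", "type": "crawl"}'
fi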
Best practices
Polling for results
Check job status every 5-10 seconds rather than continuously polling:
Polling example
# Check status periodically
while true; do
response=$(curl -s https://api.supacrawler.com/api/v1/crawl/YOUR_JOB_ID \
-H "Authorization: Bearer YOUR_API_KEY")
status=$(echo "$response" | jq -r '.status')
if [ "$status" = "completed" ] || [ "$status" = "failed" ]; then
echo "$response" | jq .
break
fi
echo "Job still processing..."
sleep 10
done
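To keep the loop from running forever (for example, if a stored job ID has expired), you can also cap the number of attempts. A variation on the loop above:
Polling with a maximum number of attempts
attempts=0
max_attempts=30   # about 5 minutes at a 10-second interval

while [ "$attempts" -lt "$max_attempts" ]; do
  response=$(curl -s https://api.supacrawler.com/api/v1/crawl/YOUR_JOB_ID \
    -H "Authorization: Bearer YOUR_API_KEY")
  status=$(echo "$response" | jq -r '.status')

  if [ "$status" = "completed" ] || [ "$status" = "failed" ]; then
    echo "$response" | jq .
    break
  fi

  attempts=$((attempts + 1))
  echo "Job still processing... (attempt $attempts/$max_attempts)"
  sleep 10
done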
Optimizing crawl settings
- Start with smaller depth and link_limit values to test (see the sketch after this list)
- Use depth: 1 for single-page extraction
- Use depth: 2-3 for section crawling
- Use depth: 4 only for comprehensive site crawling
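For instance, a small test crawl like the one below (the values are arbitrary) lets you confirm the output looks right before scaling up:
Small test crawl before scaling up
curl https://api.supacrawler.com/api/v1/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://supabase.com/docs",
    "type": "crawl",
    "depth": 1,
    "link_limit": 5
  }'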
Using URL patterns
Filter crawled pages using the patterns parameter:
curl https://api.supacrawler.com/api/v1/crawl \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://supabase.com/docs",
"type": "crawl",
"depth": 3,
"link_limit": 100,
"patterns": ["/blog/*", "/docs/*", "/help/*"]
}'
Error handling
- 400 Bad Request: Invalid job parameters or missing required fields.
- 401 Unauthorized: Invalid or missing API key.
- 404 Not Found: Job ID not found or expired.
- 429 Too Many Requests: Rate limit exceeded or too many concurrent jobs (see the retry sketch below).
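A 429 usually means you should slow down and retry. A rough sketch of a retry with backoff (the wait times are arbitrary):
Retry job creation after a 429
for delay in 5 15 30; do
  code=$(curl -s -o response.json -w '%{http_code}' \
    https://api.supacrawler.com/api/v1/crawl \
    -H "Authorization: Bearer YOUR_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"url": "https://supabase.com/docs", "type": "crawl"}')

  if [ "$code" != "429" ]; then
    jq . response.json
    break
  fi

  echo "Rate limited, retrying in ${delay}s..."
  sleep "$delay"
done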
Error response example
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"type": "crawl",
"status": "failed"
}