Scrape
The Scrape API allows you to extract clean, structured content from any webpage. Get markdown, HTML, or plain text content along with metadata like title and status codes. When using format=links, it transforms into a powerful link discovery tool for site mapping.
Quick example
Extract clean markdown content from any webpage with a simple GET request:
curl -G https://api.supacrawler.com/api/v1/scrape \
-H "Authorization: Bearer YOUR_API_KEY" \
-d url="https://example.com" \
-d format="markdown"
Response
{
"success": true,
"url": "https://example.com",
"content": "# Example Domain\n\nThis domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.\n\n[More information...](https://www.iana.org/domains/example)",
"title": "Example Domain",
"metadata": {
"status_code": 200
}
}
The scrape response model
The scrape response contains the extracted content and metadata about the webpage.
Properties
- success (boolean): Indicates whether the scrape operation was successful.
- url (string): The URL that was scraped.
- content (string): The extracted content in the requested format (markdown, HTML, or text). Only present for content formats.
- title (string): The page title extracted from the webpage. Only present for content formats.
- links (array): Array of discovered links. Only present when format=links.
- discovered (integer): Number of links discovered. Only present when format=links.
- metadata (object): Additional metadata about the scrape operation.
- metadata.status_code (integer): HTTP status code returned by the target webpage.
- metadata.depth (integer): Crawl depth used. Only present for the links format.
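For example, you can pull individual fields out of the response on the command line with jq (a sketch; it assumes jq is installed and YOUR_API_KEY is replaced with a real key):
curl -G -s https://api.supacrawler.com/api/v1/scrape \
-H "Authorization: Bearer YOUR_API_KEY" \
-d url="https://example.com" \
-d format="markdown" \
| jq '{title: .title, status: .metadata.status_code}'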
Scrape a webpage
This endpoint allows you to scrape content from any publicly accessible webpage. The content is extracted and cleaned, removing ads, navigation, and other non-essential elements to give you the main content.
Required parameters
- url (string): The URL of the webpage to scrape. Must be a valid HTTP or HTTPS URL.
Optional parameters
- format (string): The format of the returned content. Options: markdown (default), html, text, links.
- render_js (boolean): Use a real browser for rendering (true) or a fast HTTP fetch (false, default). Enable for SPAs and JavaScript-heavy sites.
- wait (integer): Milliseconds to wait for JavaScript rendering. Only applies if render_js=true.
- device (string): Device type for rendering: desktop (default) or mobile. Only applies if render_js=true.
- depth (integer): How deep to crawl for links. Only applies when format=links.
- max_links (integer): Maximum number of links to return. Only applies when format=links.
- fresh (boolean): Skip the cache and get fresh content (true) or use cached content if available (false, default). Cached content expires after 5 minutes for regular scrapes and 15 minutes when render_js=true.
Request
curl -G https://api.supacrawler.com/api/v1/scrape \
-H "Authorization: Bearer YOUR_API_KEY" \
-d url="https://example.com" \
-d format="markdown" \
-d render_js=false
Response
{
"success": true,
"url": "https://example.com",
"content": "# Example Domain\n\nThis domain is for use in illustrative examples in documents...",
"title": "Example Domain",
"metadata": {
"status_code": 200
}
}
Map website links
When you set format=links, the scrape endpoint transforms into a link discovery tool that maps all links found on a webpage. This is useful for site mapping, link analysis, and discovering related content.
Required parameters
- url (string): The URL of the webpage to map links from.
- format (string): Must be set to links to enable link discovery mode.
Optional parameters
- depth (integer): How deep to crawl for links. Controls how many levels of links to follow from the starting page.
- max_links (integer): Maximum number of links to return. Controls the total number of links in the response.
Request
curl -G https://api.supacrawler.com/api/v1/scrape \
-H "Authorization: Bearer YOUR_API_KEY" \
-d url="https://example.com" \
-d format="links" \
-d depth=2 \
-d max_links=100
Response
{
"success": true,
"url": "https://example.com",
"links": [
"https://example.com/about",
"https://example.com/contact",
"https://example.com/products",
"https://example.com/blog"
],
"discovered": 4,
"metadata": {
"status_code": 200,
"depth": 2
}
}
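A common pattern is to map a site first and then scrape each discovered page. A minimal sketch using jq to iterate over the returned links (jq is assumed to be installed; the depth and max_links values are illustrative):
# Discover links, then scrape each one as markdown
curl -G -s https://api.supacrawler.com/api/v1/scrape \
-H "Authorization: Bearer YOUR_API_KEY" \
-d url="https://example.com" \
-d format="links" \
-d depth=1 \
-d max_links=10 \
| jq -r '.links[]' \
| while read -r link; do
  curl -G -s https://api.supacrawler.com/api/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d url="$link" \
  -d format="markdown"
done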
Content formats
The scrape API supports multiple output formats to suit different use cases:
Markdown (default)
Clean, structured markdown that preserves formatting while removing clutter. Perfect for content analysis and processing.
HTML
Raw HTML content with cleaning applied. Useful when you need to preserve specific HTML structure or styling.
Text
Plain text with all HTML markup removed. Ideal for text analysis, search indexing, or when you only need the textual content.
Links
Special format that returns all discovered links instead of page content. Great for site mapping and link analysis.
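If you are unsure which format fits your pipeline, you can compare them side by side. A small sketch that fetches the same page in each content format (the output filenames are illustrative):
# Fetch the same page as markdown, HTML, and plain text
for fmt in markdown html text; do
  curl -G -s https://api.supacrawler.com/api/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d url="https://example.com" \
  -d format="$fmt" > "example.$fmt.json"
done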
Error handling
The scrape API returns appropriate HTTP status codes and error messages:
- 400 Bad Request: Invalid URL or missing required parameters.
- 401 Unauthorized: Invalid or missing API key.
- 429 Too Many Requests: Rate limit exceeded. Check your plan limits.
- 500 Internal Server Error: Failed to scrape the target URL (may be blocked, unreachable, or invalid).
Error response example
{
"success": false,
"url": "https://invalid-url.com",
"error": "Invalid query parameters",
"metadata": {
"status_code": 400
}
}
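Since 429 indicates rate limiting and 500 can be a transient fetch failure, it often helps to retry with a backoff. A minimal sketch (the retry count and delays are illustrative, not official recommendations):
# Retry up to 3 times on 429 or 500, waiting longer each attempt
for attempt in 1 2 3; do
  status=$(curl -G -s -o response.json -w "%{http_code}" https://api.supacrawler.com/api/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d url="https://example.com" \
  -d format="markdown")
  if [ "$status" != "429" ] && [ "$status" != "500" ]; then break; fi
  sleep $((attempt * 5))
done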
Best practices
JavaScript-heavy sites
For websites that rely heavily on JavaScript (e.g., Single Page Applications built with React, Vue, or Svelte), you must enable the render_js parameter. This uses a full browser to execute JavaScript and return the fully rendered content. For simple, static sites, leaving render_js=false (the default) will be significantly faster.
curl -G https://api.supacrawler.com/api/v1/scrape \
-H "Authorization: Bearer YOUR_API_KEY" \
-d url="https://spa-example.com" \
-d format="markdown" \
-d render_js=true \
-d wait=3000
Mobile vs Desktop
Different devices can return different content. Use the device parameter to specify which one to emulate. This only has an effect when render_js=true.
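For example, to scrape the mobile version of a page:
curl -G https://api.supacrawler.com/api/v1/scrape \
-H "Authorization: Bearer YOUR_API_KEY" \
-d url="https://example.com" \
-d format="markdown" \
-d render_js=true \
-d device="mobile"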
Caching behavior
Supacrawler automatically caches scraped content to improve performance and reduce costs:
- Regular scrapes: Cached for 5 minutes
- Render scrapes: Cached for 15 minutes (due to higher computational cost)
- Fresh parameter: Use fresh=true to bypass the cache and get the latest content
# Default request: served from cache if scraped within the last 5 minutes
curl -G https://api.supacrawler.com/api/v1/scrape \
-H "Authorization: Bearer YOUR_API_KEY" \
-d url="https://news-site.com/article" \
-d format="markdown"
# Bypass the cache and fetch fresh content
curl -G https://api.supacrawler.com/api/v1/scrape \
-H "Authorization: Bearer YOUR_API_KEY" \
-d url="https://news-site.com/article" \
-d format="markdown" \
-d fresh=true
Rate limiting
- Free tier: 100 requests per hour
- Pro tier: 1,000 requests per hour
- Enterprise: Custom limits
Space out your requests appropriately to avoid rate limiting.
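For example, when scraping a batch of URLs you can add a short pause between requests (the one-second delay and urls.txt filename below are illustrative; pick values that match your plan's limits):
# Scrape a list of URLs with a pause between requests
while read -r url; do
  curl -G -s https://api.supacrawler.com/api/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d url="$url" \
  -d format="markdown"
  sleep 1
done < urls.txt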