Scrape
The Scrape API allows you to extract clean, structured content from any webpage. Get markdown, HTML, or plain text content along with metadata like title and status codes. When using format=links, it transforms into a powerful link discovery tool for site mapping.
Quick example
Extract clean markdown content from any webpage with a simple GET request:
curl -G https://api.supacrawler.com/api/v1/scrape \
-H "Authorization: Bearer YOUR_API_KEY" \
-d url="https://example.com" \
-d format="markdown"
Response
{
"success": true,
"url": "https://example.com",
"content": "# Example Domain\n\nThis domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.\n\n[More information...](https://www.iana.org/domains/example)",
"title": "Example Domain",
"metadata": {
"status_code": 200
}
}
The scrape response model
The scrape response contains the extracted content and metadata about the webpage.
Properties
- success (boolean): Indicates whether the scrape operation was successful.
- url (string): The URL that was scraped.
- content (string): The extracted content in the requested format (markdown, HTML, or text). Only present for content formats.
- title (string): The page title extracted from the webpage. Only present for content formats.
- links (array): Array of discovered links. Only present when format=links.
- discovered (integer): Number of links discovered. Only present when format=links.
- metadata (object): Additional metadata about the scrape operation.
- metadata.status_code (integer): HTTP status code returned by the target webpage.
- metadata.depth (integer): Crawl depth used. Only present for the links format.
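For example, you can pull individual fields out of the response on the command line with jq (a sketch; it assumes jq is installed and YOUR_API_KEY is replaced with a real key):
curl -G -s https://api.supacrawler.com/api/v1/scrape \
-H "Authorization: Bearer YOUR_API_KEY" \
-d url="https://example.com" \
-d format="markdown" \
| jq '{title: .title, status: .metadata.status_code}'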
Scrape a webpage
This endpoint allows you to scrape content from any publicly accessible webpage. The content is extracted and cleaned, removing ads, navigation, and other non-essential elements to give you the main content.
Required parameters
- url (string): The URL of the webpage to scrape. Must be a valid HTTP or HTTPS URL.
Optional parameters
- format (string): The format of the returned content. Options: markdown (default), html, text, links.
- render_js (boolean): Use a real browser for rendering (true) or a fast HTTP fetch (false, default). Enable for SPAs and JavaScript-heavy sites.
- wait (integer): Milliseconds to wait for JavaScript rendering. Only applies if render_js=true.
- device (string): Device type for rendering: desktop (default) or mobile. Only applies if render_js=true.
- depth (integer): How deep to crawl for links. Only applies when format=links.
- max_links (integer): Maximum number of links to return. Only applies when format=links.
- fresh (boolean): Skip the cache and get fresh content (true) or use cached content if available (false, default). Cached content expires after 5 minutes for regular scrapes and 15 minutes when render_js=true.
Request
curl -G https://api.supacrawler.com/api/v1/scrape \
-H "Authorization: Bearer YOUR_API_KEY" \
-d url="https://example.com" \
-d format="markdown" \
-d render_js=false
Response
{
"success": true,
"url": "https://example.com",
"content": "# Example Domain\n\nThis domain is for use in illustrative examples in documents...",
"title": "Example Domain",
"metadata": {
"status_code": 200
}
}
Map website links
When you set format=links, the scrape endpoint transforms into a link discovery tool that maps all links found on a webpage. This is useful for site mapping, link analysis, and discovering related content.
Required parameters
- url (string): The URL of the webpage to map links from.
- format (string): Must be set to links to enable link discovery mode.
Optional parameters
- depth (integer): How deep to crawl for links. Controls how many levels of links to follow from the starting page.
- max_links (integer): Maximum number of links to return. Controls the total number of links in the response.
Request
curl -G https://api.supacrawler.com/api/v1/scrape \
-H "Authorization: Bearer YOUR_API_KEY" \
-d url="https://example.com" \
-d format="links" \
-d depth=2 \
-d max_links=100
Response
{
"success": true,
"url": "https://example.com",
"links": [
"https://example.com/about",
"https://example.com/contact",
"https://example.com/products",
"https://example.com/blog"
],
"discovered": 4,
"metadata": {
"status_code": 200,
"depth": 2
}
}
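A common pattern is to map a site first and then scrape each discovered page. A minimal sketch using jq to iterate over the returned links (jq is assumed to be installed; the depth and max_links values are illustrative):
# Discover links, then scrape each one as markdown
curl -G -s https://api.supacrawler.com/api/v1/scrape \
-H "Authorization: Bearer YOUR_API_KEY" \
-d url="https://example.com" \
-d format="links" \
-d depth=1 \
-d max_links=10 \
| jq -r '.links[]' \
| while read -r link; do
  curl -G -s https://api.supacrawler.com/api/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d url="$link" \
  -d format="markdown"
done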
Content formats
The scrape API supports multiple output formats to suit different use cases:
Markdown (default)
Clean, structured markdown that preserves formatting while removing clutter. Perfect for content analysis and processing.
HTML
Raw HTML content with cleaning applied. Useful when you need to preserve specific HTML structure or styling.
Text
Plain text with all HTML markup removed. Ideal for text analysis, search indexing, or when you only need the textual content.
Links
Special format that returns all discovered links instead of page content. Great for site mapping and link analysis.
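If you are unsure which format fits your pipeline, you can compare them side by side. A small sketch that fetches the same page in each content format (the output filenames are illustrative):
# Fetch the same page as markdown, HTML, and plain text
for fmt in markdown html text; do
  curl -G -s https://api.supacrawler.com/api/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d url="https://example.com" \
  -d format="$fmt" > "example.$fmt.json"
done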
Error handling
The scrape API returns appropriate HTTP status codes and error messages:
- 400 Bad Request: Invalid URL or missing required parameters.
- 401 Unauthorized: Invalid or missing API key.
- 429 Too Many Requests: Rate limit exceeded. Check your plan limits.
- 500 Internal Server Error: Failed to scrape the target URL (may be blocked, unreachable, or invalid).
Error response example
{
"success": false,
"url": "https://invalid-url.com",
"error": "Invalid query parameters",
"metadata": {
"status_code": 400
}
}
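Since 429 indicates rate limiting and 500 can be a transient fetch failure, it often helps to retry with a backoff. A minimal sketch (the retry count and delays are illustrative, not official recommendations):
# Retry up to 3 times on 429 or 500, waiting longer each attempt
for attempt in 1 2 3; do
  status=$(curl -G -s -o response.json -w "%{http_code}" https://api.supacrawler.com/api/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d url="https://example.com" \
  -d format="markdown")
  if [ "$status" != "429" ] && [ "$status" != "500" ]; then break; fi
  sleep $((attempt * 5))
done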
Best practices
JavaScript-heavy sites
For websites that rely heavily on JavaScript (e.g., Single Page Applications built with React, Vue, or Svelte), you must enable the render_js parameter. This uses a full browser to execute JavaScript and return the fully rendered content. For simple, static sites, leaving render_js=false (the default) will be significantly faster.
curl -G https://api.supacrawler.com/api/v1/scrape \
-H "Authorization: Bearer YOUR_API_KEY" \
-d url="https://spa-example.com" \
-d format="markdown" \
-d render_js=true \
-d wait=3000
Mobile vs Desktop
Different devices can return different content. Use the device parameter to specify which one to emulate. This only has an effect when render_js=true.
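For example, to scrape the mobile version of a page:
curl -G https://api.supacrawler.com/api/v1/scrape \
-H "Authorization: Bearer YOUR_API_KEY" \
-d url="https://example.com" \
-d format="markdown" \
-d render_js=true \
-d device="mobile"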
Caching behavior
Supacrawler automatically caches scraped content to improve performance and reduce costs:
- Regular scrapes: Cached for 5 minutes
- Render scrapes: Cached for 15 minutes (due to higher computational cost)
- Fresh parameter: Use fresh=true to bypass the cache and get the latest content
# Default request: served from cache if scraped within the last 5 minutes
curl -G https://api.supacrawler.com/api/v1/scrape \
-H "Authorization: Bearer YOUR_API_KEY" \
-d url="https://news-site.com/article" \
-d format="markdown"
# Bypass the cache and fetch fresh content
curl -G https://api.supacrawler.com/api/v1/scrape \
-H "Authorization: Bearer YOUR_API_KEY" \
-d url="https://news-site.com/article" \
-d format="markdown" \
-d fresh=true
Rate limiting
- Free tier: 100 requests per hour
- Pro tier: 1,000 requests per hour
- Enterprise: Custom limits
Space out your requests appropriately to avoid rate limiting.
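For example, when scraping a batch of URLs you can add a short pause between requests (the one-second delay and urls.txt filename below are illustrative; pick values that match your plan's limits):
# Scrape a list of URLs with a pause between requests
while read -r url; do
  curl -G -s https://api.supacrawler.com/api/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d url="$url" \
  -d format="markdown"
  sleep 1
done < urls.txt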