Scrape

The Scrape API allows you to extract clean, structured content from any webpage. Get markdown, HTML, or plain text content along with metadata such as the page title and HTTP status code. With format=links, the same endpoint becomes a link discovery tool for site mapping.

Quick example

Extract clean markdown content from any webpage with a simple GET request:

curl -G https://api.supacrawler.com/api/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d url="https://example.com" \
  -d format="markdown"

Response

{
  "success": true,
  "url": "https://example.com",
  "content": "# Example Domain\n\nThis domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.\n\n[More information...](https://www.iana.org/domains/example)",
  "title": "Example Domain",
  "metadata": {
    "status_code": 200
  }
}

The scrape response model

The scrape response contains the extracted content and metadata about the webpage.

Properties

  • Name
    success
    Type
    boolean
    Description

    Indicates whether the scrape operation was successful.

  • Name
    url
    Type
    string
    Description

    The URL that was scraped.

  • Name
    content
    Type
    string
    Description

    The extracted content in the requested format (markdown, HTML, or text). Only present for content formats.

  • Name
    title
    Type
    string
    Description

    The page title extracted from the webpage. Only present for content formats.

  • Name
    links
    Type
    array
    Description

    Array of discovered links. Only present when format=links.

  • Name
    discovered
    Type
    integer
    Description

    Number of links discovered. Only present when format=links.

  • Name
    metadata
    Type
    object
    Description

    Additional metadata about the scrape operation.

    • Name
      status_code
      Type
      integer
      Description

      HTTP status code returned by the target webpage.

    • Name
      depth
      Type
      integer
      Description

      Crawl depth used (only for links format).


GET /v1/scrape

Scrape a webpage

This endpoint allows you to scrape content from any publicly accessible webpage. The content is extracted and cleaned, removing ads, navigation, and other non-essential elements to give you the main content.

Required parameters

  • Name
    url
    Type
    string
    Description

    The URL of the webpage to scrape. Must be a valid HTTP or HTTPS URL.

Optional parameters

  • Name
    format
    Type
    string
    Description

    The format of the returned content. Options: markdown (default), html, text, links.

  • Name
    render_js
    Type
    boolean
    Description

    Use a real browser for rendering (true) or a fast HTTP fetch (false, default). Enable for SPAs and JavaScript-heavy sites.

  • Name
    wait
    Type
    integer
    Description

    Milliseconds to wait for JavaScript rendering. Only applies if render_js=true.

  • Name
    device
    Type
    string
    Description

    Device type for rendering: desktop (default), mobile. Only applies if render_js=true.

  • Name
    depth
    Type
    integer
    Description

    How deep to crawl for links. Only applies when format=links.

  • Name
    max_links
    Type
    integer
    Description

    Maximum number of links to return. Only applies when format=links.

  • Name
    fresh
    Type
    boolean
    Description

    Skip cache and get fresh content (true) or use cached content if available (false, default). Cached content expires after 5 minutes for regular scrapes, 15 minutes for render_js=true.

Request

GET /v1/scrape
curl -G https://api.supacrawler.com/api/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d url="https://example.com" \
  -d format="markdown" \
  -d render_js=false

Response

{
  "success": true,
  "url": "https://example.com",
  "content": "# Example Domain\n\nThis domain is for use in illustrative examples in documents...",
  "title": "Example Domain",
  "metadata": {
    "status_code": 200
  }
}

GET /v1/scrape?format=links

When you set format=links, the scrape endpoint transforms into a link discovery tool that maps all links found on a webpage. This is useful for site mapping, link analysis, and discovering related content.

Required parameters

  • Name
    url
    Type
    string
    Description

    The URL of the webpage to map links from.

  • Name
    format
    Type
    string
    Description

    Must be set to links to enable link discovery mode.

Optional parameters

  • Name
    depth
    Type
    integer
    Description

    How deep to crawl for links. Controls how many levels of links to follow from the starting page.

  • Name
    max_links
    Type
    integer
    Description

    Maximum number of links to return. Controls the total number of links in the response.

Request

GET /v1/scrape
curl -G https://api.supacrawler.com/api/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d url="https://example.com" \
  -d format="links" \
  -d depth=2 \
  -d max_links=100

Response

{
  "success": true,
  "url": "https://example.com",
  "links": [
    "https://example.com/about",
    "https://example.com/contact",
    "https://example.com/products",
    "https://example.com/blog"
  ],
  "discovered": 4,
  "metadata": {
    "status_code": 200,
    "depth": 2
  }
}

Content formats

The scrape API supports multiple output formats to suit different use cases:

Markdown (default)

Clean, structured markdown that preserves formatting while removing clutter. Perfect for content analysis and processing.

HTML

Raw HTML content with cleaning applied. Useful when you need to preserve specific HTML structure or styling.

Text

Plain text with all HTML markup removed. Ideal for text analysis, search indexing, or when you only need the textual content.

Links

Special format that returns all discovered links instead of page content. Great for site mapping and link analysis.
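For example, to get the same page as plain text instead of markdown, change only the format parameter (same endpoint and authentication as in the examples above):

curl -G https://api.supacrawler.com/api/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d url="https://example.com" \
  -d format="text"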


Error handling

The scrape API returns appropriate HTTP status codes and error messages:

  • Name
    400 Bad Request
    Description

    Invalid URL or missing required parameters.

  • Name
    401 Unauthorized
    Description

    Invalid or missing API key.

  • Name
    429 Too Many Requests
    Description

    Rate limit exceeded. Check your plan limits.

  • Name
    500 Internal Server Error
    Description

    Failed to scrape the target URL (may be blocked, unreachable, or invalid).

Error response example

{
  "success": false,
  "url": "https://invalid-url.com",
  "error": "Invalid query parameters",
  "metadata": {
    "status_code": 400
  }
}
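
When scripting against the API, you can branch on the success field before using the content (a minimal sketch assuming a POSIX shell with jq installed):

response=$(curl -sG https://api.supacrawler.com/api/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d url="https://example.com" \
  -d format="markdown")

if [ "$(echo "$response" | jq -r '.success')" = "true" ]; then
  # Use the extracted markdown content
  echo "$response" | jq -r '.content'
else
  # Surface the error message returned by the API
  echo "Scrape failed: $(echo "$response" | jq -r '.error')" >&2
fi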

Best practices

JavaScript-heavy sites

For websites that rely heavily on JavaScript (e.g., Single Page Applications built with React, Vue, or Svelte), you must enable the render_js parameter. This uses a full browser to execute JavaScript and get the final, accurate content. For simple, static sites, leaving render_js=false (the default) will be significantly faster.

curl -G https://api.supacrawler.com/api/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d url="https://spa-example.com" \
  -d format="markdown" \
  -d render_js=true \
  -d wait=3000

Mobile vs Desktop

Different devices can be served different content. Use the device parameter to choose which device to emulate (desktop or mobile). This only has an effect when render_js=true.
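
For example, to render a page as a mobile browser would see it (illustrative URL; device and render_js as documented above):

curl -G https://api.supacrawler.com/api/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d url="https://example.com" \
  -d format="markdown" \
  -d render_js=true \
  -d device="mobile"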

Caching behavior

Supacrawler automatically caches scraped content to improve performance and reduce costs:

  • Regular scrapes: Cached for 5 minutes
  • Render scrapes: Cached for 15 minutes (due to higher computational cost)
  • Fresh parameter: Use fresh=true to bypass cache and get the latest content

Cached request (default)

curl -G https://api.supacrawler.com/api/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d url="https://news-site.com/article" \
  -d format="markdown"

Fresh request (bypasses the cache)

curl -G https://api.supacrawler.com/api/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d url="https://news-site.com/article" \
  -d format="markdown" \
  -d fresh=true

Rate limiting

  • Free tier: 100 requests per hour
  • Pro tier: 1,000 requests per hour
  • Enterprise: Custom limits

Space out your requests appropriately to avoid rate limiting.
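
For batch work, a short pause between calls keeps you comfortably under these limits. A minimal shell sketch (the URLs and one-second delay are illustrative, not prescriptive):

# Scrape a handful of pages sequentially, pausing briefly between requests
for url in "https://example.com/a" "https://example.com/b" "https://example.com/c"; do
  curl -G https://api.supacrawler.com/api/v1/scrape \
    -H "Authorization: Bearer YOUR_API_KEY" \
    -d url="$url" \
    -d format="markdown"
  sleep 1
done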
