Scrape Endpoint

Extract content from web pages using the Python SDK
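
If you don't already have the SDK installed, it is typically available from PyPI (the package name below is assumed to match the import used throughout these docs):

pip install supacrawler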

Basic Usage

Markdown Scrape

from supacrawler import SupacrawlerClient

client = SupacrawlerClient(api_key="YOUR_API_KEY")

# Basic markdown scrape
result = client.scrape("https://supabase.com", format="markdown")
print(result.markdown)
print(result.metadata)

HTML Scrape

# Get both HTML and markdown
result = client.scrape(
    "https://example.com",
    format="markdown",
    include_html=True
)

print("Markdown:", result.markdown)
print("HTML:", result.html)

Links Scrape

# Extract all links from a page
result = client.scrape(
    "https://supacrawler.com",
    format="links",
    depth=2,
    max_links=10
)

print("Discovered links:", result.links)
# Output: ['https://supacrawler.com/pricing', 'https://supacrawler.com/about', ...]
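
A common follow-up is to keep only links on the same host before scraping them individually. A minimal sketch using the standard library (the filtering logic is an illustration, not part of the SDK):

from urllib.parse import urlparse

# Keep only links that stay on the same host as the scraped page
base_host = urlparse("https://supacrawler.com").netloc
internal = [link for link in result.links if urlparse(link).netloc == base_host]

for link in internal:
    page = client.scrape(link, format="markdown")
    print(page.metadata.title)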

Rendering JavaScript

For websites that require JavaScript to load content:

# This will fail for JavaScript-heavy sites
result = client.scrape("https://ai.google.dev/gemini-api/docs", format="markdown")
# Error: Content not loaded properly

# Enable JS rendering
result = client.scrape(
    "https://ai.google.dev/gemini-api/docs",
    format="markdown",
    render_js=True
)
print(result.markdown)  # ✅ Works!

Advanced Options

Fresh Content

Bypass the cache to always get the latest content:

result = client.scrape(
    "https://news-site.com/article",
    format="markdown",
    fresh=True  # Skip cache, always fetch fresh
)

Custom Wait Time

Wait for dynamic content to load:

result = client.scrape(
    "https://dynamic-site.com",
    format="markdown",
    render_js=True,
    wait_for=5000  # Wait 5 seconds for content to load
)

Extract Specific Elements

Use CSS selectors to target specific content:

result = client.scrape(
    "https://example.com",
    format="markdown",
    selector="#main-content"  # Only extract content from #main-content
)

Response Object

The scrape response includes:

from supacrawler.scraper_client.models import ScrapeResponse

result: ScrapeResponse = client.scrape("https://example.com", format="markdown")

# Access properties
print(result.markdown)      # Markdown content
print(result.html)          # HTML content (if include_html=True)
print(result.links)         # Links (if format="links")
print(result.metadata)      # Metadata object

# Metadata properties
print(result.metadata.title)           # Page title
print(result.metadata.description)     # Meta description
print(result.metadata.language)        # Page language
print(result.metadata.status_code)     # HTTP status code
print(result.metadata.source_url)      # Source URL
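
The metadata is handy when saving scraped pages, for example as a simple front-matter header. A minimal sketch, assuming any metadata field may be None:

# Build a front-matter block from the metadata (fields may be None)
meta = result.metadata
front_matter = (
    "---\n"
    f"title: {meta.title or 'untitled'}\n"
    f"source: {meta.source_url or ''}\n"
    f"language: {meta.language or 'unknown'}\n"
    "---\n\n"
)

with open("page.md", "w") as f:
    f.write(front_matter + (result.markdown or ""))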

Format Options

Markdown

result = client.scrape("https://example.com", format="markdown")
print(result.markdown)

HTML

result = client.scrape("https://example.com", format="html")
print(result.html)

Links

result = client.scrape(
    "https://example.com",
    format="links",
    depth=2,        # Crawl 2 levels deep
    max_links=50    # Limit to 50 links
)
print(result.links)

Complete Example

import os
from dotenv import load_dotenv
from supacrawler import SupacrawlerClient
from supacrawler.scraper_client.models import GetV1ScrapeFormat

load_dotenv()

client = SupacrawlerClient(api_key=os.environ.get("SUPACRAWLER_API_KEY"))

# Comprehensive scrape with all options
result = client.scrape(
    url="https://supabase.com/docs",
    format=GetV1ScrapeFormat.MARKDOWN,
    render_js=True,
    include_html=True,
    fresh=False,
    wait_for=3000,
    max_links=100
)

# Process results
if result.markdown:
    print(f"Content length: {len(result.markdown)} characters")
    print(f"Title: {result.metadata.title}")
    print(f"Status: {result.metadata.status_code}")
    
    # Save to file
    with open("output.md", "w") as f:
        f.write(result.markdown)

Error Handling

try:
    result = client.scrape("https://example.com", format="markdown")
except Exception as e:
    print(f"scrape failed: {e}")
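
For transient failures such as timeouts or rate limits, a simple retry with backoff often helps. The SDK's concrete exception classes aren't shown here, so this sketch catches the generic Exception; narrow it to the error types your SDK version exposes:

import time

# Retry up to three times with exponential backoff
# (catching Exception is an assumption; prefer the SDK's specific errors)
for attempt in range(3):
    try:
        result = client.scrape("https://example.com", format="markdown")
        break
    except Exception as e:
        if attempt == 2:
            raise
        print(f"Attempt {attempt + 1} failed: {e}, retrying...")
        time.sleep(2 ** attempt)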
