Scrape Endpoint
Extract content from web pages using the Python SDK
Basic Usage
Markdown Scrape
from supacrawler import SupacrawlerClient
client = SupacrawlerClient(api_key="YOUR_API_KEY")
# Basic markdown scrape
result = client.scrape("https://supabase.com", format="markdown")
print(result.markdown)
print(result.metadata)
HTML Scrape
# Get both HTML and markdown
result = client.scrape(
    "https://example.com",
    format="markdown",
    include_html=True
)
print("Markdown:", result.markdown)
print("HTML:", result.html)
Links Extraction
# Extract all links from a page
result = client.scrape(
    "https://supacrawler.com",
    format="links",
    depth=2,
    max_links=10
)
print("Discovered links:", result.links)
# Output: ['https://supacrawler.com/pricing', 'https://supacrawler.com/about', ...]
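Once you have the discovered links, a common follow-up is to scrape each one in turn. A minimal sketch, assuming result.links is a list of URL strings as shown above:
# Scrape each discovered link as markdown (sketch)
pages = {}
for link in result.links:
    page = client.scrape(link, format="markdown")
    pages[link] = page.markdown
print(f"Scraped {len(pages)} pages")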
Rendering JavaScript
For websites that require JavaScript to load content:
# This will fail for JavaScript-heavy sites
result = client.scrape("https://ai.google.dev/gemini-api/docs", format="markdown")
# Error: Content not loaded properly
# Enable JS rendering
result = client.scrape(
    "https://ai.google.dev/gemini-api/docs",
    format="markdown",
    render_js=True
)
print(result.markdown) # ✅ Works!
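If you are unsure whether a site needs JavaScript, one option is to try a plain scrape first and fall back to render_js=True when the result looks empty. A hedged sketch; the emptiness check is only a heuristic, not part of the SDK:
def scrape_with_js_fallback(url: str):
    # Try a fast, plain scrape first
    result = client.scrape(url, format="markdown")
    # Heuristic: if little or no markdown came back, retry with JS rendering
    if not result.markdown or len(result.markdown) < 200:
        result = client.scrape(url, format="markdown", render_js=True)
    return result

docs = scrape_with_js_fallback("https://ai.google.dev/gemini-api/docs")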
Advanced Options
Fresh Content
Bypass the cache to always get the latest content:
result = client.scrape(
    "https://news-site.com/article",
    format="markdown",
    fresh=True  # Skip cache, always fetch fresh
)
Custom Wait Time
Wait for dynamic content to load:
result = client.scrape(
    "https://dynamic-site.com",
    format="markdown",
    render_js=True,
    wait_for=5000  # Wait 5 seconds for content to load
)
Extract Specific Elements
Use CSS selectors to target specific content:
result = client.scrape(
    "https://example.com",
    format="markdown",
    selector="#main-content"  # Only extract content from #main-content
)
Response Object
The scrape response includes:
from supacrawler.scraper_client.models import ScrapeResponse
result: ScrapeResponse = client.scrape("https://example.com", format="markdown")
# Access properties
print(result.markdown) # Markdown content
print(result.html) # HTML content (if include_html=True)
print(result.links) # Links (if format="links")
print(result.metadata) # Metadata object
# Metadata properties
print(result.metadata.title) # Page title
print(result.metadata.description) # Meta description
print(result.metadata.language) # Page language
print(result.metadata.status_code) # HTTP status code
print(result.metadata.source_url) # Source URL
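For example, you could flatten a response into a plain dict for storage. A sketch, assuming the metadata attributes are simple strings and integers as listed above:
import json

record = {
    "url": result.metadata.source_url,
    "title": result.metadata.title,
    "status_code": result.metadata.status_code,
    "markdown": result.markdown,
}
with open("page.json", "w") as f:
    json.dump(record, f, ensure_ascii=False, indent=2)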
Format Options
Markdown
result = client.scrape("https://example.com", format="markdown")
print(result.markdown)
HTML
result = client.scrape("https://example.com", format="html")
print(result.html)
Links
result = client.scrape(
    "https://example.com",
    format="links",
    depth=2,      # Crawl 2 levels deep
    max_links=50  # Limit to 50 links
)
print(result.links)
Complete Example
import os
from dotenv import load_dotenv
from supacrawler import SupacrawlerClient
from supacrawler.scraper_client.models import GetV1ScrapeFormat
load_dotenv()
client = SupacrawlerClient(api_key=os.environ.get("SUPACRAWLER_API_KEY"))
# Comprehensive scrape with all options
result = client.scrape(
    url="https://supabase.com/docs",
    format=GetV1ScrapeFormat.MARKDOWN,
    render_js=True,
    include_html=True,
    fresh=False,
    wait_for=3000,
    max_links=100
)
# Process results
if result.markdown:
    print(f"Content length: {len(result.markdown)} characters")
    print(f"Title: {result.metadata.title}")
    print(f"Status: {result.metadata.status_code}")

    # Save to file
    with open("output.md", "w") as f:
        f.write(result.markdown)
Error Handling
try:
    result = client.scrape("https://example.com", format="markdown")
except Exception as e:
    print(f"Scrape failed: {e}")
Next Steps
- Crawl Endpoint - Scrape multiple pages recursively
- Parse Endpoint - AI-powered data extraction
- Screenshots Endpoint - Capture visual snapshots