DeepSeek AI Integration

Combine Supacrawler's web scraping with DeepSeek's low-cost language models for affordable content analysis and processing.

APIs Used

This integration uses the Scrape API for single-page content extraction and the Crawl API for multi-page site summarization.

Quick Example

from supacrawler import SupacrawlerClient
from openai import OpenAI
import os

supacrawler = SupacrawlerClient(api_key=os.environ['SUPACRAWLER_API_KEY'])

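# DeepSeek's API is OpenAI-compatible, so the standard openai SDK
# works by pointing base_url at DeepSeek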
deepseek = OpenAI(
    api_key=os.environ['DEEPSEEK_API_KEY'],
    base_url="https://api.deepseek.com"
)

def scrape_and_analyze(url, analysis_prompt):
    # Step 1: Scrape content
    result = supacrawler.scrape(url, format="markdown")
    
    # Step 2: Analyze with DeepSeek
    response = deepseek.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that analyzes web content."},
            {"role": "user", "content": f"Analyze this content from {url}:\n\n{result.content}\n\nFocus: {analysis_prompt}"}
        ],
        max_tokens=2000
    )
    
    return {
        'url': url,
        'title': result.title,
        'analysis': response.choices[0].message.content,
        'cost': (response.usage.prompt_tokens * 0.14 +
                 response.usage.completion_tokens * 0.28) / 1_000_000  # $0.14/1M input, $0.28/1M output
    }

result = scrape_and_analyze(
    url="https://techcrunch.com/ai",
    analysis_prompt="Summarize key AI trends and business opportunities"
)

print(f"Analysis: {result['analysis']}")
print(f"Cost: ${result['cost']:.4f}")

Batch Processing

def batch_analyze_urls(urls, prompt):
    results = []
    total_cost = 0

    for url in urls:
        try:
            result = scrape_and_analyze(url, prompt)
        except Exception as e:
            # Skip failures so one bad page doesn't abort the whole batch
            print(f"❌ Failed {url}: {e}")
            continue
        results.append(result)
        total_cost += result['cost']
        print(f"✅ Processed {url} (${result['cost']:.4f})")

    print(f"\nTotal cost: ${total_cost:.4f}")
    return results

urls = [
    "https://techcrunch.com/article1",
    "https://techcrunch.com/article2",
    "https://techcrunch.com/article3"
]

analyses = batch_analyze_urls(urls, "Extract key insights and business implications")
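
For larger batches, the sequential loop can be parallelized. A minimal sketch using a thread pool, assuming both clients tolerate concurrent requests and your rate limits allow it (batch_analyze_concurrent and max_workers are illustrative names):

from concurrent.futures import ThreadPoolExecutor

def batch_analyze_concurrent(urls, prompt, max_workers=4):
    # Run scrape_and_analyze for several URLs in parallel
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda u: scrape_and_analyze(u, prompt), urls))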

Content Summarization

def summarize_website(url):
    # Crawl entire site
    job = supacrawler.create_crawl_job(
        url=url,
        depth=2,
        link_limit=20
    )
    
    crawl_result = supacrawler.wait_for_crawl(job.job_id)
    
    # Combine all content
    all_content = "\n\n".join([
        page.markdown 
        for page_url, page in crawl_result.data.crawl_data.items()
        if hasattr(page, 'markdown')
    ])
    
    # Summarize with DeepSeek
    response = deepseek.chat.completions.create(
        model="deepseek-chat",
        messages=[{
            "role": "user",
            # Truncate to stay within the model's context window
            "content": f"Provide a comprehensive summary of this website:\n\n{all_content[:50000]}"
        }]
    )
    
    return response.choices[0].message.content
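
For example (the URL here is a placeholder):

summary = summarize_website("https://docs.example.com")
print(summary)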

Cost Comparison

Model           Input ($/1M tokens)   Output ($/1M tokens)   Use Case
DeepSeek Chat   $0.14                 $0.28                  General analysis
GPT-4 Turbo     $10.00                $30.00                 Complex reasoning
GPT-3.5 Turbo   $0.50                 $1.50                  Simple tasks
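
At these rates, a typical single-page analysis of about 5,000 input tokens and 500 output tokens costs roughly $0.0008 with DeepSeek Chat (5,000 × $0.14/1M + 500 × $0.28/1M), versus about $0.065 with GPT-4 Turbo, nearly 80x as much.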

Best Practices

  • Use DeepSeek for cost-sensitive applications
  • Batch process multiple URLs to reduce overhead (see the concurrent sketch above)
  • Cache scraped content to avoid re-scraping (see the caching sketch below)
  • Monitor token usage for cost optimization
  • Combine with Supabase for data storage
  • Use streaming for real-time responses (see the streaming sketch below)
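
A minimal caching sketch, assuming an in-process dictionary is enough for your workload (a persistent store such as Supabase would survive restarts):

_scrape_cache = {}

def cached_scrape(url):
    # Return a previous scrape result instead of re-fetching the page
    if url not in _scrape_cache:
        _scrape_cache[url] = supacrawler.scrape(url, format="markdown")
    return _scrape_cache[url]

For streaming, the OpenAI SDK's stream=True flag works unchanged against DeepSeek's OpenAI-compatible endpoint. A sketch with a placeholder prompt:

stream = deepseek.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Summarize this week's AI news."}],
    stream=True
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)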
