Self-Hosting

Deploy and manage your own Supacrawler instance for complete control over your web scraping infrastructure.

Why Self-Host?

Self-hosting Supacrawler gives you complete control over your infrastructure, eliminates API rate limits, and keeps your data private. Perfect for organizations with specific compliance requirements or high-volume use cases.

Quick Start

The fastest way to self-host Supacrawler is using Docker:

# Download the docker-compose file
curl -O https://raw.githubusercontent.com/Supacrawler/Supacrawler/main/docker-compose.yml

# Start the services
docker compose up -d

Your Supacrawler instance will be available at http://localhost:8081

Architecture

Supacrawler consists of three main components:

  • API Server (Go): Handles HTTP requests and orchestrates scraping jobs
  • Worker Pool (Node.js): Executes browser automation and content extraction
  • Redis: Queue management and caching layer
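
In practice a job touches all three: a client calls the API server, the server enqueues the job in Redis, and an idle worker dequeues it, drives the browser, and writes the result to DATA_DIR or Supabase Storage. As a sketch only (the /v1/scrape endpoint and payload below are hypothetical, not the documented API):

# Hypothetical request; the endpoint and fields are illustrative only
curl -X POST http://localhost:8081/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'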

Production Deployment

version: '3.8'

services:
  supacrawler:
    image: ghcr.io/supacrawler/supacrawler:latest
    ports:
      - "8081:8081"
    environment:
      - REDIS_ADDR=redis:6379
      - HTTP_ADDR=:8081
      - DATA_DIR=/data
      # Optional: Supabase integration
      - SUPABASE_URL=${SUPABASE_URL}
      - SUPABASE_SERVICE_KEY=${SUPABASE_SERVICE_KEY}
      - SUPABASE_STORAGE_BUCKET=screenshots
    volumes:
      - ./data:/data
    depends_on:
      - redis
    restart: unless-stopped

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis-data:/data
    restart: unless-stopped

volumes:
  redis-data:
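
The ${SUPABASE_URL} and ${SUPABASE_SERVICE_KEY} references above are resolved from the shell environment or a .env file next to the compose file. A minimal .env sketch (values are placeholders):

# .env (placeholders; keep this file out of version control)
SUPABASE_URL=https://your-project.supabase.co
SUPABASE_SERVICE_KEY=your-service-role-key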

Kubernetes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: supacrawler
spec:
  replicas: 3
  selector:
    matchLabels:
      app: supacrawler
  template:
    metadata:
      labels:
        app: supacrawler
    spec:
      containers:
      - name: supacrawler
        image: ghcr.io/supacrawler/supacrawler:latest
        ports:
        - containerPort: 8081
        env:
        - name: REDIS_ADDR
          value: "redis-service:6379"
        - name: HTTP_ADDR
          value: ":8081"
        envFrom:
        - configMapRef:
            name: supacrawler-config
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "2000m"
---
apiVersion: v1
kind: Service
metadata:
  name: supacrawler-service
spec:
  selector:
    app: supacrawler
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8081
  type: LoadBalancer
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: supacrawler-config
data:
  DATA_DIR: "/data"
  SUPABASE_URL: "your-supabase-url"
  SUPABASE_STORAGE_BUCKET: "screenshots"
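
The Deployment above defines no probes. Wiring the documented /v1/health endpoint into liveness and readiness checks lets Kubernetes restart unhealthy pods and keep traffic off pods that are not ready. A sketch to add under the container spec (delays and periods are illustrative):

        livenessProbe:
          httpGet:
            path: /v1/health
            port: 8081
          initialDelaySeconds: 10
          periodSeconds: 15
        readinessProbe:
          httpGet:
            path: /v1/health
            port: 8081
          initialDelaySeconds: 5
          periodSeconds: 10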

Configuration

Environment Variables

The variables referenced throughout this guide:

  • REDIS_ADDR: Redis host and port (e.g. redis:6379). Required.
  • HTTP_ADDR: Address the API server listens on (e.g. :8081).
  • DATA_DIR: Local directory for job data and screenshots (e.g. /data).
  • SUPABASE_URL: Supabase project URL. Optional.
  • SUPABASE_SERVICE_KEY: Supabase service-role key. Optional.
  • SUPABASE_STORAGE_BUCKET: Storage bucket for screenshots. Optional.
  • MAX_WORKERS: Maximum number of concurrent scraping workers (see Scaling below).

Optional: Supabase Storage Integration

For production deployments, we recommend using Supabase Storage for screenshot persistence:

  1. Create a Supabase project at supabase.com
  2. Create a storage bucket named screenshots
  3. Add environment variables:
SUPABASE_URL=https://your-project.supabase.co
SUPABASE_SERVICE_KEY=your-service-role-key
SUPABASE_STORAGE_BUCKET=screenshots
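
If you prefer to script step 2 rather than use the dashboard, the Supabase Storage API exposes a bucket-creation endpoint; a hedged sketch (verify against the current Supabase Storage API docs before relying on it):

# Create the screenshots bucket via the Storage API
curl -X POST "$SUPABASE_URL/storage/v1/bucket" \
  -H "apikey: $SUPABASE_SERVICE_KEY" \
  -H "Authorization: Bearer $SUPABASE_SERVICE_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name": "screenshots", "public": false}'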

Scaling

Horizontal Scaling

Run multiple API server instances behind a load balancer:

# Instance 1
docker run -p 8081:8081 -e REDIS_ADDR=redis:6379 ghcr.io/supacrawler/supacrawler:latest

# Instance 2
docker run -p 8082:8081 -e REDIS_ADDR=redis:6379 ghcr.io/supacrawler/supacrawler:latest

# Instance 3
docker run -p 8083:8081 -e REDIS_ADDR=redis:6379 ghcr.io/supacrawler/supacrawler:latest

All instances will share the same Redis queue, allowing for distributed job processing.
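
A minimal Nginx load balancer for the three instances above (upstream name and addresses are illustrative):

# Round-robin across the three API instances
upstream supacrawler_pool {
    server 127.0.0.1:8081;
    server 127.0.0.1:8082;
    server 127.0.0.1:8083;
}

server {
    listen 80;

    location / {
        proxy_pass http://supacrawler_pool;
    }
}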

Worker Pool Optimization

Adjust the number of workers based on your workload:

# High-volume configuration
MAX_WORKERS=50 ./supacrawler

Performance Tip

For optimal performance, allocate 1-2 CPU cores and 500MB-1GB RAM per worker. Monitor resource usage and adjust accordingly.
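
Worked example under that sizing: MAX_WORKERS=8 implies roughly 8-16 CPU cores and 4-8 GB of RAM. With Docker you can enforce a matching cap per container (limits are illustrative):

# Cap a single instance at 8 CPUs and 8 GB RAM
docker run --cpus=8 --memory=8g \
  -p 8081:8081 -e REDIS_ADDR=redis:6379 -e MAX_WORKERS=8 \
  ghcr.io/supacrawler/supacrawler:latest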

Monitoring

Health Checks

Supacrawler exposes a health check endpoint:

curl http://localhost:8081/v1/health

Expected response:

{
  "status": "healthy",
  "redis": "connected",
  "version": "1.0.0"
}
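
You can wire this endpoint into the Compose file so Docker flags and restarts unhealthy containers. A sketch to add under the supacrawler service (assumes curl is available inside the image; swap in wget if it is not):

    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8081/v1/health"]
      interval: 30s
      timeout: 5s
      retries: 3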

Metrics

Monitor key metrics for production deployments:

  • Request rate: Track API requests per second
  • Job queue depth: Monitor pending jobs in Redis
  • Worker utilization: Percentage of busy workers
  • Error rate: Failed scraping jobs
  • Response time: P50, P95, P99 latencies
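
Queue depth is the quickest of these to spot-check by hand with redis-cli. The queue key name below is hypothetical; scan your instance to find the real one:

# Find the job queue key, then check its length
redis-cli -h localhost --scan --pattern '*job*'
redis-cli -h localhost LLEN supacrawler:jobs   # key name is illustrative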

Security

API Authentication

In production, implement API key authentication:

  1. Generate API keys for your users
  2. Add authentication middleware
  3. Validate keys on each request

Example with Nginx reverse proxy:

location /api/ {
    if ($http_authorization != "Bearer your-secure-key") {
        return 401;
    }
    proxy_pass http://supacrawler:8081;
}

Network Security

  • Run Supacrawler in a private network
  • Expose only the API endpoint
  • Use TLS/SSL for all connections
  • Implement rate limiting
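
For the last point, Nginx's limit_req module is a common choice. A sketch (zone name and rates are illustrative; the limit_req_zone directive belongs in the http block):

# Track clients by IP; allow 10 requests/second with short bursts
limit_req_zone $binary_remote_addr zone=supacrawler_rl:10m rate=10r/s;

server {
    location /api/ {
        limit_req zone=supacrawler_rl burst=20 nodelay;
        proxy_pass http://supacrawler:8081;
    }
}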

Backup & Recovery

Redis Persistence

Enable Redis persistence in redis.conf:

save 900 1
save 300 10
save 60 10000

Data Directory

Regularly back up the DATA_DIR:

# Daily backup
tar -czf backup-$(date +%Y%m%d).tar.gz /path/to/data
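
To automate this, schedule it with cron (paths are illustrative; note that % must be escaped in crontab entries):

# Run at 02:00 every day; keep backups on separate storage
0 2 * * * tar -czf /backups/backup-$(date +\%Y\%m\%d).tar.gz /path/to/data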

Troubleshooting

Browser Dependencies

If you encounter "browser not found" errors:

# Install Playwright dependencies
npm install -g playwright
playwright install chromium --with-deps

Redis Connection Issues

Verify Redis is accessible:

redis-cli -h localhost -p 6379 ping

Expected response: PONG

Memory Issues

If workers are crashing due to memory:

  1. Reduce MAX_WORKERS
  2. Increase container memory limits
  3. Enable Redis maxmemory policy
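
For point 3, set a memory ceiling in redis.conf. Be careful with the eviction policy: because Redis holds the job queue, evicting keys can silently drop jobs, so noeviction (the default) is usually the safe choice; an LRU policy only makes sense if Redis serves purely as a cache. A sketch (the 2gb ceiling is illustrative):

maxmemory 2gb
maxmemory-policy noeviction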

Upgrades

To upgrade your Supacrawler instance:

# Pull latest image
docker pull ghcr.io/supacrawler/supacrawler:latest

# Restart services
docker compose down
docker compose up -d

Zero Downtime Upgrades

For production, use rolling updates with multiple instances to achieve zero downtime during upgrades.
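
With the Kubernetes Deployment above (3 replicas), rolling updates are the default behavior; a sketch of triggering and watching one:

# Restart pods one by one with the freshly pulled image, then watch progress
kubectl rollout restart deployment/supacrawler
kubectl rollout status deployment/supacrawler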

Support

Managed Alternative

Don't want to manage infrastructure? Try our managed service at supacrawler.com - 63% cheaper than alternatives with zero maintenance!
