Self-Hosting

Deploy and manage your own Supacrawler instance for complete control over your web scraping infrastructure.

Why Self-Host?

Self-hosting Supacrawler gives you complete control over your infrastructure, eliminates API rate limits, and keeps your data private. Perfect for organizations with specific compliance requirements or high-volume use cases.

Quick Start

The fastest way to self-host Supacrawler is using Docker:

# Download the docker-compose file
curl -O https://raw.githubusercontent.com/Supacrawler/Supacrawler/main/docker-compose.yml

# Start the services
docker compose up -d

Your Supacrawler instance will be available at http://localhost:8081

Architecture

Supacrawler consists of three main components:

  • API Server (Go): Handles HTTP requests and orchestrates scraping jobs
  • Worker Pool (Node.js): Executes browser automation and content extraction
  • Redis: Queue management and caching layer
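
In practice a job touches all three: a client calls the API server, the server enqueues the job in Redis, and an idle worker dequeues it, drives the browser, and writes the result to DATA_DIR or Supabase Storage. As a sketch only (the /v1/scrape endpoint and payload below are hypothetical, not the documented API):

# Hypothetical request; the endpoint and fields are illustrative only
curl -X POST http://localhost:8081/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'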

Production Deployment

version: '3.8'

services:
  supacrawler:
    image: ghcr.io/supacrawler/supacrawler:latest
    ports:
      - "8081:8081"
    environment:
      - REDIS_ADDR=redis:6379
      - HTTP_ADDR=:8081
      - DATA_DIR=/data
      # Optional: Supabase integration
      - SUPABASE_URL=${SUPABASE_URL}
      - SUPABASE_SERVICE_KEY=${SUPABASE_SERVICE_KEY}
      - SUPABASE_STORAGE_BUCKET=screenshots
    volumes:
      - ./data:/data
    depends_on:
      - redis
    restart: unless-stopped

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis-data:/data
    restart: unless-stopped

volumes:
  redis-data:
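
The ${SUPABASE_URL} and ${SUPABASE_SERVICE_KEY} references above are resolved from the shell environment or a .env file next to the compose file. A minimal .env sketch (values are placeholders):

# .env (placeholders; keep this file out of version control)
SUPABASE_URL=https://your-project.supabase.co
SUPABASE_SERVICE_KEY=your-service-role-key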

Kubernetes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: supacrawler
spec:
  replicas: 3
  selector:
    matchLabels:
      app: supacrawler
  template:
    metadata:
      labels:
        app: supacrawler
    spec:
      containers:
      - name: supacrawler
        image: ghcr.io/supacrawler/supacrawler:latest
        ports:
        - containerPort: 8081
        env:
        - name: REDIS_ADDR
          value: "redis-service:6379"
        - name: HTTP_ADDR
          value: ":8081"
        envFrom:
        - configMapRef:
            name: supacrawler-config
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "2000m"
---
apiVersion: v1
kind: Service
metadata:
  name: supacrawler-service
spec:
  selector:
    app: supacrawler
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8081
  type: LoadBalancer
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: supacrawler-config
data:
  DATA_DIR: "/data"
  SUPABASE_URL: "your-supabase-url"
  SUPABASE_STORAGE_BUCKET: "screenshots"
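
The Deployment above defines no probes. Wiring the documented /v1/health endpoint into liveness and readiness checks lets Kubernetes restart unhealthy pods and keep traffic off pods that are not ready. A sketch to add under the container spec (delays and periods are illustrative):

        livenessProbe:
          httpGet:
            path: /v1/health
            port: 8081
          initialDelaySeconds: 10
          periodSeconds: 15
        readinessProbe:
          httpGet:
            path: /v1/health
            port: 8081
          initialDelaySeconds: 5
          periodSeconds: 10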

Configuration

Environment Variables

The variables referenced throughout this guide:

  • REDIS_ADDR: Redis host and port (e.g. redis:6379). Required.
  • HTTP_ADDR: Address the API server listens on (e.g. :8081).
  • DATA_DIR: Local directory for job data and screenshots (e.g. /data).
  • SUPABASE_URL: Supabase project URL. Optional.
  • SUPABASE_SERVICE_KEY: Supabase service-role key. Optional.
  • SUPABASE_STORAGE_BUCKET: Storage bucket for screenshots. Optional.
  • MAX_WORKERS: Maximum number of concurrent scraping workers (see Scaling below).

Optional: Supabase Storage Integration

For production deployments, we recommend using Supabase Storage for screenshot persistence:

  1. Create a Supabase project at supabase.com
  2. Create a storage bucket named screenshots
  3. Add environment variables:
SUPABASE_URL=https://your-project.supabase.co
SUPABASE_SERVICE_KEY=your-service-role-key
SUPABASE_STORAGE_BUCKET=screenshots
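
If you prefer to script step 2 rather than use the dashboard, the Supabase Storage API exposes a bucket-creation endpoint; a hedged sketch (verify against the current Supabase Storage API docs before relying on it):

# Create the screenshots bucket via the Storage API
curl -X POST "$SUPABASE_URL/storage/v1/bucket" \
  -H "apikey: $SUPABASE_SERVICE_KEY" \
  -H "Authorization: Bearer $SUPABASE_SERVICE_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name": "screenshots", "public": false}'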

Scaling

Horizontal Scaling

Run multiple API server instances behind a load balancer:

# Instance 1
docker run -p 8081:8081 -e REDIS_ADDR=redis:6379 ghcr.io/supacrawler/supacrawler:latest

# Instance 2
docker run -p 8082:8081 -e REDIS_ADDR=redis:6379 ghcr.io/supacrawler/supacrawler:latest

# Instance 3
docker run -p 8083:8081 -e REDIS_ADDR=redis:6379 ghcr.io/supacrawler/supacrawler:latest

All instances will share the same Redis queue, allowing for distributed job processing.
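
A minimal Nginx load balancer for the three instances above (upstream name and addresses are illustrative):

# Round-robin across the three API instances
upstream supacrawler_pool {
    server 127.0.0.1:8081;
    server 127.0.0.1:8082;
    server 127.0.0.1:8083;
}

server {
    listen 80;

    location / {
        proxy_pass http://supacrawler_pool;
    }
}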

Worker Pool Optimization

Adjust the number of workers based on your workload:

# High-volume configuration
MAX_WORKERS=50 ./supacrawler

Performance Tip

For optimal performance, allocate 1-2 CPU cores and 500MB-1GB RAM per worker. Monitor resource usage and adjust accordingly.
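
Worked example under that sizing: MAX_WORKERS=8 implies roughly 8-16 CPU cores and 4-8 GB of RAM. With Docker you can enforce a matching cap per container (limits are illustrative):

# Cap a single instance at 8 CPUs and 8 GB RAM
docker run --cpus=8 --memory=8g \
  -p 8081:8081 -e REDIS_ADDR=redis:6379 -e MAX_WORKERS=8 \
  ghcr.io/supacrawler/supacrawler:latest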

Monitoring

Health Checks

Supacrawler exposes a health check endpoint:

curl http://localhost:8081/v1/health

Expected response:

{
  "status": "healthy",
  "redis": "connected",
  "version": "1.0.0"
}
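
You can wire this endpoint into the Compose file so Docker flags and restarts unhealthy containers. A sketch to add under the supacrawler service (assumes curl is available inside the image; swap in wget if it is not):

    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8081/v1/health"]
      interval: 30s
      timeout: 5s
      retries: 3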

Metrics

Monitor key metrics for production deployments:

  • Request rate: Track API requests per second
  • Job queue depth: Monitor pending jobs in Redis
  • Worker utilization: Percentage of busy workers
  • Error rate: Failed scraping jobs
  • Response time: P50, P95, P99 latencies
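
Queue depth is the quickest of these to spot-check by hand with redis-cli. The queue key name below is hypothetical; scan your instance to find the real one:

# Find the job queue key, then check its length
redis-cli -h localhost --scan --pattern '*job*'
redis-cli -h localhost LLEN supacrawler:jobs   # key name is illustrative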

Security

API Authentication

In production, implement API key authentication:

  1. Generate API keys for your users
  2. Add authentication middleware
  3. Validate keys on each request

Example with Nginx reverse proxy:

location /api/ {
    if ($http_authorization != "Bearer your-secure-key") {
        return 401;
    }
    proxy_pass http://supacrawler:8081;
}

Network Security

  • Run Supacrawler in a private network
  • Expose only the API endpoint
  • Use TLS/SSL for all connections
  • Implement rate limiting
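
For the last point, Nginx's limit_req module is a common choice. A sketch (zone name and rates are illustrative; the limit_req_zone directive belongs in the http block):

# Track clients by IP; allow 10 requests/second with short bursts
limit_req_zone $binary_remote_addr zone=supacrawler_rl:10m rate=10r/s;

server {
    location /api/ {
        limit_req zone=supacrawler_rl burst=20 nodelay;
        proxy_pass http://supacrawler:8081;
    }
}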

Backup & Recovery

Redis Persistence

Enable Redis persistence in redis.conf:

save 900 1
save 300 10
save 60 10000

Data Directory

Regularly back up the DATA_DIR:

# Daily backup
tar -czf backup-$(date +%Y%m%d).tar.gz /path/to/data
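
To automate this, schedule it with cron (paths are illustrative; note that % must be escaped in crontab entries):

# Run at 02:00 every day; keep backups on separate storage
0 2 * * * tar -czf /backups/backup-$(date +\%Y\%m\%d).tar.gz /path/to/data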

Troubleshooting

Browser Dependencies

If you encounter "browser not found" errors:

# Install Playwright dependencies
npm install -g playwright
playwright install chromium --with-deps

Redis Connection Issues

Verify Redis is accessible:

redis-cli -h localhost -p 6379 ping

Expected response: PONG

Memory Issues

If workers are crashing due to memory:

  1. Reduce MAX_WORKERS
  2. Increase container memory limits
  3. Enable Redis maxmemory policy
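
For point 3, set a memory ceiling in redis.conf. Be careful with the eviction policy: because Redis holds the job queue, evicting keys can silently drop jobs, so noeviction (the default) is usually the safe choice; an LRU policy only makes sense if Redis serves purely as a cache. A sketch (the 2gb ceiling is illustrative):

maxmemory 2gb
maxmemory-policy noeviction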

Upgrades

To upgrade your Supacrawler instance:

# Pull latest image
docker pull ghcr.io/supacrawler/supacrawler:latest

# Restart services
docker compose down
docker compose up -d

Zero Downtime Upgrades

For production, use rolling updates with multiple instances to achieve zero downtime during upgrades.
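
With the Kubernetes Deployment above (3 replicas), rolling updates are the default behavior; a sketch of triggering and watching one:

# Restart pods one by one with the freshly pulled image, then watch progress
kubectl rollout restart deployment/supacrawler
kubectl rollout status deployment/supacrawler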

Support

Managed Alternative

Don't want to manage infrastructure? Try our managed service at supacrawler.com - 63% cheaper than alternatives with zero maintenance!
