# Self-Hosting

Deploy and manage your own Supacrawler instance for complete control over your web scraping infrastructure.
## Why Self-Host?

Self-hosting Supacrawler gives you full control over your infrastructure, removes API rate limits, and keeps your data private. It is a good fit for organizations with strict compliance requirements or high-volume workloads.
## Quick Start

The fastest way to self-host Supacrawler is with Docker:

```bash
# Download the docker-compose file
curl -O https://raw.githubusercontent.com/Supacrawler/Supacrawler/main/docker-compose.yml

# Start the services
docker compose up -d
```

Your Supacrawler instance will be available at `http://localhost:8081`.
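Containers can take a few seconds to come up, so it helps to block until the API is actually accepting requests before sending jobs. A minimal sketch, assuming the `/v1/health` endpoint described under Monitoring below (the `wait_for` helper name and 30-second budget are illustrative):

```bash
# wait_for CMD [TRIES]: retry a shell command once per second until it
# succeeds or the retry budget is exhausted; returns non-zero on timeout.
wait_for() {
  local cmd=$1 tries=${2:-30} i=0
  while [ "$i" -lt "$tries" ]; do
    if sh -c "$cmd" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Block until the instance responds, then proceed:
# wait_for 'curl -fsS http://localhost:8081/v1/health' 30 && echo "Supacrawler is ready"
```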
## Architecture

Supacrawler consists of three main components:

- **API Server** (Go): handles HTTP requests and orchestrates scraping jobs
- **Worker Pool** (Node.js): executes browser automation and content extraction
- **Redis**: queue management and caching layer
## Production Deployment

### Docker Compose (Recommended)

```yaml
version: '3.8'

services:
  supacrawler:
    image: ghcr.io/supacrawler/supacrawler:latest
    ports:
      - "8081:8081"
    environment:
      - REDIS_ADDR=redis:6379
      - HTTP_ADDR=:8081
      - DATA_DIR=/data
      # Optional: Supabase integration
      - SUPABASE_URL=${SUPABASE_URL}
      - SUPABASE_SERVICE_KEY=${SUPABASE_SERVICE_KEY}
      - SUPABASE_STORAGE_BUCKET=screenshots
    volumes:
      - ./data:/data
    depends_on:
      - redis
    restart: unless-stopped

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis-data:/data
    restart: unless-stopped

volumes:
  redis-data:
```
### Kubernetes

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: supacrawler
spec:
  replicas: 3
  selector:
    matchLabels:
      app: supacrawler
  template:
    metadata:
      labels:
        app: supacrawler
    spec:
      containers:
        - name: supacrawler
          image: ghcr.io/supacrawler/supacrawler:latest
          ports:
            - containerPort: 8081
          env:
            - name: REDIS_ADDR
              value: "redis-service:6379"
            - name: HTTP_ADDR
              value: ":8081"
          envFrom:
            - configMapRef:
                name: supacrawler-config
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "2000m"
---
apiVersion: v1
kind: Service
metadata:
  name: supacrawler-service
spec:
  selector:
    app: supacrawler
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8081
  type: LoadBalancer
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: supacrawler-config
data:
  DATA_DIR: "/data"
  SUPABASE_URL: "your-supabase-url"
  SUPABASE_STORAGE_BUCKET: "screenshots"
```
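With multiple replicas, Kubernetes should only route traffic to pods that are actually serving. A sketch of probes wired to the `/v1/health` endpoint documented under Monitoring (add under the `supacrawler` container spec; the delay and period values are assumptions to tune for your startup time):

```yaml
livenessProbe:
  httpGet:
    path: /v1/health
    port: 8081
  initialDelaySeconds: 10
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /v1/health
    port: 8081
  initialDelaySeconds: 5
  periodSeconds: 10
```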
## Configuration

### Environment Variables

| Variable | Description |
| --- | --- |
| `HTTP_ADDR` | Address the API server listens on (e.g. `:8081`) |
| `REDIS_ADDR` | Redis host and port (e.g. `redis:6379`) |
| `DATA_DIR` | Directory for persisted data and screenshots |
| `MAX_WORKERS` | Number of concurrent workers (see Scaling) |
| `SUPABASE_URL` | Optional: Supabase project URL |
| `SUPABASE_SERVICE_KEY` | Optional: Supabase service-role key |
| `SUPABASE_STORAGE_BUCKET` | Optional: storage bucket for screenshots |
### Optional: Supabase Storage Integration

For production deployments, we recommend using Supabase Storage for screenshot persistence:

- Create a Supabase project at [supabase.com](https://supabase.com)
- Create a storage bucket named `screenshots`
- Add the environment variables:

```bash
SUPABASE_URL=https://your-project.supabase.co
SUPABASE_SERVICE_KEY=your-service-role-key
SUPABASE_STORAGE_BUCKET=screenshots
```
## Scaling

### Horizontal Scaling

Run multiple API server instances behind a load balancer:

```bash
# Instance 1
docker run -p 8081:8081 -e REDIS_ADDR=redis:6379 supacrawler

# Instance 2
docker run -p 8082:8081 -e REDIS_ADDR=redis:6379 supacrawler

# Instance 3
docker run -p 8083:8081 -e REDIS_ADDR=redis:6379 supacrawler
```

All instances share the same Redis queue, allowing distributed job processing.
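One way to front those instances is the same Nginx used later for authentication. A minimal round-robin sketch (the `supacrawler_pool` name and the loopback addresses are placeholders for your topology):

```nginx
upstream supacrawler_pool {
    server 127.0.0.1:8081;
    server 127.0.0.1:8082;
    server 127.0.0.1:8083;
}

server {
    listen 80;

    location / {
        # Requests are distributed round-robin across the shared-queue instances
        proxy_pass http://supacrawler_pool;
    }
}
```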
### Worker Pool Optimization

Adjust the number of workers to match your workload:

```bash
# High-volume configuration
MAX_WORKERS=50 ./supacrawler
```

**Performance Tip**: For optimal performance, allocate 1-2 CPU cores and 500 MB-1 GB of RAM per worker. Monitor resource usage and adjust accordingly.
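That rule of thumb can be turned into a quick sizing check. A sketch, taking the conservative end of the tip above (1 core and 1 GB per worker; the `suggest_workers` helper is hypothetical, so adjust the constants to your own measurements):

```bash
# suggest_workers CORES MEM_MB: worker count bounded by both the ~1 core
# and the ~1 GB of RAM each worker needs.
suggest_workers() {
  local cores=$1 mem_mb=$2
  local by_cpu=$cores
  local by_mem=$((mem_mb / 1024))
  if [ "$by_cpu" -lt "$by_mem" ]; then
    echo "$by_cpu"
  else
    echo "$by_mem"
  fi
}

suggest_workers 16 32768   # a 16-core, 32 GB host -> 16 (CPU-bound)
suggest_workers 8 4096     # an 8-core, 4 GB host  -> 4  (memory-bound)
```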
## Monitoring

### Health Checks

Supacrawler exposes a health check endpoint:

```bash
curl http://localhost:8081/v1/health
```

Expected response:

```json
{
  "status": "healthy",
  "redis": "connected",
  "version": "1.0.0"
}
```
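For a cron-style probe you don't need a JSON parser; a grep-based sketch that checks only the `status` field (the `check_health` helper is illustrative):

```bash
# check_health BODY: succeed only when the JSON body reports "healthy".
check_health() {
  echo "$1" | grep -q '"status": *"healthy"'
}

# Example probe (commented out so it is not run on paste):
# body=$(curl -fsS http://localhost:8081/v1/health) || body=""
# check_health "$body" || echo "ALERT: Supacrawler instance unhealthy"
```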
### Metrics

Monitor these key metrics in production deployments:

- **Request rate**: API requests per second
- **Job queue depth**: pending jobs in Redis
- **Worker utilization**: percentage of busy workers
- **Error rate**: failed scraping jobs
- **Response time**: P50, P95, P99 latencies
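Queue depth in particular is easy to watch from the shell. A sketch with an arbitrary example threshold; note that `supacrawler:jobs` below is a hypothetical key name, not documented here, so inspect your Redis instance (e.g. with `SCAN`) to find the real queue key:

```bash
# alert_if_deep DEPTH [THRESHOLD]: print a warning when the queue backlog
# exceeds the threshold (default 1000, an arbitrary example value).
alert_if_deep() {
  local depth=$1 threshold=${2:-1000}
  if [ "$depth" -gt "$threshold" ]; then
    echo "WARNING: queue depth is $depth"
  fi
}

# Feed it the Redis list length ("supacrawler:jobs" is hypothetical):
# alert_if_deep "$(redis-cli -h localhost -p 6379 LLEN supacrawler:jobs)"
```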
## Security

### API Authentication

In production, implement API key authentication:

- Generate API keys for your users
- Add authentication middleware
- Validate keys on each request

Example with an Nginx reverse proxy:

```nginx
location /api/ {
    if ($http_authorization != "Bearer your-secure-key") {
        return 401;
    }
    proxy_pass http://supacrawler:8081;
}
```
### Network Security

- Run Supacrawler in a private network
- Expose only the API endpoint
- Use TLS/SSL for all connections
- Implement rate limiting
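Rate limiting can live in the same reverse proxy as authentication. A sketch using Nginx's `limit_req` module (the 10 req/s rate, burst of 20, and zone name are placeholder values to tune):

```nginx
# Define a per-client-IP rate-limit zone (goes in the http block)
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

server {
    location / {
        # Allow short bursts; excess requests are rejected
        limit_req zone=api_limit burst=20 nodelay;
        proxy_pass http://supacrawler:8081;
    }
}
```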
## Backup & Recovery

### Redis Persistence

Enable Redis persistence in `redis.conf`. Each `save <seconds> <changes>` line triggers a snapshot when at least that many keys have changed within the interval:

```
save 900 1
save 300 10
save 60 10000
```
### Data Directory

Regularly back up the `DATA_DIR`:

```bash
# Daily backup
tar -czf backup-$(date +%Y%m%d).tar.gz /path/to/data
```
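The one-liner above can grow into a small rotating-backup helper for cron. A sketch (`backup_data` is a hypothetical name and the 7-day retention is an example; adjust paths to your deployment):

```bash
# backup_data DATA_DIR BACKUP_DIR: archive the data directory with today's
# date, then prune archives older than 7 days.
backup_data() {
  local data_dir=$1 backup_dir=$2
  mkdir -p "$backup_dir"
  tar -czf "$backup_dir/backup-$(date +%Y%m%d).tar.gz" \
    -C "$(dirname "$data_dir")" "$(basename "$data_dir")"
  find "$backup_dir" -name 'backup-*.tar.gz' -mtime +7 -delete
}

# backup_data /path/to/data /path/to/backups   # e.g. from a daily cron job
```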
## Troubleshooting

### Browser Dependencies

If you encounter "browser not found" errors:

```bash
# Install Playwright and its browser dependencies
npm install -g playwright
playwright install chromium --with-deps
```
### Redis Connection Issues

Verify Redis is accessible:

```bash
redis-cli -h localhost -p 6379 ping
```

Expected response: `PONG`
### Memory Issues

If workers are crashing due to memory pressure:

- Reduce `MAX_WORKERS`
- Increase container memory limits
- Set a Redis `maxmemory` policy
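A sketch of the last item for `redis.conf` (the 2 GB cap is a placeholder). Be careful with the policy choice: eviction policies such as `allkeys-lru` can silently drop queued jobs, whereas `noeviction` makes writes fail loudly when the cap is hit:

```
# Cap Redis memory usage (placeholder value)
maxmemory 2gb
# Fail writes at the cap instead of evicting keys that may hold queued jobs
maxmemory-policy noeviction
```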
## Upgrades

To upgrade your Supacrawler instance:

```bash
# Pull the latest image
docker pull ghcr.io/supacrawler/supacrawler:latest

# Restart services
docker compose down
docker compose up -d
```

**Zero Downtime Upgrades**: For production, use rolling updates with multiple instances to achieve zero downtime during upgrades.
## Support

- **GitHub Issues**: [github.com/Supacrawler/Supacrawler/issues](https://github.com/Supacrawler/Supacrawler/issues)
- **Discord Community**: Join our Discord server
- **Documentation**: [supacrawler.com/docs](https://supacrawler.com/docs)

## Managed Alternative

Don't want to manage infrastructure? Try our managed service at [supacrawler.com](https://supacrawler.com) - 63% cheaper than alternatives with zero maintenance!