LangChain + pgvector

This example uses Supacrawler to crawl content, LangChain to process and embed it, and pgvector to store and query the vectors in Postgres, giving you a complete RAG pipeline with vector search.

Prerequisites

Enable the pgvector extension in your Postgres database:

  • Supabase: Database → Extensions → enable pgvector
  • Self-hosted: run CREATE EXTENSION IF NOT EXISTS vector; (or enable it programmatically, as sketched below)
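
If you prefer to enable and verify the extension from Python, here is a minimal sketch using SQLAlchemy. It assumes the same DATABASE_URL environment variable used later in the example and a database role that is allowed to create extensions.

import os
from sqlalchemy import create_engine, text

engine = create_engine(os.environ['DATABASE_URL'])

with engine.begin() as conn:
    # Create the extension if it is missing (requires sufficient privileges)
    conn.execute(text('CREATE EXTENSION IF NOT EXISTS vector'))
    # Confirm the installed version
    version = conn.execute(
        text("SELECT extversion FROM pg_extension WHERE extname = 'vector'")
    ).scalar()
    print(f"pgvector version: {version}")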

Install Dependencies

pip install -U langchain langchain-community langchain-text-splitters \
  sqlalchemy langchain-postgres "psycopg[binary]" sentence-transformers \
  supacrawler-py

Complete Example

import os
from sqlalchemy import create_engine
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_postgres import PGVector
from langchain_community.embeddings import HuggingFaceEmbeddings
from supacrawler import SupacrawlerClient

# Configuration
DATABASE_URL = os.environ['DATABASE_URL']
SUPACRAWLER_API_KEY = os.environ['SUPACRAWLER_API_KEY']

# Step 1: Crawl content with Supacrawler
crawler = SupacrawlerClient(api_key=SUPACRAWLER_API_KEY)
job = crawler.create_crawl_job(
    url='https://supabase.com/docs/guides/auth',
    depth=1,
    link_limit=10
)
final = crawler.wait_for_crawl(job.job_id, interval_seconds=3.0)

# Step 2: Build Documents from crawl results
crawl_data = getattr(final.data, 'crawl_data', {}) or {}
docs = [
    Document(
        page_content=page.markdown or '',
        metadata={
            'url': url,
            'title': getattr(getattr(page, 'metadata', None), 'title', None)
        }
    )
    for url, page in crawl_data.items()
    if hasattr(page, 'markdown') and page.markdown
]

print(f"Pages with content: {len(docs)}")

# Step 3: Chunk documents
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = splitter.split_documents(docs)
print(f"Chunks: {len(chunks)}")

# Step 4: Create embeddings (HuggingFace local)
embeddings = HuggingFaceEmbeddings(
    model_name='sentence-transformers/all-MiniLM-L6-v2'
)

# Or use OpenAI embeddings:
# from langchain_openai import OpenAIEmbeddings
# embeddings = OpenAIEmbeddings(
#     model='text-embedding-3-small',
#     api_key=os.environ['OPENAI_API_KEY']
# )

# Step 5: Store in pgvector
engine = create_engine(DATABASE_URL)
store = PGVector(
    connection=engine,
    collection_name='lc_docs',
    embeddings=embeddings,
    use_jsonb=True
)
store.add_documents(chunks)
print("Added chunks to pgvector")

# Step 6: Query
results = store.similarity_search('What are the possible auth methods?', k=3)
for doc in results:
    print(f"URL: {doc.metadata.get('url')}")
    print(f"Title: {doc.metadata.get('title', 'N/A')}")
    print(f"Content: {doc.page_content[:200]}...")
    print("---")

Expected Output

Pages with content: 10
Chunks: 75
Added chunks to pgvector
URL: https://supabase.com/docs/guides/auth
Title: Auth | Supabase Docs
Content: Auth overview and configuration...
---

Next Steps

  • Increase link_limit to crawl more pages
  • Use OpenAI embeddings for better quality (see the commented alternative in Step 4)
  • Create HNSW/IVFFlat indexes for production (see the index sketch below)
  • Implement RAG with your preferred LLM (see the retrieval sketch below)
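
For the index step, a minimal sketch reusing the SQLAlchemy engine from Step 5 is shown below. It assumes the default langchain_postgres table and column names (langchain_pg_embedding.embedding) and pgvector 0.5.0 or newer; adjust the names if your schema differs.

from sqlalchemy import text

with engine.begin() as conn:
    # HNSW index for approximate nearest-neighbour search with cosine distance
    conn.execute(text(
        'CREATE INDEX IF NOT EXISTS lc_embedding_hnsw_idx '
        'ON langchain_pg_embedding USING hnsw (embedding vector_cosine_ops)'
    ))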
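
To turn the query step into a RAG flow, retrieve the top chunks and pass them to an LLM as context. The sketch below reuses the store from Step 5 as a retriever and calls an OpenAI chat model (requires pip install langchain-openai); the model name and prompt are illustrative, not part of the original example.

import os
from langchain_openai import ChatOpenAI

retriever = store.as_retriever(search_kwargs={'k': 3})
llm = ChatOpenAI(model='gpt-4o-mini', api_key=os.environ['OPENAI_API_KEY'])

question = 'What are the possible auth methods?'
context = '\n\n'.join(doc.page_content for doc in retriever.invoke(question))

# Ask the model to answer strictly from the retrieved context
answer = llm.invoke(
    f"Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
print(answer.content)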
