LangChain + pgvector
Use Supacrawler to crawl content, LangChain to split and embed the documents, and pgvector to store and query the vectors in Postgres. This example walks through a complete RAG ingestion and search pipeline.
Prerequisites
Enable the pgvector extension in your Postgres database:
- Supabase: Database → Extensions → enable pgvector
- Self-hosted: run CREATE EXTENSION IF NOT EXISTS vector;
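To confirm the extension is active before ingesting anything, you can query pg_extension. A minimal check, assuming DATABASE_URL points at the same database used in the example below:
import os
from sqlalchemy import create_engine, text
# Optional sanity check: report the installed pgvector version (assumes DATABASE_URL is set)
engine = create_engine(os.environ['DATABASE_URL'])
with engine.connect() as conn:
    row = conn.execute(
        text("SELECT extversion FROM pg_extension WHERE extname = 'vector'")
    ).fetchone()
print('pgvector version:', row[0] if row else 'not installed')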
Install Dependencies
pip install -U langchain langchain-community langchain-text-splitters \
sqlalchemy langchain-postgres "psycopg[binary]" sentence-transformers \
supacrawler-py
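The example reads its configuration from two environment variables. langchain-postgres runs on psycopg 3, so the SQLAlchemy connection string should use the postgresql+psycopg:// driver scheme. A minimal setup sketch; the host, credentials, and API key below are placeholders:
import os
# Placeholder values only; replace with your own connection string and Supacrawler key.
os.environ.setdefault('DATABASE_URL', 'postgresql+psycopg://postgres:postgres@localhost:5432/postgres')
os.environ.setdefault('SUPACRAWLER_API_KEY', 'your-supacrawler-api-key')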
Complete Example
import os
from sqlalchemy import create_engine
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_postgres import PGVector
from langchain_community.embeddings import HuggingFaceEmbeddings
from supacrawler import SupacrawlerClient
# Configuration
DATABASE_URL = os.environ['DATABASE_URL']
SUPACRAWLER_API_KEY = os.environ['SUPACRAWLER_API_KEY']
# Step 1: Crawl content with Supacrawler
crawler = SupacrawlerClient(api_key=SUPACRAWLER_API_KEY)
job = crawler.create_crawl_job(
url='https://supabase.com/docs/guides/auth',
depth=1,
link_limit=10
)
final = crawler.wait_for_crawl(job.job_id, interval_seconds=3.0)
# Step 2: Build Documents from crawl results
crawl_data = getattr(final.data, 'crawl_data', {}) or {}
docs = [
Document(
page_content=page.markdown or '',
metadata={
'url': url,
'title': page.metadata.title if getattr(page, 'metadata', None) else None
}
)
for url, page in crawl_data.items()
if hasattr(page, 'markdown') and page.markdown
]
print(f"Pages with content: {len(docs)}")
# Step 3: Chunk documents
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200
)
chunks = splitter.split_documents(docs)
print(f"Chunks: {len(chunks)}")
# Step 4: Create embeddings (HuggingFace local)
embeddings = HuggingFaceEmbeddings(
model_name='sentence-transformers/all-MiniLM-L6-v2'
)
# Or use OpenAI embeddings:
# from langchain_openai import OpenAIEmbeddings
# embeddings = OpenAIEmbeddings(
# model='text-embedding-3-small',
# api_key=os.environ['OPENAI_API_KEY']
# )
# Step 5: Store in pgvector
engine = create_engine(DATABASE_URL)
store = PGVector(
connection=engine,
collection_name='lc_docs',
embeddings=embeddings,
use_jsonb=True
)
store.add_documents(chunks)
print("Added chunks to pgvector")
# Step 6: Query
results = store.similarity_search('What are the possible auth methods?', k=3)
for doc in results:
    print(f"URL: {doc.metadata.get('url')}")
    print(f"Title: {doc.metadata.get('title', 'N/A')}")
    print(f"Content: {doc.page_content[:200]}...")
    print("---")
Expected Output
Pages with content: 10
Chunks: 75
Added chunks to pgvector
URL: https://supabase.com/docs/guides/auth
Title: Auth | Supabase Docs
Content: Auth overview and configuration...
---
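Beyond the plain similarity_search call in Step 6, PGVector can also return relevance scores and filter on the stored metadata. A small variation that continues from the example above; the query text and filter value are illustrative:
# Scored results restricted to a single source URL (filter uses langchain_postgres JSONB operators)
scored = store.similarity_search_with_score(
    'How do I sign in users with email?',
    k=3,
    filter={'url': {'$eq': 'https://supabase.com/docs/guides/auth'}},
)
for doc, score in scored:
    print(f"{score:.4f}  {doc.metadata.get('url')}")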
Next Steps
- Increase link_limit to crawl more pages
- Use OpenAI embeddings for better quality
- Create HNSW/IVFFlat indexes for production (a sketch follows this list)
- Implement RAG with your preferred LLM
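For larger collections, an approximate-nearest-neighbour index keeps queries fast. A minimal sketch of adding an HNSW index for cosine distance, assuming the default langchain_postgres table and column names (langchain_pg_embedding.embedding); adjust if your schema differs:
import os
from sqlalchemy import create_engine, text
# The index name is arbitrary; HNSW requires pgvector 0.5.0 or newer.
engine = create_engine(os.environ['DATABASE_URL'])
with engine.begin() as conn:
    conn.execute(text(
        "CREATE INDEX IF NOT EXISTS lc_embedding_hnsw_idx "
        "ON langchain_pg_embedding USING hnsw (embedding vector_cosine_ops)"
    ))
From there, store.as_retriever() plugs the same collection into a LangChain RAG chain with whichever LLM you prefer.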