LangChain + pgvector

This example uses:

  • Supacrawler (Python SDK) to crawl/scrape markdown
  • LangChain PGVector vector store (official) to store/query embeddings in Postgres (pgvector)
  • Embeddings provider: HuggingFace (local) or OpenAI (hosted)

Enable the pgvector extension on your database first (see pgvector setup below).

Install

pip install -U langchain langchain-community langchain-text-splitters \
  sqlalchemy langchain-postgres "psycopg[binary]" sentence-transformers supacrawler
# Optional: pip install langchain-openai  # for OpenAIEmbeddings

Python example (matches notebook)

import os
from sqlalchemy import create_engine
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_postgres import PGVector
from langchain_community.embeddings import HuggingFaceEmbeddings
# from langchain_openai import OpenAIEmbeddings  # optional

# postgresql+psycopg tells SQLAlchemy to use the psycopg 3 driver installed above
DATABASE_URL = os.environ.get('DATABASE_URL', 'postgresql+psycopg://postgres:[email protected]:64322/postgres?sslmode=disable')
SUPACRAWLER_API_KEY = os.environ.get('SUPACRAWLER_API_KEY', 'YOUR_API_KEY')

from supacrawler import SupacrawlerClient, JobCreateRequest

crawler = SupacrawlerClient(api_key=SUPACRAWLER_API_KEY)
job = crawler.create_job(JobCreateRequest(
    url='https://supabase.com/docs/guides/auth',
    type='crawl',
    depth=1,
    link_limit=10,
    render_js=False,
))
final = crawler.wait_for_job(job.job_id, interval_seconds=3.0, timeout_seconds=60.0)
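# wait_for_job polls the job status every interval_seconds until the crawl
# completes or timeout_seconds elapses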

# Build Documents from crawl
crawl = getattr(getattr(final, 'data', None), 'crawl_data', None) or {}
docs = []
for url, page in crawl.items():
    if not getattr(page, 'markdown', None):
        continue
    # Titles live on the page metadata object when present
    meta = getattr(page, 'metadata', None)
    docs.append(Document(
        page_content=page.markdown,
        metadata={'url': url, 'title': getattr(meta, 'title', None)},
    ))

# Chunk (sizes are character counts; the overlap preserves context across chunk boundaries)
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)
print(f'Pages with content: {len(docs)}')
print(f'Chunks: {len(chunks)}')

# Embeddings (HF by default)
embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')
# Or: OpenAIEmbeddings(model='text-embedding-3-small', api_key=os.environ['OPENAI_API_KEY'])
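# all-MiniLM-L6-v2 embeds to 384-dimensional vectors and runs locally on CPU;
# the OpenAI option is hosted and requires OPENAI_API_KEY to be set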

# PGVector store (first use creates the collection tables; use_jsonb=True
# stores metadata as JSONB so it can be filtered later)
engine = create_engine(DATABASE_URL)
store = PGVector(connection=engine, collection_name='lc_docs', embeddings=embeddings, use_jsonb=True)
store.add_documents(chunks)
print('Added chunks to pgvector')

# Query
results = store.similarity_search('What are the possible auth methods?', k=3)
for d in results:
    print(d.metadata.get('url'), (d.metadata.get('title') or ''), d.page_content[:200].replace('\n', ' '), '...')

Expected output

Pages with content: 10
Chunks: 75
Added chunks to pgvector
https://supabase.com/docs/guides/auth Auth | Supabase Docs Auth ...
https://supabase.com/docs/guides/auth Auth | Supabase Docs Auth ...
https://supabase.com/docs/guides/auth/users Users | Supabase Docs ...
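
Retriever usage (optional)

The store also plugs into LangChain retrieval pipelines. A minimal sketch: as_retriever is the standard vector-store API, and the $eq metadata filter uses the langchain-postgres JSONB filter syntax (enabled by use_jsonb=True above); the query strings and the filter URL are illustrative.

# Wrap the store as a retriever for use in RAG chains
retriever = store.as_retriever(search_kwargs={'k': 3})
retrieved = retriever.invoke('How do I enable email sign-in?')

# Constrain a search to a single crawled page via the JSONB metadata filter
results = store.similarity_search(
    'What are the possible auth methods?',
    k=3,
    filter={'url': {'$eq': 'https://supabase.com/docs/guides/auth'}},
)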

Notebook

See supacrawler-py/examples/langchain_vectors.ipynb for a runnable version with outputs.

pgvector setup

Enable pgvector on your database first.

  • Supabase: Database → Extensions → enable pgvector.
  • Self‑hosted Postgres:

create extension if not exists vector;

For production, create HNSW or IVFFlat indexes per Supabase guidance.
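
For example, a sketch assuming the default langchain-postgres table and column names (langchain_pg_embedding, embedding) and cosine distance:

-- HNSW: better recall/latency trade-off at query time, slower to build
create index on langchain_pg_embedding using hnsw (embedding vector_cosine_ops);

-- IVFFlat: faster to build; tune lists (roughly rows / 1000)
create index on langchain_pg_embedding using ivfflat (embedding vector_cosine_ops) with (lists = 100);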
