ArXiv Math Semantic Search

AI-Powered Search and Q&A Over 700,000+ Mathematics Papers

January 12, 2026 • Tool Announcement • 5 min read

Try It Now

The MVP is live and free to use. Ask questions about mathematical research, search for papers by concept, or explore connections across the arXiv corpus.

Launch ArXiv Search →

MVP Notice: This is an early preview. The tool is currently free and unauthenticated, but this will change soon as we add user accounts and usage tracking. Feedback welcome!

What Is It?

A semantic search engine over the arXiv mathematics corpus that understands meaning, not just keywords. Ask natural-language questions like "What did Rivin prove about hyperbolic geometry?" or "What are recent advances in random matrix theory?" and get relevant papers along with AI-synthesized answers.

729K papers
290M text chunks
Multiple LLM options
<1s search time

How It Works

1. Paper Processing

We extract the full LaTeX source of each arXiv paper (not just the abstract), split it into meaningful segments, and index both the text and the mathematical content. As a result, you can search for specific theorems, definitions, or proof techniques.
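As a rough illustration of what chunking LaTeX source can look like, here is a minimal sketch. It is not the production pipeline: the environment list, the `chunk_latex` name, and the 1200-character budget are all assumptions made for the example.

```python
import re

# Assumed set of theorem-like environments that become standalone chunks.
MATH_ENVS = ("theorem", "lemma", "proposition", "corollary", "definition", "proof")

ENV_PATTERN = re.compile(
    r"\\begin\{(" + "|".join(MATH_ENVS) + r")\}(.*?)\\end\{\1\}",
    re.DOTALL,
)


def _prose_chunks(text: str, max_chars: int) -> list[dict]:
    """Group blank-line-separated paragraphs into chunks of roughly max_chars.

    A single paragraph longer than max_chars is kept whole.
    """
    out, buffer = [], ""
    for para in re.split(r"\n\s*\n", text):
        para = para.strip()
        if not para:
            continue
        if buffer and len(buffer) + len(para) + 2 > max_chars:
            out.append({"kind": "prose", "text": buffer})
            buffer = para
        else:
            buffer = f"{buffer}\n\n{para}" if buffer else para
    if buffer:
        out.append({"kind": "prose", "text": buffer})
    return out


def chunk_latex(source: str, max_chars: int = 1200) -> list[dict]:
    """Split LaTeX source into theorem-like environments and prose chunks."""
    chunks, cursor = [], 0
    for match in ENV_PATTERN.finditer(source):
        # Prose between the previous environment and this one.
        chunks.extend(_prose_chunks(source[cursor:match.start()], max_chars))
        chunks.append({"kind": match.group(1), "text": match.group(2).strip()})
        cursor = match.end()
    chunks.extend(_prose_chunks(source[cursor:], max_chars))
    return chunks
```

Tagging each chunk with its environment name is one way to make queries for "theorems" or "definitions" possible; nested environments and custom macros would need more care than this sketch takes.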

2. Hybrid Search

The search pipeline combines two approaches:

Semantic search: BGE embeddings + pgvector find conceptually similar content
Full-text search: PostgreSQL GIN indexes for exact author/keyword matches

Author queries (e.g., "Tell me about Sarnak's work") automatically use hybrid mode: full-text to find the author's papers, then vector re-ranking for relevance.
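The author-query flow described above (filter by text match, then re-rank by vector similarity) can be sketched in memory. This is only an illustration under stated assumptions: the substring filter stands in for PostgreSQL full-text search, the 3-dimensional vectors stand in for real embeddings, and `hybrid_search` is a hypothetical name.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(query_vec, author, chunks, top_k=3):
    """Keep chunks whose author field matches, then re-rank by similarity."""
    # Stage 1: exact-match filter (stand-in for a full-text index lookup).
    candidates = [c for c in chunks if author.lower() in c["authors"].lower()]
    # Stage 2: vector re-ranking of the surviving candidates.
    ranked = sorted(
        candidates,
        key=lambda c: cosine(query_vec, c["embedding"]),
        reverse=True,
    )
    return ranked[:top_k]


# Toy data: ids and embeddings are made up for the example.
toy_chunks = [
    {"id": "p1", "authors": "Sarnak", "embedding": [1.0, 0.0, 0.0]},
    {"id": "p2", "authors": "Sarnak; Rudnick", "embedding": [0.6, 0.8, 0.0]},
    {"id": "p3", "authors": "Tao", "embedding": [1.0, 0.0, 0.0]},
]
hits = hybrid_search([0.8, 0.6, 0.0], "sarnak", toy_chunks)
print([h["id"] for h in hits])
```

Filtering first keeps the re-ranking step cheap and guarantees that every result actually belongs to the named author, which pure vector similarity cannot promise.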

3. Multi-LLM Chat

Retrieved chunks are passed to your choice of language model for synthesis:

DeepSeek v3 – Fast, good at math (default)
GPT-5.2 – OpenAI's latest
Claude Sonnet 4.5 – Anthropic's balanced model
Claude Opus 4.5 – Maximum capability
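To make the synthesis step concrete, here is a minimal sketch of how retrieved chunks might be packed into a prompt and routed to a provider. The function name, prompt wording, and chunk fields are assumptions for illustration; the display names and providers come from this post, and no real API model identifiers are used.

```python
# Display name -> provider, as listed in this post.
MODELS = {
    "DeepSeek v3": "Novita",
    "GPT-5.2": "OpenAI",
    "Claude Sonnet 4.5": "Anthropic",
    "Claude Opus 4.5": "Anthropic",
}
DEFAULT_MODEL = "DeepSeek v3"


def build_request(question: str, chunks: list[dict], model: str = DEFAULT_MODEL) -> dict:
    """Assemble a provider-agnostic request from a question and retrieved chunks."""
    if model not in MODELS:
        raise ValueError(f"Unknown model: {model!r}")
    # Number each chunk so the answer can cite its sources as [1], [2], ...
    context = "\n\n".join(
        f"[{i}] ({c['arxiv_id']}) {c['text']}" for i, c in enumerate(chunks, start=1)
    )
    prompt = (
        "Answer the question using only the excerpts below. "
        "Cite excerpts by their bracketed number.\n\n"
        f"Excerpts:\n{context}\n\nQuestion: {question}"
    )
    return {"provider": MODELS[model], "model": model, "prompt": prompt}
```

Keeping the prompt assembly separate from the provider call is one way to let the same retrieved context be sent to any of the listed models.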

Example Queries

"What are the main results in geometric group theory from 2024?"
"Explain the connection between random matrices and number theory"
"What did Tao prove about prime gaps?"
"Recent advances in machine learning for theorem proving"
"Compare different approaches to the Riemann hypothesis"

Technical Stack

Database: PostgreSQL + pgvector for vector similarity search
Embeddings: BGE-large-en-v1.5 (1024 dimensions)
Backend: FastAPI (Python)
LLM APIs: Novita (DeepSeek), OpenAI, Anthropic
Indexing: Parallel workers with incremental updates
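For readers curious how PostgreSQL and pgvector fit together, the sketch below shows one plausible schema and hybrid query. The table name, column names, index names, and the choice of an HNSW index are assumptions, not the actual deployment; the `%(name)s` placeholders follow the psycopg parameter style, and nothing here connects to a database.

```python
EMBEDDING_DIM = 1024  # BGE-large-en-v1.5 output size, per the stack list above

# Hypothetical schema: one row per text chunk.
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id        bigserial PRIMARY KEY,
    paper_id  text NOT NULL,
    authors   text NOT NULL,
    body      text NOT NULL,
    embedding vector(1024)
);
CREATE INDEX IF NOT EXISTS chunks_fts_idx
    ON chunks USING gin (to_tsvector('english', authors || ' ' || body));
CREATE INDEX IF NOT EXISTS chunks_vec_idx
    ON chunks USING hnsw (embedding vector_cosine_ops);
"""

# Full-text filter first, then order by cosine distance (pgvector's <=> operator).
HYBRID_QUERY_SQL = """
SELECT id, paper_id, 1 - (embedding <=> %(qvec)s::vector) AS similarity
FROM chunks
WHERE to_tsvector('english', authors || ' ' || body)
      @@ plainto_tsquery('english', %(terms)s)
ORDER BY embedding <=> %(qvec)s::vector
LIMIT %(k)s;
"""


def vector_literal(vec: list[float]) -> str:
    """Format an embedding in pgvector's text form, e.g. '[0.1,0.2]'."""
    if len(vec) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} dimensions, got {len(vec)}")
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"


def query_params(vec: list[float], terms: str, k: int = 10) -> dict:
    """Build the parameter dict for HYBRID_QUERY_SQL."""
    return {"qvec": vector_literal(vec), "terms": terms, "k": k}
```

The expression in the `WHERE` clause matches the expression in the GIN index definition so that PostgreSQL can use the index for the full-text filter.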

Current Status

The system is actively indexing papers. Current coverage:

2022-2025: Fully indexed (~200k papers)
2010-2021: In progress
Pre-2010: Queued

Search already works on the papers indexed so far. Full corpus coverage is expected within a few weeks.

What's Next

User authentication: Supabase integration for accounts and usage tracking
Citation graph: "Papers that cite this" and influence mapping
Personalization: Saved searches, reading lists, alerts
API access: Programmatic search for research tools

Questions or feedback? Get in touch.