ArXiv Math Semantic Search
AI-Powered Search and Q&A Over 700,000+ Mathematics Papers
Try It Now
The MVP is live and free to use. Ask questions about mathematical research, search for papers by concept, or explore connections across the arXiv corpus.
Launch ArXiv Search →
What Is It?
A semantic search engine over the arXiv mathematics corpus that understands meaning, not just keywords. Ask natural language questions like "What did Rivin prove about hyperbolic geometry?" or "What are recent advances in random matrix theory?" and get relevant papers with AI-synthesized answers.
How It Works
1. Paper Processing
We extract the full LaTeX source of each arXiv paper (not just the abstract), split it into meaningful segments, and index both the prose and the mathematical content. This lets you search for specific theorems, definitions, or proof techniques.
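As a rough illustration of the chunking step, here is a minimal sketch that splits LaTeX source at section commands and then packs paragraphs into size-capped chunks. The function name `chunk_latex`, the `max_chars` limit, and the choice of split points are assumptions for illustration only; the production pipeline may segment papers differently.

```python
import re

# Zero-width split point just before \section{...}, \subsection{...}, etc.
SECTION_RE = re.compile(r"(?=\\(?:sub)*section\*?\{)")


def chunk_latex(source: str, max_chars: int = 1500) -> list[str]:
    """Split LaTeX source into chunks (hypothetical helper).

    Sections are never merged with each other; within a section,
    paragraphs (separated by blank lines) are packed together until
    adding another would exceed max_chars.
    """
    chunks = []
    for section in SECTION_RE.split(source):
        current = ""
        for para in re.split(r"\n\s*\n", section.strip()):
            para = para.strip()
            if not para:
                continue
            if current and len(current) + len(para) + 2 > max_chars:
                chunks.append(current)
                current = para
            else:
                current = f"{current}\n\n{para}" if current else para
        if current:
            chunks.append(current)
    return chunks
```

A single paragraph longer than `max_chars` is kept whole in this sketch rather than cut mid-sentence, which keeps theorem statements and proofs intact at the cost of occasional oversized chunks.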
2. Hybrid Search
The search pipeline combines two approaches:
- Semantic search: BGE embeddings + pgvector find conceptually similar content
- Full-text search: PostgreSQL GIN indexes for exact author/keyword matches
Author queries (e.g., "Tell me about Sarnak's work") automatically use hybrid mode: full-text search first finds the author's papers, then vector similarity re-ranks them by relevance to the query.
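The two-stage flow for author queries can be sketched in memory as follows. This is an illustration only: in the real system the first stage would be a PostgreSQL full-text query and the second a pgvector similarity search, whereas here both are replaced by plain Python stand-ins. The function names, the chunk dictionary fields (`authors`, `text`, `embedding`), and the toy two-dimensional vectors are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_search(query_terms, query_vec, chunks, top_k=5):
    """Keyword filter first, then re-rank the survivors by similarity."""
    terms = [t.lower() for t in query_terms]
    # Stage 1: stand-in for full-text search over author and text fields.
    candidates = [
        c for c in chunks
        if any(t in f"{c['authors']} {c['text']}".lower() for t in terms)
    ]
    # Stage 2: stand-in for vector re-ranking of the candidates.
    ranked = sorted(
        candidates,
        key=lambda c: cosine(query_vec, c["embedding"]),
        reverse=True,
    )
    return ranked[:top_k]
```

Filtering before ranking means a chunk that is semantically close but written by a different author never enters the result list, which is the behavior an author query wants.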
3. Multi-LLM Chat
Retrieved chunks are passed to your choice of language model for synthesis:
- DeepSeek v3 – Fast, good at math (default)
- GPT-5.2 – OpenAI's latest
- Claude Sonnet 4.5 – Anthropic's balanced model
- Claude Opus 4.5 – Maximum capability
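To illustrate the synthesis step, the sketch below selects a model and assembles retrieved chunks into a single prompt. The registry keys, the prompt wording, and the chunk fields (`title`, `text`) are hypothetical; only the model names, the default, and the API providers come from this page. No API call is made here.

```python
# Hypothetical registry mapping a model key to its API provider.
MODELS = {
    "deepseek-v3": "novita",
    "gpt-5.2": "openai",
    "claude-sonnet-4.5": "anthropic",
    "claude-opus-4.5": "anthropic",
}
DEFAULT_MODEL = "deepseek-v3"


def resolve_model(name=None):
    """Return (model_key, provider), falling back to the default model."""
    key = name or DEFAULT_MODEL
    if key not in MODELS:
        raise ValueError(f"unknown model: {key}")
    return key, MODELS[key]


def build_prompt(question, chunks):
    """Number each retrieved chunk so the answer can cite it as [n]."""
    context = "\n\n".join(
        f"[{i}] {c['title']}\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the numbered excerpts below, "
        "citing them as [n].\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```

Because the prompt is built independently of the provider, the same retrieved context can be sent to any of the listed models.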
Example Queries
- "What are the main results in geometric group theory from 2024?"
- "Explain the connection between random matrices and number theory"
- "What did Tao prove about prime gaps?"
- "Recent advances in machine learning for theorem proving"
- "Compare different approaches to the Riemann hypothesis"
Technical Stack
- Database: PostgreSQL + pgvector for vector similarity search
- Embeddings: BGE-large-en-v1.5 (1024 dimensions)
- Backend: FastAPI (Python)
- LLM APIs: Novita (DeepSeek), OpenAI, Anthropic
- Indexing: Parallel workers with incremental updates
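The idea of parallel workers with incremental updates can be sketched as below: papers already indexed are skipped, and the rest are processed by a worker pool. This is a simplified assumption of how such an indexer might look. `index_paper` is a placeholder for the real download, chunk, embed, and store work, and the in-memory set stands in for whatever persistent record the system actually keeps.

```python
from concurrent.futures import ThreadPoolExecutor


def index_paper(paper_id: str) -> str:
    """Placeholder for the real per-paper work; returns the ID when done."""
    return paper_id


def incremental_index(all_ids, already_indexed, workers=4):
    """Index only papers not yet in already_indexed, using a thread pool.

    already_indexed is a set that is updated in place; the list of
    newly indexed IDs is returned.
    """
    pending = [p for p in all_ids if p not in already_indexed]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for done in pool.map(index_paper, pending):
            already_indexed.add(done)
    return pending
```

Running the function a second time with the same inputs does no work, which is what makes repeated update passes cheap.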
Current Status
The system is actively indexing papers. Current coverage:
- 2022-2025: Fully indexed (~200k papers)
- 2010-2021: In progress
- Pre-2010: Queued
Search already works on every indexed paper. Full corpus coverage is expected within a few weeks.
What's Next
- User authentication: Supabase integration for accounts and usage tracking
- Citation graph: "Papers that cite this" and influence mapping
- Personalization: Saved searches, reading lists, alerts
- API access: Programmatic search for research tools
Questions or feedback? Get in touch.