May 15, 2026
In research mathematics, frontier LLMs fail in the opposite direction from the usual
complaint: they refuse too little. When a prompt is malformed they silently fill in
the blanks and produce a confident wrong answer. We found over half of problems in a
curated benchmark were not well-posed — and no solver ever said so. Here's what
we built to fix it.
New
RLHF
Research Mathematics
Benchmarking
Polya
January 12, 2026
AI-powered semantic search and Q&A over 700,000+ arXiv mathematics papers.
Ask questions, search by concept, explore connections. Free MVP live now.
Live MVP
RAG
pgvector
January 12, 2026
Using Aristotle to automatically verify LLM-generated Lean proofs from Polya-Szego.
80/80 submissions verified successfully (100% success rate). The "30x gap" between
informal reasoning and formal verification can be bridged with the right tools.
Aristotle
Lean4
Formal Verification
January 2, 2026
Evaluating frontier LLMs on 320 problems from Polya-Szego's classic text.
Key finding: 95.4% informal reasoning accuracy but only 2.8% complete Lean4 proofs.
The bottleneck in AI mathematics is formalization, not understanding.
DeepSeek-Prover
Mathematical AI
Benchmarking
January 1, 2026
Benchmarking Mathpix against LLM-based OCR for mathematical PDFs.
Gemini 3 Flash is 6x cheaper and more accurate for extracting equations from
academic papers and textbooks.
OCR
Gemini
LaTeX