← Back to Home

Blog

Notes on AI, mathematics, and formal verification

Frontier LLMs Refuse Too Little, or RLHF Considered Harmful

May 15, 2026

In research mathematics, frontier LLMs fail in the opposite direction from the usual complaint: they refuse too little. When a prompt is malformed they silently fill in the blanks and produce a confident wrong answer. We found over half of problems in a curated benchmark were not well-posed — and no solver ever said so. Here's what we built to fix it.

New RLHF Research Mathematics Benchmarking Polya

ArXiv Math Semantic Search

January 12, 2026

AI-powered semantic search and Q&A over 700,000+ arXiv mathematics papers. Ask questions, search by concept, explore connections. Free MVP live now.

Live MVP RAG pgvector

Polya-Szego + Aristotle: From 2.8% to 100% Verified Proofs

January 12, 2026

Using Aristotle to automatically verify LLM-generated Lean proofs from Polya-Szego. 80/80 submissions verified successfully (100% success rate). The "30x gap" between informal reasoning and formal verification can be bridged with the right tools.

Aristotle Lean4 Formal Verification

Polya-Szego Evaluation: LLM Reasoning vs Formal Verification

January 2, 2026

Evaluating frontier LLMs on 320 problems from Polya-Szego's classic text. Key finding: 95.4% informal reasoning accuracy but only 2.8% complete Lean4 proofs. The bottleneck in AI mathematics is formalization, not understanding.

DeepSeek-Prover Mathematical AI Benchmarking

Math OCR Benchmark: Why Gemini Flash Beats Mathpix

January 1, 2026

Benchmarking Mathpix against LLM-based OCR for mathematical PDFs. Gemini 3 Flash is 6x cheaper and more accurate for extracting equations from academic papers and textbooks.

OCR Gemini LaTeX