Blog | Igor Rivin

Frontier LLMs Refuse Too Little, or RLHF Considered Harmful

May 15, 2026

In research mathematics, frontier LLMs fail in the opposite direction from the usual complaint: they refuse too little. When a prompt is malformed they silently fill in the blanks and produce a confident wrong answer. We found over half of problems in a curated benchmark were not well-posed — and no solver ever said so. Here's what we built to fix it.

New RLHF Research Mathematics Benchmarking Polya

ArXiv Math Semantic Search

January 12, 2026

AI-powered semantic search and Q&A over 700,000+ arXiv mathematics papers. Ask questions, search by concept, explore connections. Free MVP live now.

Live MVP RAG pgvector

Polya-Szego + Aristotle: From 2.8% to 100% Verified Proofs

January 12, 2026

Using Aristotle to automatically verify LLM-generated Lean proofs from Polya-Szego. 80/80 submissions verified successfully (100% success rate). The "30x gap" between informal reasoning and formal verification can be bridged with the right tools.

Aristotle Lean4 Formal Verification

Polya-Szego Evaluation: LLM Reasoning vs Formal Verification

January 2, 2026

Evaluating frontier LLMs on 320 problems from Polya-Szego's classic text. Key finding: 95.4% informal reasoning accuracy but only 2.8% complete Lean4 proofs. The bottleneck in AI mathematics is formalization, not understanding.

DeepSeek-Prover Mathematical AI Benchmarking

Math OCR Benchmark: Why Gemini Flash Beats Mathpix

January 1, 2026

Benchmarking Mathpix against LLM-based OCR for mathematical PDFs. Gemini 3 Flash is 6x cheaper and more accurate for extracting equations from academic papers and textbooks.

OCR Gemini LaTeX