Research Reports | Igor Rivin

Frontier LLMs Refuse Too Little, or RLHF Considered Harmful

May 15, 2026 • Essay

In research mathematics, frontier LLMs exhibit the opposite of the standard alignment complaint: they refuse too little. When prompts are malformed they silently fill in missing definitions and emit confident wrong answers. Over half of problems in a curated commercial benchmark were flagged as not well-posed by our validator — and no solver ever objected. Three mechanisms compound: RLHF reward bias toward helpfulness, structured-output schemas that suppress refusal, and extended thinking that resolves rather than flags ambiguity. Introduces Polya, a free well-formedness validator with 3-model consensus mode.

New RLHF Research Mathematics Benchmarking Polya

ArXiv Math Semantic Search

January 12, 2026 • Tool Announcement

AI-powered semantic search and Q&A over 700,000+ arXiv mathematics papers. Ask natural language questions, search by concept, explore author connections. Free MVP available at igor.ngrok.pro.

Live MVP Free Semantic Search RAG

Polya-Szego + Aristotle: From 2.8% to 100% Verified Proofs

January 12, 2026 • Research Update

Follow-up to our initial benchmark: using Aristotle to automatically verify LLM-generated Lean proofs. 80/80 submissions verified successfully (100% success rate). Aristotle corrects formalization errors and synthesizes complete, machine-checkable proofs. Data release: github.com/igorrivin/polya-szego-lean

Aristotle 100% Verified Lean4 Automated Proving

Polya-Szego Evaluation: LLM Reasoning vs Formal Verification

January 2, 2026 • Research Report

Comprehensive evaluation of frontier LLMs on 320 problems from Polya-Szego's "Problems and Theorems in Analysis I" (1925). Key finding: 95.4% informal reasoning accuracy but only 2.8% complete Lean4 proofs. The 30x gap reveals that formal verification, not understanding, is the bottleneck in AI mathematics.

Lean4 DeepSeek-Prover Mathematical AI Formal Verification Benchmarking