← Back to Home
Research Reports
Technical reports on AI, mathematics, and formal verification
May 15, 2026 • Essay
In research mathematics, frontier LLMs exhibit the opposite of the standard alignment complaint:
they refuse too little. When prompts are malformed they silently fill in missing definitions and emit
confident wrong answers. Over half of problems in a curated commercial benchmark were flagged as not
well-posed by our validator — and no solver ever objected. Three mechanisms compound: RLHF reward
bias toward helpfulness, structured-output schemas that suppress refusal, and extended thinking that
resolves rather than flags ambiguity. Introduces Polya,
a free well-formedness validator with 3-model consensus mode.
New
RLHF
Research Mathematics
Benchmarking
Polya
January 12, 2026 • Tool Announcement
AI-powered semantic search and Q&A over 700,000+ arXiv mathematics papers.
Ask natural language questions, search by concept, explore author connections.
Free MVP available at igor.ngrok.pro.
Live MVP
Free
Semantic Search
RAG
January 12, 2026 • Research Update
Follow-up to our initial benchmark: using Aristotle to automatically verify
LLM-generated Lean proofs. 80/80 submissions verified successfully (100% success rate).
Aristotle corrects formalization errors and synthesizes complete, machine-checkable proofs.
Data release: github.com/igorrivin/polya-szego-lean
Aristotle
100% Verified
Lean4
Automated Proving
January 2, 2026 • Research Report
Comprehensive evaluation of frontier LLMs on 320 problems from Polya-Szego's
"Problems and Theorems in Analysis I" (1925). Key finding: 95.4% informal reasoning
accuracy but only 2.8% complete Lean4 proofs. The 30x gap reveals that formal
verification, not understanding, is the bottleneck in AI mathematics.
Lean4
DeepSeek-Prover
Mathematical AI
Formal Verification
Benchmarking