Polya-Szego + Aristotle
From 2.8% to 100%: Automated Proof Verification with Aristotle
Key Result
Using Aristotle, an automated Lean proof assistant, we achieved 100% verification success on 80 submitted Polya-Szego formalizations. Aristotle automatically corrects formalization errors and synthesizes complete, machine-checkable proofs.
1. Recap: The Formalization Gap
In our initial benchmark, we found that frontier LLMs achieve 95%+ accuracy on informal mathematical reasoning but only 2.8% complete Lean proofs. We called this the "30x gap"—models understand mathematics but cannot produce formally verified proofs.
2. Enter Aristotle
Aristotle is an automated theorem prover that takes informal mathematical statements (or rough Lean formalizations) and produces complete, verified Lean 4 proofs. Unlike traditional proof assistants that require expert guidance, Aristotle operates autonomously.
2.1 Our Pipeline
- LLM Formalization: DeepSeek V3.2 generates initial Lean 4 statements from Polya-Szego problems
- Aristotle Submission: Each formalization is submitted with problem context
- Autonomous Proving: Aristotle corrects errors and synthesizes proofs
- Verification: Output is machine-checkable Lean 4 code
2.2 Results
| Metric | Value |
|---|---|
| Problems submitted | 80 |
| Successfully verified | 80 (100%) |
| Remaining to submit | 238 |
| Total formalizations | 318 |
3. Worked Example: Problem 112
Let's trace a complete example from original problem to verified proof.
3.1 Original Problem (Polya-Szego)
Prove that $\text{Re}(\bar{a} \cdot e^{i\varphi}) = |a| \cos(\varphi - \arg a)$ for any complex number $a$ and real $\varphi$.
3.2 LLM Formalization (DeepSeek V3.2)
The initial formalization had identifier errors (Complex.abs instead of norm,
conj instead of star):
theorem problem_112_part_one :
forall (a : C) (phi : R),
Complex.re (conj a * Complex.exp (Complex.I * phi)) =
Complex.abs a * Real.cos (phi - Complex.arg a) := by
sorry
3.3 Aristotle Output
Aristotle automatically corrected the identifiers and produced a complete proof:
/-
This file was generated by Aristotle.
Lean version: leanprover/lean4:v4.24.0
Mathlib version: f897ebcf72cd16f89ab4577d0c826cd14afaafc7
-/
import Mathlib
/-
Re(a* * exp(i*phi)) = |a| * cos(phi - arg(a))
-/
theorem problem_112_part_one (a : C) (phi : R) :
(star a * Complex.exp (Complex.I * phi)).re =
||a|| * Real.cos (phi - Complex.arg a) := by
norm_num [Complex.exp_re, Complex.exp_im, Real.cos_sub]
rw [<- Complex.norm_mul_cos_arg, <- Complex.norm_mul_sin_arg]
ring
3.4 What Aristotle Fixed
Before (LLM)
conj a→ wrong identifierComplex.abs a→ wrong APIsorry→ no proof
After (Aristotle)
star a→ correct Mathlib||a||→ norm notation- Complete 3-line proof
4. More Complex Example: Problem 10
Problem 10 concerns velocity and acceleration in polar coordinates—a multi-part theorem requiring chain rule, product rule, and careful bookkeeping.
Let $z = re^{i\theta}$ where $r, \theta$ are functions of time $t$. Compute the radial and tangential components of velocity and acceleration.
Aristotle produced a 110-line verified proof establishing:
- Velocity: $v_r = \dot{r}$, $v_\theta = r\dot{\theta}$
- Acceleration: $a_r = \ddot{r} - r\dot{\theta}^2$, $a_\theta = r\ddot{\theta} + 2\dot{r}\dot{\theta}$
The proof uses ContDiffAt hypotheses, chain rule for complex exponentials, and careful
manipulation of derivatives—exactly the kind of intricate reasoning that LLMs struggle to formalize correctly.
5. Key Insights
5.1 Error Correction is Crucial
Most LLM formalizations have small errors: wrong identifier names, type mismatches, missing imports. Aristotle's ability to automatically correct these errors is what enables 100% success.
5.2 The Bottleneck Has Shifted
| Stage | Before Aristotle | After Aristotle |
|---|---|---|
| Informal reasoning | 95% (solved) | 95% (solved) |
| Formalization | 2.8% (bottleneck) | ~99% (with Aristotle) |
| New bottleneck | - | Submission throughput |
5.3 Implications for Mathematical AI
- LLM + Verifier pipelines work: Imperfect LLM output can be refined by specialized tools
- Formal verification is tractable: Classical analysis problems are within reach
- The "30x gap" can be bridged: Not by better LLMs alone, but by better tooling
6. Data Release
We are releasing all Aristotle-verified proofs as a resource for the formal mathematics community.
Contains: 80 verified Lean 4 proofs, original problem statements, LLM formalizations, and Aristotle outputs.
7. Next Steps
- Complete verification: Submit remaining 238 problems to Aristotle
- Mathlib contributions: Identify lemmas that should be upstreamed
- Broader benchmarks: Apply pipeline to other mathematical texts
Acknowledgments
Thanks to the Aristotle team for their remarkable tool, and to the Mathlib community for building the foundation that makes this work possible.
Report updated January 12, 2026. Verification ongoing—check the repository for latest results.