← Previous: Initial Benchmark Report

Polya-Szego + Aristotle

From 2.8% to 100%: Automated Proof Verification with Aristotle

January 12, 2026 • Research Update • 8 min read

Key Result

Using Aristotle, an automated Lean proof assistant, we achieved 100% verification success on 80 submitted Polya-Szego formalizations. Aristotle automatically corrects formalization errors and synthesizes complete, machine-checkable proofs.

1. Recap: The Formalization Gap

In our initial benchmark, we found that frontier LLMs achieve 95%+ accuracy on informal mathematical reasoning but only 2.8% complete Lean proofs. We called this the "30x gap"—models understand mathematics but cannot produce formally verified proofs.

95%
Informal Reasoning
2.8%
LLM Proofs (Before)
100%
Aristotle Verified
80
Problems Verified

2. Enter Aristotle

Aristotle is an automated theorem prover that takes informal mathematical statements (or rough Lean formalizations) and produces complete, verified Lean 4 proofs. Unlike traditional proof assistants that require expert guidance, Aristotle operates autonomously.

2.1 Our Pipeline

  1. LLM Formalization: DeepSeek V3.2 generates initial Lean 4 statements from Polya-Szego problems
  2. Aristotle Submission: Each formalization is submitted with problem context
  3. Autonomous Proving: Aristotle corrects errors and synthesizes proofs
  4. Verification: Output is machine-checkable Lean 4 code

2.2 Results

MetricValue
Problems submitted80
Successfully verified80 (100%)
Remaining to submit238
Total formalizations318
100% success rate on all 80 submitted problems. Aristotle corrected formalization errors (identifier names, type mismatches) and produced complete proofs.

3. Worked Example: Problem 112

Let's trace a complete example from original problem to verified proof.

3.1 Original Problem (Polya-Szego)

Problem 112 (Part One, Chapter 3):
Prove that $\text{Re}(\bar{a} \cdot e^{i\varphi}) = |a| \cos(\varphi - \arg a)$ for any complex number $a$ and real $\varphi$.

3.2 LLM Formalization (DeepSeek V3.2)

The initial formalization had identifier errors (Complex.abs instead of norm, conj instead of star):

theorem problem_112_part_one :
    forall (a : C) (phi : R),
    Complex.re (conj a * Complex.exp (Complex.I * phi)) =
    Complex.abs a * Real.cos (phi - Complex.arg a) := by
  sorry

3.3 Aristotle Output

Aristotle automatically corrected the identifiers and produced a complete proof:

/-
This file was generated by Aristotle.
Lean version: leanprover/lean4:v4.24.0
Mathlib version: f897ebcf72cd16f89ab4577d0c826cd14afaafc7
-/

import Mathlib

/-
Re(a* * exp(i*phi)) = |a| * cos(phi - arg(a))
-/
theorem problem_112_part_one (a : C) (phi : R) :
    (star a * Complex.exp (Complex.I * phi)).re =
    ||a|| * Real.cos (phi - Complex.arg a) := by
  norm_num [Complex.exp_re, Complex.exp_im, Real.cos_sub]
  rw [<- Complex.norm_mul_cos_arg, <- Complex.norm_mul_sin_arg]
  ring

3.4 What Aristotle Fixed

Before (LLM)

  • conj a → wrong identifier
  • Complex.abs a → wrong API
  • sorry → no proof

After (Aristotle)

  • star a → correct Mathlib
  • ||a|| → norm notation
  • Complete 3-line proof

4. More Complex Example: Problem 10

Problem 10 concerns velocity and acceleration in polar coordinates—a multi-part theorem requiring chain rule, product rule, and careful bookkeeping.

Problem 10 (Part Three, Chapter 1):
Let $z = re^{i\theta}$ where $r, \theta$ are functions of time $t$. Compute the radial and tangential components of velocity and acceleration.

Aristotle produced a 110-line verified proof establishing:

The proof uses ContDiffAt hypotheses, chain rule for complex exponentials, and careful manipulation of derivatives—exactly the kind of intricate reasoning that LLMs struggle to formalize correctly.

5. Key Insights

5.1 Error Correction is Crucial

Most LLM formalizations have small errors: wrong identifier names, type mismatches, missing imports. Aristotle's ability to automatically correct these errors is what enables 100% success.

5.2 The Bottleneck Has Shifted

StageBefore AristotleAfter Aristotle
Informal reasoning95% (solved)95% (solved)
Formalization2.8% (bottleneck)~99% (with Aristotle)
New bottleneck-Submission throughput

5.3 Implications for Mathematical AI

  1. LLM + Verifier pipelines work: Imperfect LLM output can be refined by specialized tools
  2. Formal verification is tractable: Classical analysis problems are within reach
  3. The "30x gap" can be bridged: Not by better LLMs alone, but by better tooling

6. Data Release

We are releasing all Aristotle-verified proofs as a resource for the formal mathematics community.

Repository: github.com/igorrivin/polya-szego-lean
Contains: 80 verified Lean 4 proofs, original problem statements, LLM formalizations, and Aristotle outputs.

7. Next Steps


Acknowledgments

Thanks to the Aristotle team for their remarkable tool, and to the Mathlib community for building the foundation that makes this work possible.

Report updated January 12, 2026. Verification ongoing—check the repository for latest results.