
Frontier LLMs Refuse Too Little,
or RLHF Considered Harmful

A short essay on something I noticed while analyzing advanced mathematics benchmarks, and the tool we built to do something about it.

The complaint most often levelled at frontier LLMs is that they refuse too much. They won't write the joke, won't speculate, won't quote the text. The consensus in alignment discussions is that models have been trained, by reinforcement learning from human feedback (RLHF), to err on the side of declining the request.

In research mathematics I have noticed the opposite. The frontier models — GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro, all the names — refuse far too little. Specifically, when a prompt is malformed, they don't refuse; they fill in the blanks and produce a confident wrong answer.

This is a different failure mode from the standard one and it has been hiding in plain sight, because it only shows up when prompts are deliberately or accidentally under-specified. Casual chat is robust against it. Research mathematics is not.

I'll show three pieces of evidence, then describe what we built to work around the problem.

The evidence

Finding 1 — Silent typo patching

Consider a math problem that begins with the phrase

Compute the order of the monodromy group of $dY/dz = AY$, where (giant matrix expression).

A diligent human reader halts, and asks: what does the expression refer to? $A$? $\log A$? $A^2$? Something else entirely? There is a high probability that the expression refers to $A$, but until this is known with certainty, we cannot begin to solve the problem. A very sloppy human reader infers, without a moment's hesitation, that the matrix expression is the value of $A$. The author wrote "where" and forgot the "$A =$". The prompt has a typo. The sloppy solver patches it silently.

A frontier LLM does the same thing as the sloppy solver. It silently patches the typo, treats the matrix as $A$, and computes a monodromy order. Sometimes the answer is right; sometimes it is wrong; either way the failure mode is invisible. The model never tells you that it filled in a definition. From the outside it looks like the model solved the problem. From the inside it solved some problem, one whose precise statement it constructed on the fly.

Finding 2 — Extended thinking amplifies the problem

Consider a more substantive case:

Let $C$ be a smooth projective curve. Consider the moduli space of stable pairs $(s, E)$, where $E$ is a torsion-free coherent sheaf of rank $r$ on $C$ and $s\colon \mathcal{O}_C^r \to E$ has zero-dimensional cokernel of length $\ell$. What is the dimension of the fiber of the natural projection to $\mathrm{Hilb}^\ell(C)$?

This prompt is under-specified in at least four ways that affect the answer. "Stable pairs" has at least three inequivalent definitions in the literature (Le Potier, Bradlow, Pandharipande–Thomas). The "natural projection" is referenced but never constructed. The base field is unstated. The genus of $C$ is unstated.

We ran the same prompt through nine different validator models. Eight of them behaved like the diligent reader: they produced substantive lists of these defects and asked a clarifying question, refusing to commit to an answer until the prompt was made precise. The ninth — the one model that declared the prompt well-posed and emitted a "solution" — behaved like the sloppy reader, only with more resources at its disposal. It was the frontier model running at the highest reasoning effort. Extended thinking, in that one case, did not surface the ambiguity. It used the thinking budget to silently resolve the ambiguity by picking a default reading, and then answered.

This is the inverse of what you would naively expect from "deeper reasoning." More thinking, in this experiment, produced more sloppy-reader behavior, not less.

Finding 3 — More than half of benchmark problems are not well-posed

We took close to 200 problems from a curated commercial research-mathematics benchmark: problems that had been written by mathematicians, reviewed by other mathematicians, and finally accepted as valid by the model developers. We ran each one through a separate well-formedness validator, independent of the solver path.

More than half of the problems were flagged as not well-posed. Of those flagged, about a third were fatal: a required symbol or piece of data was missing entirely, and the problem cannot be answered as stated.

About half of the flagged problems were substantive: ambiguous conventions, missing parameters, unspecified equivalence relations. The remaining minority were cosmetic typos or "intentional under-specification" where the curator was implicitly asking for the tightest-known bound rather than a unique number.
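As a quick sanity check on that breakdown, the arithmetic can be written out. The sketch below uses only the rounded shares quoted in the text (about a third fatal, about a half substantive); these are approximations, not exact counts from the study, and the variable names are illustrative.

```python
from fractions import Fraction

# Rounded shares of the *flagged* problems, as quoted in the text.
# These are approximations, not exact counts from the study.
fatal = Fraction(1, 3)
substantive = Fraction(1, 2)

# Whatever is left over is the cosmetic / intentionally under-specified group.
remaining = 1 - fatal - substantive

print(f"remaining share of flagged problems: {remaining}")  # 1/6
```

So "the remaining minority" is on the order of one flagged problem in six, assuming the rounded shares are taken at face value.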

In no case, looking at the data, did the original solver answer "I cannot answer this; please clarify." It always answered. Sometimes the answer matched the curator's intent. Sometimes it didn't. The benchmark scoring methodology cannot tell the two cases apart, because it scores "wrong" identically to "answered a different question than the curator was asking."
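The conflation can be made concrete with a toy scorer. This is a minimal sketch, assuming a pass/fail scoring rule; the `Outcome` labels and `binary_score` function are hypothetical names introduced for illustration, not the benchmark's actual scoring code.

```python
from enum import Enum


class Outcome(Enum):
    """Hypothetical labels for what a solver actually did on one problem."""
    CORRECT = "correct"                      # intended question, right answer
    WRONG = "wrong"                          # intended question, wrong answer
    DIFFERENT_QUESTION = "different"         # answered a reading the curator did not intend
    ASKED_FOR_CLARIFICATION = "clarify"      # declined to commit on an ill-posed prompt


def binary_score(outcome: Outcome) -> int:
    """Assumed pass/fail scoring: 1 for a match with the reference, 0 otherwise."""
    return 1 if outcome is Outcome.CORRECT else 0


# Every non-matching outcome collapses to the same score.
for outcome in Outcome:
    print(outcome.name, binary_score(outcome))
```

Under this kind of rule, a solver that answered a different question is indistinguishable from one that made a mathematical error, and a clarifying question earns no more than either.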

Why this happens

Three mechanisms compound:

RLHF rewards helpfulness. Annotators rate responses that produce an answer above responses that ask a clarifying question, almost regardless of context. A model that refuses on ambiguous input scores lower than a model that picks a plausible reading and proceeds. The reward signal is the same in research-mathematics contexts as in casual chat — but the appropriate behavior is very different.

Structured output (JSON-mode and friends) is a commitment device, not an escape hatch. This was a surprising negative result for us. When we wrapped the solver in a JSON schema with a free-form final_answer field — exactly the shape that should allow refusal as a value of final_answer — the refusal rate dropped relative to free-form bare prompting. The schema, by demanding a value for every field, pressures the model to commit to a number rather than to express uncertainty. Practitioners using structured outputs for evaluation pipelines may be silently suppressing the very behavior they think they are encouraging.

Deeper reasoning amplifies DWIM ("do what I mean") rather than suppressing it. At higher reasoning effort the model uses its thinking budget to resolve ambiguity by picking standard conventions, rather than to flag the ambiguity. The Opus-4.7-thinking case described above is one clean example; the same pattern shows up across other models when the reasoning effort knob is turned up.
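To make the structured-output mechanism concrete, here is a minimal sketch of the kind of schema described above, next to a hypothetical variant with an explicit status field. Only `final_answer` comes from the text; every other field name, both schema constants, and the `missing_required` helper are assumptions for illustration. Whether the variant changes model behavior is not something this essay measures.

```python
# Shape described in the text: every field is required and `final_answer`
# is free-form. Field names other than `final_answer` are hypothetical.
COMMIT_SCHEMA = {
    "type": "object",
    "properties": {
        "reasoning": {"type": "string"},
        "final_answer": {"type": "string"},
    },
    "required": ["reasoning", "final_answer"],
}

# Hypothetical variant: declining is a first-class value, and
# `final_answer` is optional. Not something tested in the essay.
REFUSAL_AWARE_SCHEMA = {
    "type": "object",
    "properties": {
        "status": {"enum": ["answered", "needs_clarification"]},
        "clarifying_question": {"type": "string"},
        "final_answer": {"type": "string"},
    },
    "required": ["status"],
}


def missing_required(schema: dict, response: dict) -> list[str]:
    """Return the required fields that the response leaves out."""
    return [key for key in schema["required"] if key not in response]


# A response that declines to answer and asks a question instead.
refusal = {"status": "needs_clarification", "clarifying_question": "What is A?"}

print(missing_required(COMMIT_SCHEMA, refusal))         # ['reasoning', 'final_answer']
print(missing_required(REFUSAL_AWARE_SCHEMA, refusal))  # []
```

Under the first schema, the only way to decline is to write the refusal into `final_answer` itself, which is exactly the field that reads like a demand for a number.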

These mechanisms are not misalignment; they are alignment working as designed for one set of use cases and producing the wrong behavior for another. Research-grade mathematics is the wrong use case. Precision is the point. Filling in undefined symbols silently is not a feature.

What we built

Polya is a small validator that you can use today, free, at polya.dimensionreducers.ai. The architecture is intentionally simple.

You paste a math problem. Polya runs a fast well-formedness pass on it, using whichever validator you pick from a dropdown or, in "Pro" mode, running three validators in parallel and reconciling their findings. If the problem is well-formed, Polya gives you back a cleaned-up version of your prompt. If it isn't, Polya asks you a single plain-English clarifying question (what's $A$? what stability condition? which base field?) and lets you answer. Most clarification loops resolve in one or two turns, because most users have no idea their prompt is malformed, and being told that it is goes most of the way to fixing it.
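The "Pro" mode flow can be sketched roughly as follows. This is not Polya's implementation: the `Verdict` type, the stub validators, and especially the reconciliation rule (the prompt passes only if every validator passes it, and defects are merged without duplicates) are assumptions chosen to keep the example small.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Verdict:
    well_formed: bool
    defects: list[str] = field(default_factory=list)


Validator = Callable[[str], Verdict]


def reconcile(verdicts: list[Verdict]) -> Verdict:
    """Assumed rule: pass only on unanimity; merge defects, dropping duplicates."""
    defects: list[str] = []
    for verdict in verdicts:
        for defect in verdict.defects:
            if defect not in defects:
                defects.append(defect)
    return Verdict(all(v.well_formed for v in verdicts), defects)


def validate_in_parallel(prompt: str, validators: list[Validator]) -> Verdict:
    """Run every validator on the same prompt concurrently, then reconcile."""
    with ThreadPoolExecutor(max_workers=len(validators)) as pool:
        verdicts = list(pool.map(lambda validate: validate(prompt), validators))
    return reconcile(verdicts)


# Stub validators standing in for real model calls (hypothetical).
def lenient(prompt: str) -> Verdict:
    return Verdict(True)


def checks_definition(prompt: str) -> Verdict:
    if "A =" in prompt:
        return Verdict(True)
    return Verdict(False, ["symbol A is never defined"])


prompt = "Compute the order of the monodromy group of dY/dz = AY, where (giant matrix expression)."
result = validate_in_parallel(prompt, [lenient, checks_definition, checks_definition])
print(result.well_formed, result.defects)  # False ['symbol A is never defined']
```

The unanimity rule is deliberately conservative: a single validator that behaves like the sloppy reader cannot wave a prompt through on its own.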

Polya does not solve the problem. It produces a fully-specified version of your prompt that you can paste into the strong solver of your choice — Claude, ChatGPT, Gemini, whatever — and get back an answer that is at least addressed to the problem you actually meant to ask. The economics work because validation is cheap; the expensive part stays under your control.

Polya also looks up cited references. If you put an arXiv identifier in your prompt, Polya fetches the abstract and inlines it into the cleaned-up version, so downstream solvers — even ones whose training cutoff predates the paper — have the relevant context. Web search via a separate API handles named theorems and topics. Everything Polya retrieved is shown in a panel, so you can verify the citation matches before accepting the cleaned prompt.
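A rough sketch of the first step of such a lookup, finding arXiv identifiers in a prompt and building a query URL for arXiv's public API, might look like this. It is an illustration under stated assumptions, not Polya's code: the function names are hypothetical, the regex only covers new-style identifiers (`YYMM.NNNN` or `YYMM.NNNNN`), the identifier in the example is a placeholder rather than a real citation, and no network request is made.

```python
import re
from urllib.parse import urlencode

# New-style arXiv identifiers, with an optional version suffix such as "v2".
# Old-style identifiers (e.g. "math/0601001") are not handled in this sketch.
ARXIV_ID = re.compile(r"\b(\d{4}\.\d{4,5})(v\d+)?\b")


def find_arxiv_ids(prompt: str) -> list[str]:
    """Return the distinct identifiers found, in order, with versions stripped."""
    found: list[str] = []
    for match in ARXIV_ID.finditer(prompt):
        if match.group(1) not in found:
            found.append(match.group(1))
    return found


def arxiv_query_url(ids: list[str]) -> str:
    """Build a query URL for the arXiv API's `id_list` parameter."""
    query = urlencode({"id_list": ",".join(ids)}, safe=",")
    return "http://export.arxiv.org/api/query?" + query


# "1234.56789" is a placeholder identifier, not a real paper.
prompt = "Using the construction of arXiv:1234.56789v2, compute the dimension."
ids = find_arxiv_ids(prompt)
print(ids)                   # ['1234.56789']
print(arxiv_query_url(ids))  # http://export.arxiv.org/api/query?id_list=1234.56789
```

The response to such a query is an Atom feed whose entries carry the abstract, which is the piece a tool like this would inline into the cleaned prompt.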

Try Polya on one of your harder prompts — free during beta.


What we haven't shown yet

This is a short essay. The full empirical work is in progress, and the preprint will appear on arXiv in a few weeks. Specifically not yet pinned down: cross-model generalization of the schema-suppresses-refusal effect, an adversarial study where well-formed problems are deliberately corrupted to measure detection rate, and manual adjudication of the claimed "reference-wrong" cases by a second mathematician. Treat the numbers in this essay as the headline rather than the complete picture.

A friendly suggestion

If you work for one of the AI-training-data companies that builds curated math benchmarks: try Polya on one of your harder prompts. Genuinely try. Pick something complex and reviewed.

If the validator declares it well-formed: congratulations.

If it doesn't: that's worth a minute of your time.


Polya is live at polya.dimensionreducers.ai. The full paper is coming.