Can AI make progress on important, unsolved mathematical problems? Large language models are now capable of sophisticated mathematical and scientific reasoning, but whether they can perform novel research is still widely debated and underexplored. We introduce HorizonMath, a benchmark of over 100 problems spanning 8 domains from computational and applied mathematics research that are predominantly unsolved, paired with an open-source evaluation framework for automated verification. Our benchmark targets a class of problems where discovery is hard, requiring meaningful mathematical insight, but verification is computationally efficient and simple. Because these solutions are unknown, HorizonMath is immune to data contamination, and state-of-the-art models score near 0%. With this platform, we find two problems for which GPT 5.4 Pro proposes novel solutions that potentially improve on the best-known baseline results published in the literature. We release this benchmark as an open challenge where solutions to even a single problem would constitute a novel result in the mathematical literature.
101 problems across 8 mathematical domains. Problems without known solutions cannot exist in any training corpus—any correct solution signals true reasoning and autonomous discovery.
| Level | Description | Count |
|---|---|---|
| 0 | Calibration (solved) | 10 |
| 1 | Likely solvable | 23 |
| 2 | Challenging | 60 |
| 3 | Likely unsolvable | 8 |
| Domain | Count |
|---|---|
| Number Theory | 17 |
| Discrete Geometry | 15 |
| Lattice Models | 15 |
| Continuum Physics | 13 |
| Combinatorics | 12 |
| Integrals | 11 |
| Coding Theory | 9 |
| Mathematical Constants | 9 |
We evaluate three frontier models on 101 research-level math problems. State-of-the-art models score near 0%, but GPT 5.4 Pro produces 2 potentially novel mathematical results that improve on published baselines.
| Model | Full Dataset | Level 0 (Calibration) | Novel Solutions |
|---|---|---|---|
| GPT 5.4 Pro | 7% | 50% | 2 |
| Claude Opus 4.6 | 3% | 30% | 0 |
| Gemini 3.1 Pro | 3% | 30% | 0 |
GPT 5.4 Pro achieved c \approx 3.7296, improving on the best-known c \approx 3.7992 in R(k,k) \le c^{k+o(k)} by tuning the correction polynomial and piecewise-constant functions within the Gupta–Ndiaye–Norin–Wei framework.
Reduced the union area by approximately 8.44% over the Keich baseline using a Haar-style overlap pattern with locally optimized block coefficients.
HorizonMath targets problems where discovery is hard but verification is efficient. Here are three representative problems illustrating the range of domains and evaluation modes.
Laurenzi, B. J. (1993). Moment integrals of powers of Airy functions. Z. angew. Math. Phys., 44, 891–908.
The Airy power moments are defined by:
Closed forms are known for small n, but the fifth moment a_5 \approx 0.001349358983\dots lacks a known symbolic expression. The model must find a closed-form expression that matches the ground truth to at least 20 decimal digits.
def proposed_solution():
from mpmath import mp
mp.dps = 100
# Use only mpmath constants and special functions
result = ... # symbolic closed-form expression
return result
Falloon, P. E., Abbott, P. C., & Wang, J. B. (2003). Theory and computation of spheroidal wavefunctions. J. Phys. A, 36(20), 5477–5495.
Let c \ge 0 be a real parameter. Consider the Sturm–Liouville eigenvalue problem on (-1,1):
with the boundary condition that y(x) remains bounded as x \to \pm 1. The spectrum is discrete and real:
and \lambda_n(0) = n(n+1). General symbolic closed forms remain undiscovered. Task: Find a symbolic closed-form expression for \lambda_n(c) valid for general integer n \ge 0 and real c \ge 0.
def proposed_solution(n, c):
from mpmath import mp
mp.dps = 100
result = ... # closed-form for lambda_n(c)
return result
Shehadeh, M., Kingsford, W., & Kschischang, F. R. (2026). New Difference Triangle Sets by an FPGA-Based Search Technique. J. Combin. Des., 34(1).
An (n,k)-DTS is an n \times (k+1) array A with entries a_{i,j} such that each row is strictly increasing and normalized:
Define the set of positive within-row differences:
All elements of D must be distinct. The scope is m(A) = \max_{i,j} a_{i,j}. Task: Find a valid (7,5)-DTS with scope strictly less than the current best-known upper bound m(7,5) \le 112.
def proposed_solution():
return {
"n": 7, "k": 5,
"rows": [
[0, a01, a02, a03, a04, a05],
... # 7 rows total
]
}
| Dataset | Problems | Unsolved | Open Eval | Auto-Verify |
|---|---|---|---|---|
| FrontierMath: Open Problems | 14 | ✓ | ✗ | ✓ |
| IMProofBench | 39 | ✗ | ✗ | ✗ |
| First Proof | 10 | ✗ | ✗ | ✗ |
| Erdős Problems | 1,000+ | ✓ | ✗ | ✗ |
| IMO-AnswerBench | 400 | ✗ | ✓ | ✓ |
| Optimization Constants | 96 | ✓ | ✗ | ✓ |
| HorizonMath | 101 | ✓ | ✓ | ✓ |
@article{wang2026horizonmathmeasuringaiprogress,
title={HorizonMath: Measuring AI Progress Toward Mathematical Discovery with Automatic Verification},
author={Erik Y. Wang and Sumeet Motwani and James V. Roggeveen and Eliot Hodges and Dulhan Jayalath and Charles London and Kalyan Ramakrishnan and Flaviu Cipcigan and Philip Torr and Alessandro Abate},
year={2026},
eprint={2603.15617},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2603.15617},
}