HorizonMath

Measuring AI Progress Toward Mathematical Discovery with Automatic Verification

1University of Oxford   2Benchmark   3Harvard University   4Princeton University   5Ellison Institute of Technology

*Corresponding authors

Abstract

Can AI make progress on important, unsolved mathematical problems? Large language models are now capable of sophisticated mathematical and scientific reasoning, but whether they can perform novel research is still widely debated and underexplored. We introduce HorizonMath, a benchmark of over 100 problems spanning 8 domains from computational and applied mathematics research that are predominantly unsolved, paired with an open-source evaluation framework for automated verification. Our benchmark targets a class of problems where discovery is hard, requiring meaningful mathematical insight, but verification is computationally efficient and simple. Because these solutions are unknown, HorizonMath is immune to data contamination, and state-of-the-art models score near 0%. With this platform, we find two problems for which GPT 5.4 Pro proposes novel solutions that potentially improve on the best-known baseline results published in the literature. We release this benchmark as an open challenge where solutions to even a single problem would constitute a novel result in the mathematical literature.

Benchmark Overview

101 problems across 8 mathematical domains. Problems without known solutions cannot exist in any training corpus—any correct solution signals true reasoning and autonomous discovery.

101 Problems
8 Domains
91 Unsolved
3 Eval Modes

Solvability Levels

LevelDescriptionCount
0Calibration (solved)10
1Likely solvable23
2Challenging60
3Likely unsolvable8

Mathematical Domains

DomainCount
Number Theory17
Discrete Geometry15
Lattice Models15
Continuum Physics13
Combinatorics12
Integrals11
Coding Theory9
Mathematical Constants9

Key Results

We evaluate three frontier models on 101 research-level math problems. State-of-the-art models score near 0%, but GPT 5.4 Pro produces 2 potentially novel mathematical results that improve on published baselines.

Model Full Dataset Level 0 (Calibration) Novel Solutions
GPT 5.4 Pro 7% 50% 2
Claude Opus 4.6 3% 30% 0
Gemini 3.1 Pro 3% 30% 0
Novel Result

Asymptotic Upper Bound for Diagonal Ramsey Numbers

GPT 5.4 Pro achieved c \approx 3.7296, improving on the best-known c \approx 3.7992 in R(k,k) \le c^{k+o(k)} by tuning the correction polynomial and piecewise-constant functions within the Gupta–Ndiaye–Norin–Wei framework.

Novel Result

Thin-Triangle Kakeya (128 slopes)

Reduced the union area by approximately 8.44% over the Keich baseline using a Haar-style overlap pattern with locally optimized block coefficients.

Example Problems

HorizonMath targets problems where discovery is hard but verification is efficient. Here are three representative problems illustrating the range of domains and evaluation modes.

Constant Discovery Solvability 2 ground_truth_computable

Fifth Moment of the Airy Function (a_5)

Laurenzi, B. J. (1993). Moment integrals of powers of Airy functions. Z. angew. Math. Phys., 44, 891–908.

The Airy power moments are defined by:

a_n = \int_0^\infty \mathrm{Ai}(x)^n \, dx

Closed forms are known for small n, but the fifth moment a_5 \approx 0.001349358983\dots lacks a known symbolic expression. The model must find a closed-form expression that matches the ground truth to at least 20 decimal digits.

Show expected output format
def proposed_solution():
    from mpmath import mp
    mp.dps = 100
    # Use only mpmath constants and special functions
    result = ...  # symbolic closed-form expression
    return result
Function Discovery Solvability 2 ground_truth_computable

Angular Prolate Spheroidal Eigenvalues (order m = 0)

Falloon, P. E., Abbott, P. C., & Wang, J. B. (2003). Theory and computation of spheroidal wavefunctions. J. Phys. A, 36(20), 5477–5495.

Let c \ge 0 be a real parameter. Consider the Sturm–Liouville eigenvalue problem on (-1,1):

-\frac{d}{dx}\left((1-x^2)\frac{dy}{dx}\right) + c^2 x^2 \, y(x) = \lambda \, y(x), \qquad -1 < x < 1

with the boundary condition that y(x) remains bounded as x \to \pm 1. The spectrum is discrete and real:

0 \le \lambda_0(c) < \lambda_1(c) < \lambda_2(c) < \cdots

and \lambda_n(0) = n(n+1). General symbolic closed forms remain undiscovered. Task: Find a symbolic closed-form expression for \lambda_n(c) valid for general integer n \ge 0 and real c \ge 0.

Show expected output format
def proposed_solution(n, c):
    from mpmath import mp
    mp.dps = 100
    result = ...  # closed-form for lambda_n(c)
    return result
Construction Solvability 1 benchmark_best_known

Minimum-Scope Difference Triangle Set (7,5)

Shehadeh, M., Kingsford, W., & Kschischang, F. R. (2026). New Difference Triangle Sets by an FPGA-Based Search Technique. J. Combin. Des., 34(1).

An (n,k)-DTS is an n \times (k+1) array A with entries a_{i,j} such that each row is strictly increasing and normalized:

0 = a_{i,0} < a_{i,1} < \cdots < a_{i,k}

Define the set of positive within-row differences:

D = \{ a_{i,j} - a_{i,j'} : \text{for all } i, \text{ and } 0 \le j' < j \le k \}

All elements of D must be distinct. The scope is m(A) = \max_{i,j} a_{i,j}. Task: Find a valid (7,5)-DTS with scope strictly less than the current best-known upper bound m(7,5) \le 112.

Show expected output format
def proposed_solution():
    return {
        "n": 7, "k": 5,
        "rows": [
            [0, a01, a02, a03, a04, a05],
            ...  # 7 rows total
        ]
    }

Comparison with Existing Benchmarks

Dataset Problems Unsolved Open Eval Auto-Verify
FrontierMath: Open Problems14
IMProofBench39
First Proof10
Erdős Problems1,000+
IMO-AnswerBench400
Optimization Constants96
HorizonMath101

BibTeX

@article{wang2026horizonmathmeasuringaiprogress,
      title={HorizonMath: Measuring AI Progress Toward Mathematical Discovery with Automatic Verification},
      author={Erik Y. Wang and Sumeet Motwani and James V. Roggeveen and Eliot Hodges and Dulhan Jayalath and Charles London and Kalyan Ramakrishnan and Flaviu Cipcigan and Philip Torr and Alessandro Abate},
      year={2026},
      eprint={2603.15617},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2603.15617},
}