HorizonMath: Measuring AI Progress Toward Mathematical Discovery

Abstract

Can AI make progress on important, unsolved mathematical problems? Large language models are now capable of sophisticated mathematical and scientific reasoning, but whether they can perform novel research is still widely debated and underexplored. We introduce HorizonMath, a benchmark of over 100 problems spanning 8 domains from computational and applied mathematics research that are predominantly unsolved, paired with an open-source evaluation framework for automated verification. Our benchmark targets a class of problems where discovery is hard, requiring meaningful mathematical insight, but verification is computationally efficient and simple. Because these solutions are unknown, HorizonMath is immune to data contamination, and state-of-the-art models score near 0%. With this platform, we find two problems for which GPT 5.4 Pro proposes novel solutions that potentially improve on the best-known baseline results published in the literature. We release this benchmark as an open challenge where solutions to even a single problem would constitute a novel result in the mathematical literature.

Benchmark Overview

101 problems across 8 mathematical domains. Problems without known solutions cannot exist in any training corpus—any correct solution signals true reasoning and autonomous discovery.

101 Problems

8 Domains

91 Unsolved

3 Eval Modes

Solvability Levels

Level	Description	Count
0	Calibration (solved)	10
1	Likely solvable	23
2	Challenging	60
3	Likely unsolvable	8

Mathematical Domains

Domain	Count
Number Theory	17
Discrete Geometry	15
Lattice Models	15
Continuum Physics	13
Combinatorics	12
Integrals	11
Coding Theory	9
Mathematical Constants	9

Key Results

We evaluate three frontier models on 101 research-level math problems. State-of-the-art models score near 0%, but GPT 5.4 Pro produces 2 potentially novel mathematical results that improve on published baselines.

Model	Full Dataset	Level 0 (Calibration)	Novel Solutions
GPT 5.4 Pro	7%	50%	2
Claude Opus 4.6	3%	30%	0
Gemini 3.1 Pro	3%	30%	0

Novel Result

Asymptotic Upper Bound for Diagonal Ramsey Numbers

GPT 5.4 Pro achieved c \approx 3.7296, improving on the best-known c \approx 3.7992 in R(k,k) \le c^{k+o(k)} by tuning the correction polynomial and piecewise-constant functions within the Gupta–Ndiaye–Norin–Wei framework.

Novel Result

Thin-Triangle Kakeya (128 slopes)

Reduced the union area by approximately 8.44% over the Keich baseline using a Haar-style overlap pattern with locally optimized block coefficients.

Example Problems

HorizonMath targets problems where discovery is hard but verification is efficient. Here are three representative problems illustrating the range of domains and evaluation modes.

Constant Discovery Solvability 2 ground_truth_computable

Fifth Moment of the Airy Function (a_5)

Laurenzi, B. J. (1993). Moment integrals of powers of Airy functions. Z. angew. Math. Phys., 44, 891–908.

The Airy power moments are defined by:

a_n = \int_0^\infty \mathrm{Ai}(x)^n \, dx

Closed forms are known for small n, but the fifth moment a_5 \approx 0.001349358983\dots lacks a known symbolic expression. The model must find a closed-form expression that matches the ground truth to at least 20 decimal digits.

Show expected output format

def proposed_solution():
    from mpmath import mp
    mp.dps = 100
    # Use only mpmath constants and special functions
    result = ...  # symbolic closed-form expression
    return result

Function Discovery Solvability 2 ground_truth_computable

Angular Prolate Spheroidal Eigenvalues (order m = 0)

Falloon, P. E., Abbott, P. C., & Wang, J. B. (2003). Theory and computation of spheroidal wavefunctions. J. Phys. A, 36(20), 5477–5495.

Let c \ge 0 be a real parameter. Consider the Sturm–Liouville eigenvalue problem on (-1,1):

-\frac{d}{dx}\left((1-x^2)\frac{dy}{dx}\right) + c^2 x^2 \, y(x) = \lambda \, y(x), \qquad -1 < x < 1

with the boundary condition that y(x) remains bounded as x \to \pm 1. The spectrum is discrete and real:

0 \le \lambda_0(c) < \lambda_1(c) < \lambda_2(c) < \cdots

and \lambda_n(0) = n(n+1). General symbolic closed forms remain undiscovered. Task: Find a symbolic closed-form expression for \lambda_n(c) valid for general integer n \ge 0 and real c \ge 0.

Show expected output format

def proposed_solution(n, c):
    from mpmath import mp
    mp.dps = 100
    result = ...  # closed-form for lambda_n(c)
    return result

Construction Solvability 1 benchmark_best_known

Minimum-Scope Difference Triangle Set (7,5)

Shehadeh, M., Kingsford, W., & Kschischang, F. R. (2026). New Difference Triangle Sets by an FPGA-Based Search Technique. J. Combin. Des., 34(1).

An (n,k)-DTS is an n \times (k+1) array A with entries a_{i,j} such that each row is strictly increasing and normalized:

0 = a_{i,0} < a_{i,1} < \cdots < a_{i,k}

Define the set of positive within-row differences:

D = \{ a_{i,j} - a_{i,j'} : \text{for all } i, \text{ and } 0 \le j' < j \le k \}

All elements of D must be distinct. The scope is m(A) = \max_{i,j} a_{i,j}. Task: Find a valid (7,5)-DTS with scope strictly less than the current best-known upper bound m(7,5) \le 112.

Show expected output format

def proposed_solution():
    return {
        "n": 7, "k": 5,
        "rows": [
            [0, a01, a02, a03, a04, a05],
            ...  # 7 rows total
        ]
    }

Comparison with Existing Benchmarks

Dataset	Problems	Unsolved	Open Eval	Auto-Verify
FrontierMath: Open Problems	14	✓	✗	✓
IMProofBench	39	✗	✗	✗
First Proof	10	✗	✗	✗
Erdős Problems	1,000+	✓	✗	✗
IMO-AnswerBench	400	✗	✓	✓
Optimization Constants	96	✓	✗	✓
HorizonMath	101	✓	✓	✓

BibTeX

@article{wang2026horizonmathmeasuringaiprogress,
      title={HorizonMath: Measuring AI Progress Toward Mathematical Discovery with Automatic Verification},
      author={Erik Y. Wang and Sumeet Motwani and James V. Roggeveen and Eliot Hodges and Dulhan Jayalath and Charles London and Kalyan Ramakrishnan and Flaviu Cipcigan and Philip Torr and Alessandro Abate},
      year={2026},
      eprint={2603.15617},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2603.15617},
}