
GPT-4 parsed 87% of 19th-century British trade tables into structured panel data. Systematic failures on multi-level headers, plus a temperature-driven precision-recall trade-off, qualify the result.
A new study from researchers at the University of Zurich and Cornell tests whether large language models can reliably convert scanned historical tables into panel data, the structured format that underlies empirical finance and economics. The answer: a qualified yes.
The preprint, posted on arXiv, runs GPT-4 on 3,000 tables drawn from 19th-century British trade statistics. The model correctly parsed about 87% of the tables. Errors clustered on multi-level headers and fused cells.
That headline number masks a sharp split. On clean single-index tables, accuracy exceeded 95%. On tables with merged column headers spanning multiple rows, it fell to roughly 65%. The failures followed a pattern. A table with "Iron" as a row label and sub-columns labeled "Pig" and "Bar" passes easily. A table where the same labels sit two rows above their data, separated by a blank row, confuses the model.
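To make the structural step concrete, here is a minimal Python sketch of what resolving a merged header involves. It is not the paper's method. The table layout, the function names (`forward_fill`, `to_panel`) and every value are hypothetical, chosen to mirror the "Iron" / "Pig" / "Bar" example with a blank row between header and data.

```python
from typing import Dict, List, Optional


def forward_fill(cells: List[Optional[str]]) -> List[Optional[str]]:
    """Propagate a merged header label rightwards across the columns it spans."""
    filled, last = [], None
    for cell in cells:
        if cell:  # None or "" means the cell sits under a merged label
            last = cell
        filled.append(last)
    return filled


def to_panel(top: List[Optional[str]],
             sub: List[str],
             rows: List[List[Optional[float]]]) -> List[Dict]:
    """Turn a two-level-header table into long-format panel records."""
    top = forward_fill(top)
    # Drop fully blank rows, such as a spacer between header and data.
    rows = [r for r in rows if any(c is not None for c in r)]
    records = []
    for row in rows:
        year = row[0]  # assumption: the first column holds the period
        for col in range(1, len(row)):
            if row[col] is None:
                continue  # leave missing cells out rather than guess
            records.append({
                "year": year,
                "commodity": top[col],
                "grade": sub[col],
                "value": row[col],
            })
    return records


# Hypothetical table; the numbers are invented for illustration only.
top_header = [None, "Iron", None, "Coal"]
sub_header = ["Year", "Pig", "Bar", "All"]
rows = [
    [None, None, None, None],   # blank spacer row
    [1850, 10.0, 4.0, 50.0],
    [1851, 12.0, None, 55.0],
]

panel = to_panel(top_header, sub_header, rows)
for record in panel:
    print(record)
```

The hard part in practice is the first step: knowing that "Iron" spans two columns at all. A scanned page does not encode the merge, which is where the model reportedly stumbles.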
The researchers also found that temperature settings matter. At low temperatures, the model hallucinated fewer values but left ambiguous cells blank. At higher temperatures, it filled in missing data but occasionally invented plausible-looking numbers with no basis in the table. The result is a precision-recall trade-off driven more by how the model is configured and prompted than by model size.
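The trade-off is easier to see at the cell level. The sketch below assumes one simple scoring convention, which may differ from the paper's: precision is the share of filled cells that match the ground truth, and recall is the share of ground-truth cells that were filled correctly. The two model outputs are invented to mimic a cautious low-temperature run and an eager high-temperature run.

```python
from typing import Dict, Hashable, Optional, Tuple


def precision_recall(truth: Dict[Hashable, float],
                     predicted: Dict[Hashable, Optional[float]]) -> Tuple[float, float]:
    """Score extracted cells against hand-checked values.

    A predicted value of None means the model left the cell blank.
    """
    filled = {k: v for k, v in predicted.items() if v is not None}
    correct = sum(1 for k, v in filled.items() if truth.get(k) == v)
    precision = correct / len(filled) if filled else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical ground truth, keyed by (year, column).
truth = {
    (1850, "Pig"): 10.0,
    (1850, "Bar"): 4.0,
    (1851, "Pig"): 12.0,
    (1851, "Bar"): 5.0,
}

# Cautious run: blanks out the two cells it is unsure about.
cautious = {
    (1850, "Pig"): 10.0,
    (1850, "Bar"): None,
    (1851, "Pig"): 12.0,
    (1851, "Bar"): None,
}

# Eager run: fills every cell, one of them with an invented number.
eager = {
    (1850, "Pig"): 10.0,
    (1850, "Bar"): 4.0,
    (1851, "Pig"): 12.0,
    (1851, "Bar"): 7.0,
}

print("cautious:", precision_recall(truth, cautious))  # (1.0, 0.5)
print("eager:   ", precision_recall(truth, eager))     # (0.75, 0.75)
```

Which run is preferable depends on the downstream use. A blank cell is visible and can be sent to a human. A wrong but plausible number is not.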
For quant desks that have tried to build a century-long panel on commodity prices, trade flows, or railroad freight rates, the pain is familiar. Optical character recognition (OCR) handles character-level transcription. The structural step, mapping scanned rows and columns into a time series, has been a manual bottleneck. A tool that automates 87% of that extraction, even with systematic failure modes, changes the economics of historical data projects.
The caveat is validation. The 87% figure comes from a controlled test set with human reviewers checking every output. In production, without reviewers or a clean labelled set to check against, errors would go uncaught and effective accuracy would likely fall. A firm loading a 50-year panel on Argentine wheat exports into its quant model needs the error distribution, not just the error rate. The paper provides that distribution for 19th-century British trade. It does not cover the 20th-century corporate filings, hand-drawn railway charts, or colonial administrative ledgers that most quant shops actually want.
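The difference between an error rate and an error distribution can be shown in a few lines. The review sample below is invented: its per-type accuracies echo the 95% and 65% figures above, but the mix of table types is an assumption, not something taken from the paper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def accuracy_by_type(reviews: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Break a single accuracy number down by table type."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for table_type, ok in reviews:
        totals[table_type] += 1
        hits[table_type] += int(ok)
    return {t: hits[t] / totals[t] for t in totals}


# Hypothetical human-review sample: (table type, parsed correctly?).
reviews = (
    [("single_index", True)] * 19 + [("single_index", False)] * 1
    + [("merged_header", True)] * 13 + [("merged_header", False)] * 7
)

overall = sum(ok for _, ok in reviews) / len(reviews)
by_type = accuracy_by_type(reviews)

print(f"overall: {overall:.2f}")
for table_type, acc in by_type.items():
    print(f"{table_type}: {acc:.2f}")
```

A corpus dominated by merged-header tables would see an overall rate well below the headline, so the per-type breakdown on a firm's own sample is the number to estimate before trusting the output.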
The researchers made their data and code available. They plan to reproduce the result on a different historical corpus, which will test whether the method generalizes or overfits to one era's tabular conventions.
Prepared with AlphaScala research tooling and grounded in primary market data: live prices, fundamentals, SEC filings, hedge-fund holdings, and insider activity. Each story is checked against AlphaScala publishing rules before release. Educational coverage, not personalized advice.