Foundation model for tabular data slashes training from hours to seconds
Good ol' spreadsheet data could benefit from 'revolutionary' approach to ML inference
Move over ChatGPT and DALL-E: Spreadsheet data is getting its own foundation machine learning model, allowing users to immediately make inferences about new data points for data sets with up to 10,000 rows and 500 columns.
One commentator said the development could be "revolutionary" for the speed at which users can make predictions using tabular data.
Foundation models such as OpenAI's ChatGPT are pre-trained on vast data sets and provide a general basis for developers to build more specialist models without such extensive training.
A team led by Frank Hutter, professor of machine learning at the University of Freiburg, has developed a foundation model for tabular machine learning, which can make immediate inferences based on tables of data. Predictions based on tabular data – essentially spreadsheet data – are valuable in a wide variety of scenarios, from social media moderation to hospital decision-making.
"The authors' advance is expected to have a profound effect in many areas," said Duncan McElfresh, a senior data engineer at Stanford Health Care, part of Stanford University.
The study, published in Nature last week, explains how the team built the foundation model, TabPFN, to learn causal relationships from synthetic data modeled on real scenarios: data tables in which the entries in the individual columns are causally linked. The model was trained on 100 million such synthetic data sets, allowing it to narrow down possible causal relationships and use them for its predictions.
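The idea of a table whose columns are causally linked can be illustrated with a tiny structural causal model. This is a hedged sketch, not the authors' actual data-generating procedure: the column names, functional forms, and noise levels here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthetic_causal_table(n_rows=200):
    """Build one toy table whose columns are causally linked:
    age -> income -> defaulted, each step with independent noise.
    (Illustrative only; TabPFN's real priors are far richer.)"""
    age = rng.uniform(20, 70, n_rows)                  # root cause
    income = 1000 * age + rng.normal(0, 5000, n_rows)  # caused by age
    # target column caused by income via a noisy threshold
    defaulted = (income + rng.normal(0, 8000, n_rows) < 45000).astype(int)
    return np.column_stack([age, income]), defaulted

X, y = synthetic_causal_table()
print(X.shape, y.shape)  # (200, 2) (200,)
```

A model pretrained on millions of such tables, each drawn from a different causal structure, can learn to infer which structures are consistent with a user's data.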
In an accompanying article, McElfresh said: "The authors' foundation model is ... remarkably effective. It can take a user's data set and immediately make inferences about new data points ... Using a battery of experiments, [the researchers] found that TabPFN consistently outperforms other machine learning methods – automated or otherwise – for data sets with up to 10,000 rows and 500 columns. It is also more adept than other methods at coping with common data problems such as missing values, outliers, and uninformative features. And whereas conventional machine learning models require minutes or even hours to train, TabPFN can produce inferences for a new data set in fractions of a second."
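McElfresh's point about near-instant training can be sketched with a toy in-context predictor. Like TabPFN, it does no gradient training when given a data set; all the work happens at prediction time in one vectorized pass. Unlike TabPFN, which runs a pretrained transformer, this stand-in uses a trivial nearest-neighbour rule — the class and method names are hypothetical, chosen only to mirror the scikit-learn-style fit/predict convention.

```python
import numpy as np

class InContextClassifier:
    """Toy analogue of TabPFN's workflow (not the real model):
    'fit' merely stores the table, so it costs fractions of a
    second; computation is deferred to predict()."""

    def fit(self, X, y):
        self.X = np.asarray(X, dtype=float)
        self.y = np.asarray(y)
        return self

    def predict(self, X_new):
        # Single pass over the stored table: label of the nearest
        # stored row (a 1-NN stand-in for the transformer forward
        # pass TabPFN actually performs).
        X_new = np.asarray(X_new, dtype=float)
        d = np.linalg.norm(self.X[None, :, :] - X_new[:, None, :], axis=2)
        return self.y[d.argmin(axis=1)]

clf = InContextClassifier().fit([[0.0], [1.0], [2.0]], [0, 0, 1])
print(clf.predict([[1.9], [0.1]]))  # [1 0]
```

The design point is the split of work: conventional models pay their cost up front in training, whereas an in-context model amortizes pretraining across every future data set and answers each new one immediately.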
In the paper, the authors said that by improving modeling abilities across diverse fields, TabPFN could accelerate scientific discovery and enhance important decision-making in various domains.
"This shift towards foundation models trained on synthetic data opens up new possibilities for tabular data analysis across various domains," the researchers said. "Future work could explore creating specialized priors to handle data types such as time series and multi-modal data or specialized modalities such as ECG, neuroimaging data, and genetic data. As the field of tabular data modeling continues to evolve, we believe that foundation models, such as TabPFN, will play a key part in empowering researchers." ®