Foundation model for tabular data slashes training from hours to seconds
Good ol' spreadsheet data could benefit from 'revolutionary' approach to ML inference
Move over ChatGPT and DALL-E: Spreadsheet data is getting its own foundation machine learning model, allowing users to immediately make inferences about new data points for data sets with up to 10,000 rows and 500 columns.
One commentator said the development could be "revolutionary" for the speed at which users can make predictions using tabular data.
Foundation models such as OpenAI's ChatGPT are pre-trained on vast data sets and provide a general basis for developers to build more specialist models without such extensive training.
A team led by Frank Hutter, professor of machine learning at the University of Freiburg, has developed a foundation model for tabular machine learning, which can make immediate inferences based on tables of data. Predictions based on tabular data – essentially spreadsheet data – are valuable in a wide variety of scenarios, from social media moderation to hospital decision-making.
"The authors' advance is expected to have a profound effect in many areas," said Duncan McElfresh, a senior data engineer at Stanford Health Care, part of Stanford University.
The study, published in Nature last week, explains how the team built the foundation model, TabPFN, to learn causal relationships from synthetic data modeled on real scenarios: data tables in which the entries in the individual columns are causally linked. The model was trained on 100 million such synthetic data sets, allowing it to narrow down possible causal relationships and use them for its predictions.
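The idea of a table whose columns are causally linked can be illustrated with a tiny structural causal model. This is a hedged sketch, not the authors' actual data-generating procedure: the column names, functional forms, and noise levels here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthetic_causal_table(n_rows=200):
    """Build one toy table whose columns are causally linked:
    age -> income -> defaulted, each step with independent noise.
    (Illustrative only; TabPFN's real priors are far richer.)"""
    age = rng.uniform(20, 70, n_rows)                  # root cause
    income = 1000 * age + rng.normal(0, 5000, n_rows)  # caused by age
    # target column caused by income via a noisy threshold
    defaulted = (income + rng.normal(0, 8000, n_rows) < 45000).astype(int)
    return np.column_stack([age, income]), defaulted

X, y = synthetic_causal_table()
print(X.shape, y.shape)  # (200, 2) (200,)
```

A model pretrained on millions of such tables, each drawn from a different causal structure, can learn to infer which structures are consistent with a user's data.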
In an accompanying article, McElfresh said: "The authors' foundation model is ... remarkably effective. It can take a user's data set and immediately make inferences about new data points ... Using a battery of experiments, [the researchers] found that TabPFN consistently outperforms other machine learning methods – automated or otherwise – for data sets with up to 10,000 rows and 500 columns. It is also more adept than other methods at coping with common data problems such as missing values, outliers, and uninformative features. And whereas conventional machine learning models require minutes or even hours to train, TabPFN can produce inferences for a new data set in fractions of a second."
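McElfresh's point about near-instant training can be sketched with a toy in-context predictor. Like TabPFN, it does no gradient training when given a data set; all the work happens at prediction time in one vectorized pass. Unlike TabPFN, which runs a pretrained transformer, this stand-in uses a trivial nearest-neighbour rule — the class and method names are hypothetical, chosen only to mirror the scikit-learn-style fit/predict convention.

```python
import numpy as np

class InContextClassifier:
    """Toy analogue of TabPFN's workflow (not the real model):
    'fit' merely stores the table, so it costs fractions of a
    second; computation is deferred to predict()."""

    def fit(self, X, y):
        self.X = np.asarray(X, dtype=float)
        self.y = np.asarray(y)
        return self

    def predict(self, X_new):
        # Single pass over the stored table: label of the nearest
        # stored row (a 1-NN stand-in for the transformer forward
        # pass TabPFN actually performs).
        X_new = np.asarray(X_new, dtype=float)
        d = np.linalg.norm(self.X[None, :, :] - X_new[:, None, :], axis=2)
        return self.y[d.argmin(axis=1)]

clf = InContextClassifier().fit([[0.0], [1.0], [2.0]], [0, 0, 1])
print(clf.predict([[1.9], [0.1]]))  # [1 0]
```

The design point is the split of work: conventional models pay their cost up front in training, whereas an in-context model amortizes pretraining across every future data set and answers each new one immediately.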
In the paper, the authors said that by improving modeling abilities across diverse fields, TabPFN could accelerate scientific discovery and enhance important decision-making in various domains.
"This shift towards foundation models trained on synthetic data opens up new possibilities for tabular data analysis across various domains," the researchers said. "Future work could explore creating specialized priors to handle data types such as time series and multi-modal data or specialized modalities such as ECG, neuroimaging data, and genetic data. As the field of tabular data modeling continues to evolve, we believe that foundation models, such as TabPFN, will play a key part in empowering researchers." ®