AWS declares it's Iceberg all the way until customers say otherwise

Cloud giant explains its thinking behind support for Apache open table format

AWS bet on the Apache Iceberg open table format (OTF) across its analytics, machine learning, and storage stack as a concerted response to demand from customers already using its popular S3 object storage.

While there is a growing consensus around Iceberg, questions remain about the future of rival OTF Delta Lake, created by Databricks and made open source under the stewardship of the Linux Foundation, and currently the format of choice among software giants Microsoft and SAP.

But for the world's largest cloud platform provider, it is a done deal on Iceberg until customers of its S3 service say otherwise.

The stance matters for a couple of reasons. S3 holds around 23 percent of the global enterprise data storage software market, and AWS is on course for $105 billion in annual revenue, making it by some distance the largest cloud infrastructure provider.

The importance of Iceberg is also marked by Databricks' decision to pay $1 billion (maybe $2 billion) for Tabular, the company founded by the original authors of Iceberg, without even getting its hands on the technology, which is open source.

Andy Warfield, AWS veep and distinguished engineer, told The Register: "We are working directly with Iceberg. We have core committers on the Iceberg open source stack, so AWS is an active committer to Iceberg itself, where we're shaping the APIs and working with the other folks working on Iceberg. We've really gone [in that] direction, like we do with everything, because it's what we saw our largest analytics customers on S3 doing.

"If customers pull us in different directions, we will obviously explore adding support for those things. But for now, Iceberg has emerged as a really attractive direction in terms of its design, but also a popular and well-supported direction for building this kind of structured support on storage."

Late last year, AWS announced S3 Tables, a new type of storage bucket that Warfield described as "a managed Iceberg table. It provides an Iceberg catalog, in which users can create namespaces and tables, each table is a first-class resource. Users can set access control policy and security policy on the table itself."

AWS previously said that because the bucket was pre-partitioned, it would offer a 10x performance boost for access. AWS also automatically runs all of the maintenance and optimization tasks under the covers.

Iceberg originated in 2015 when Netflix had completed its move from an on-premises data warehouse and analytics stack to one based around AWS S3 object storage, which it tried to query via Hive Tables until it hit performance issues and "some very surprising behaviors."

The challenges led the team to develop the Iceberg open table format designed for large-scale analytical workloads while supporting query engines including Spark, Trino, Flink, Presto, Hive, and Impala. It promised to help organizations bring their analytics engine of choice to their data without going through the expense and inconvenience of moving it to a new data store. Iceberg was donated to the Apache Software Foundation as an open source project in November 2018. Since the beginning of 2022, it has won vocal support from data warehouse and data lake big-hitters including Google, Snowflake, and Cloudera.

In 2023, AWS made its first public announcement about Iceberg, previewing support to allow users to employ its cloud-native data warehouse, Redshift, to run analytic queries on Iceberg tables in external data lakes, but only if they were new tables, not tables converted from Parquet to Iceberg.

Warfield said interest in Iceberg began to grow about three years ago as S3 users and AWS grappled with the problem of creating a database-like representation of data in S3. Parquet addressed this by storing data column by column, grouped into so-called row groups, so a query could read just the columns and row groups it needed rather than the whole file. While the approach created benefits, there was also a cost.

"Parquet became a lot better that way," Warfield said. "We got this much more database-friendly representation of data, but because S3 is immutable, once you wrote your table in Parquet, you couldn't do any of the things that people were used to doing with databases in terms of mutations. You couldn't update it. And so at best, what we were seeing, up to three years ago, before the introduction of OTFs, was that the data was totally static, and people would append by adding additional Parquet files."
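The columnar layout Warfield refers to can be sketched in a few lines of plain Python. This is a toy model, not any real Parquet API: it just shows how splitting rows into row groups of per-column chunks lets a query read one column without scanning the rest of the file.

```python
# Toy model of a columnar file: rows are split into row groups,
# and within each row group the values are stored per column.
# Illustrative only -- real Parquet adds encodings, compression,
# and footer metadata with per-chunk statistics.

def write_columnar(rows, row_group_size=2):
    """Split a list of dict rows into row groups of column chunks."""
    groups = []
    for i in range(0, len(rows), row_group_size):
        chunk = rows[i:i + row_group_size]
        groups.append({col: [r[col] for r in chunk] for col in chunk[0]})
    return groups

def read_column(groups, col):
    """Read a single column without touching the other columns' chunks."""
    out = []
    for g in groups:
        out.extend(g[col])
    return out

rows = [
    {"user": "a", "bytes": 10},
    {"user": "b", "bytes": 20},
    {"user": "c", "bytes": 30},
]
groups = write_columnar(rows, row_group_size=2)
print(read_column(groups, "bytes"))  # only the "bytes" chunks are scanned
```

Note that the files themselves stay immutable in this model, which is exactly the limitation Warfield describes: you can append new files, but not update existing ones.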

Iceberg and other OTFs add a layer of metadata on top of the Parquet structures. Iceberg stores that metadata, typically as JSON files, under a root node that points to the current view of the table. Writing a new root node acts like an atomic database update: it switches, in a single step, the view of the table that the customer sees.
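The commit mechanism can be sketched as a tiny in-memory model. This is purely illustrative, not Iceberg's implementation: real Iceberg writes its metadata as JSON files in the object store and commits through a catalog, but the essential trick is the same, new immutable objects plus one atomic pointer swap.

```python
# Toy model of Iceberg-style commits over immutable storage.
# Data and metadata objects are never modified in place; an update
# writes new objects and then atomically swaps the pointer that
# identifies the current table metadata.

class ToyTable:
    def __init__(self):
        self.objects = {}    # immutable object store: name -> content
        self.current = None  # catalog pointer to the current metadata
        self._version = 0

    def commit(self, data_files):
        """Write a new metadata snapshot, then swap the pointer."""
        self._version += 1
        meta_name = f"metadata/v{self._version}.json"
        self.objects[meta_name] = {"files": list(data_files)}
        self.current = meta_name  # the only mutation readers ever observe

    def scan(self):
        """Readers resolve the pointer once, then read immutable objects."""
        return [] if self.current is None else self.objects[self.current]["files"]

t = ToyTable()
t.commit(["data/part-0.parquet"])
t.commit(["data/part-0.parquet", "data/part-1.parquet"])
print(t.scan())
```

Because old metadata objects are kept rather than overwritten, a reader that resolved the pointer before the second commit still sees a consistent earlier view, which is how the format makes an immutable store behave like a mutable table.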

"You can do these relatively small updates, but you make the table completely mutable," Warfield said. "Two years ago, those conversations with customers shifted to moving from just plain Parquet, sometimes with Hive as a metastore on top of it, to actually like dipping their toes in and doing stuff with Iceberg."

AWS's approach to Iceberg is embodied not only in S3 Tables but also in Sagemaker, the machine learning platform, which has been repositioned to accommodate some aspects of data warehousing, analytics, and data lakes.

"From the S3 storage team's perspective, they are really excited about S3 Tables because anyone with this highly structured data that puts it here suddenly gains the ability to work with it from basically any analytics or machine learning tool and also their own applications. And from Sagemaker's perspective, supporting the Iceberg APIs means that they can now work not just with S3 and S3 Tables, but also with any data that's stored in Iceberg anywhere," Warfield said.

Since Snowflake, Google, and a raft of other vendors have also jumped in with Iceberg, the move promises to ease integration with projects already started with other technologies. It also has implications for AWS's Redshift, on which customers have been building projects for more than ten years.

The AWS data warehouse has its own approach to storage – Redshift Managed Storage (RMS) – which Warfield said was the cloud vendor's attempt to solve some of the problems OTFs also address. With the Sagemaker Lakehouse Catalog, this data will be open to a broader set of analytics tools outside AWS's portfolio so long as they support the Iceberg APIs.

"With the introduction of Iceberg REST Catalog support inside the Sagemaker Lakehouse Catalog, the analytics team has opened up the ability for RMS to be accessed by any analytics platform, which is a huge improvement in flexibility and access to that data. Conversely, Redshift, through the Iceberg REST Catalog, can work with any Iceberg storage," he said.
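The interoperability Warfield describes rests on the Iceberg REST catalog being a plain HTTP API with well-known routes. A client can be sketched as a path builder; the route shape follows the Iceberg REST catalog OpenAPI specification, while the prefix, namespace, and table names below are made up for illustration.

```python
from urllib.parse import quote

# Build Iceberg REST catalog routes. The route shapes follow the
# Iceberg REST catalog OpenAPI spec; "example-prefix", the namespace,
# and the table name are hypothetical. Multi-level namespaces are
# joined with the 0x1F unit separator before URL-encoding, per the
# spec's default.

def namespace_path(levels):
    """Encode a (possibly multi-level) namespace for use in a URL."""
    return quote("\x1f".join(levels), safe="")

def load_table_route(prefix, namespace_levels, table):
    """Route for the loadTable operation."""
    return (f"/v1/{prefix}/namespaces/{namespace_path(namespace_levels)}"
            f"/tables/{quote(table, safe='')}")

print(load_table_route("example-prefix", ["analytics", "daily"], "events"))
# -> /v1/example-prefix/namespaces/analytics%1Fdaily/tables/events
```

Any engine that can issue these requests against a catalog endpoint, whether Sagemaker Lakehouse, Redshift, or a third-party tool, can discover and load the same tables, which is the point Warfield is making.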

In adopting Iceberg across its storage, analytics, and machine learning portfolio, AWS is doing its bit to push Iceberg towards fulfilling its early promise.

"All of this stuff is just really being driven by the resounding voice of a lot of our customers who are doing analytics. They have data in all sorts of places and they have teams that have preferences for different tools. There is a lot of new adoption and a huge investment within users to make sure that any tool works with any data, and any data is available to every tool," Warfield said.

Questions remain about Microsoft's approach in its Fabric platform. The omnipresent vendor promises a degree of integration with Iceberg, although Delta is set to remain its native table format.

Databricks has talked about trying to merge Delta and Iceberg, which it admits might take a few years, and in any case would be dependent on Apache's governance of Iceberg, which Databricks does not control.

A former software engineering manager at Apple, where Iceberg is said to be wall-to-wall, said adopting Iceberg as the de facto standard, rather than merging the two standards, would be a better option. Iceberg committer and PMC member Russell Spitzer, who recently joined Snowflake as principal engineer, told The Register in October that he hoped vendors would all use Iceberg under the hood to eliminate table formats as a design point.

Warfield said AWS talks to Databricks, which builds its systems on top of S3, and that it was working to ensure all of the data users hold on any of these analytics platforms is available to everyone, and able to work on all systems.

But since the cloud giant has renewed its commitment to Iceberg, the ball remains firmly in Databricks' court. ®
