AWS opens cluster of 40K Trainium AI accelerators to researchers

Throwing novel hardware at academia. It's a tale as old as time


Amazon wants more people building applications and frameworks for its custom Trainium accelerators and is making up to 40,000 chips available to university researchers under a $110 million initiative announced on Tuesday.

Dubbed "Build on Trainium," the program will provide compute hours to AI academics developing new algorithms, looking to increase accelerator performance, or scale compute across large distributed systems.

"A researcher might invent a new model architecture or a new performance optimization technique, but they may not be able to afford the high-performance computing resources required for a large-scale experiment," AWS explained in a recent blog post.

And perhaps more importantly, the fruits of this labor are expected to be open-sourced by researchers and developers so that they can benefit the machine learning ecosystem as a whole.

As altruistic as this all might sound, it's to Amazon's benefit: The cloud giant's custom silicon, which now runs the gamut from CPUs and SmartNICs to dedicated AI training and inference accelerators, was originally designed to improve the efficiency of its internal workloads.

Developing low-level application frameworks and kernels isn't a big ask for such a large company. However, things get trickier when you start opening up the hardware to the public, which in large part lacks those resources and that expertise, necessitating a higher degree of abstraction. This is why we've seen Intel, AMD, and others gravitate toward frameworks like PyTorch or TensorFlow to hide the complexity associated with low-level coding. We've certainly seen this with AWS products like SageMaker.

Researchers, on the other hand, are often more than willing to dive into low-level hardware if it means extracting additional performance, uncovering hardware-specific optimizations, or simply getting access to the compute necessary to move their research forward. What is it they say about necessity being the mother of invention?

"The knobs of flexibility built into the architecture at every step make it a dream platform from a research perspective," Christopher Fletcher, an associate professor at the University of California, Berkeley, said of Trainium in a statement.

It isn't clear from the announcement whether all 40,000 of those accelerators are first-generation or second-generation parts. We'll update if we hear back on this.

The second-generation parts, announced roughly a year ago during Amazon's re:Invent event, saw the company shift focus toward everyone's favorite flavor of AI: large language models. As we reported at the time, Trainium2 is said to deliver 4x faster training performance than its predecessor and triple the memory capacity.

Since any innovations uncovered by researchers — optimized compute kernels for domain-specific machine learning tasks, for example — will be open-sourced under the Build on Trainium program, Amazon stands to benefit from its crowdsourcing of software development.

Naturally, throwing hardware at academics is a tale as old as university computer science programs, and to support these efforts, Amazon is extending access to technical education and enablement programs to get researchers up to speed. This will be handled through a partnership with the Neuron Data Science community, an organization led by Amazon's Annapurna Labs team. ®
