Shortly after Amazon CEO Andy Jassy announced AWS's groundbreaking $50 billion investment deal with OpenAI, Amazon invited me on a private tour of the chip development lab at the heart of the deal, at (mostly*) its own expense.
Industry experts are watching Amazon's Trainium chip, created at that facility, for its implications for lower-cost AI inference and, potentially, a dent in Nvidia's near monopoly.
Curious, I agreed to go.
My tour guides for the day were the lab's director, Kristopher King (pictured below right), and director of engineering Mark Carroll (below left), as well as the team's PR person who arranged the visit, Doron Aronson (pictured with yours truly later in the story).

AWS has been Anthropic's major cloud platform since the AI lab's early days, a relationship significant enough to survive Anthropic later adding Microsoft as a cloud partner as well, and Amazon's growing partnership with OpenAI.
The OpenAI deal makes AWS the exclusive provider of the model maker's new AI agent builder, Frontier, which could become an important part of OpenAI's business if agents become as big as Silicon Valley thinks they will. We'll see if that exclusivity stands exactly as announced. The Financial Times reported this week that Microsoft may believe OpenAI's deal with Amazon violates its own deal with OpenAI, which entitles Redmond to access to all of OpenAI's models and tech.
What makes AWS so appealing to OpenAI? As part of this deal, the cloud giant has agreed to supply OpenAI with 2 gigawatts of Trainium computing capacity. This is a giant commitment, given that Anthropic and Amazon's own Bedrock service are already consuming Trainium chips faster than Amazon can produce them.
There are 1.4 million Trainium chips deployed across all three generations, and Anthropic's Claude runs on over 1 million of the Trainium2 chips deployed, the company said.
It's worth noting that while Trainium was originally geared toward faster, cheaper model training (a bigger priority a couple of years ago), it's now tuned and used for inference as well. Inference, the process of actually running an AI model to generate responses, is currently the biggest performance bottleneck in the industry.
Case in point: Trainium2 handles the majority of the inference traffic on Amazon's Bedrock service, which supports the building of AI applications by Amazon's many enterprise customers and allows the apps to use multiple models.
"Our customer base is just expanding as fast as we can get capacity out there," King said. "Bedrock could be as big as EC2 one day," he added, referring to AWS's behemoth compute cloud service.

Trainium vs. Nvidia
Beyond offering an alternative to Nvidia's backlogged, hard-to-acquire GPUs, Amazon says its new chips, running on its new specialty Trn3 UltraServers, cost up to 50% less to run than classic cloud servers for comparable performance.
Along with Trainium3, released in December, this AWS team also built new Neuron switches, and Carroll says that combo is transformative.
"What that gives us is something huge," Carroll said. The switches allow every Trainium3 chip to talk to every other chip in a mesh configuration, reducing latency. "That's why Trainium3 is breaking all kinds of records," particularly in "price per power," he said.
When trillions of tokens a day are involved, such improvements add up.
In fact, Amazon's chip team was lauded by Apple in 2024. In a rare moment of openness for the secretive company, Apple's director of AI publicly described how it used another of the team's chips: Graviton, a low-power, ARM-based server CPU and the first breakout chip this team designed. Apple also lauded Inferentia, a chip specifically designed for inference, and gave a nod to Trainium, which was new at the time.
These chips represent the classic Amazon playbook: See what people want to buy, then build an in-house alternative that competes on price.Ā
The catch for chips, historically, has been switching costs. Applications written for Nvidia's chips must be re-architected to work with others, a time-consuming process that discourages developers from switching.
But the AWS chip team proudly told me that Trainium now supports PyTorch, a popular open source framework for building AI models. That includes many of the ones hosted on Hugging Face, a vast library where developers share open source models.
The transition, Carroll told me, requires "basically a one-line change, and then recompile, and then run on Trainium." In other words, Amazon is attempting to chip away at Nvidia's market dominance wherever possible.
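In PyTorch terms, that one-line change is typically the device selection. Here's a minimal sketch of the pattern; the Trainium-specific line is shown as a comment, since the `torch_xla` module names come from AWS's public Neuron SDK documentation and may differ by SDK version, while the runnable part below uses CPU as a stand-in:

```python
import torch

# On a Trainium (Trn-family) instance, AWS's Neuron SDK exposes the chip to
# PyTorch as an XLA device, so the swap Carroll describes is roughly:
#
#   import torch_xla.core.xla_model as xm
#   device = xm.xla_device()
#
# Everything else in the script stays standard PyTorch. CPU is used here so
# this sketch runs anywhere:
device = torch.device("cpu")

model = torch.nn.Linear(8, 2).to(device)   # toy model standing in for a real one
x = torch.randn(4, 8, device=device)       # batch of 4 inputs
out = model(x)
print(tuple(out.shape))  # (4, 2)
```

The rest of the script, including model code pulled from a library like Hugging Face, is unchanged, which is the whole point of the low switching cost Amazon is pitching.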
AWS also announced a partnership this month with Cerebras Systems, integrating that company's inference chip on servers running Trainium for what Amazon promises will be superpowered, low-latency AI performance.
But Amazon's ambitions go beyond the chips themselves. It also designs the server that hosts the chips. Besides the networking components, this team has designed "Nitro," a hardware-software combo that provides virtualization tech (which allows many instances of software to run separately on the same server); new state-of-the-art liquid cooling technology; and the server sleds (pictured below) that host this gear.
All of that is to control cost and performance.

Working 24/7 on the "bring-up"
Amazon's custom chip-designing unit was born when the cloud giant bought Israeli chip designer Annapurna Labs in January 2015 for about $350 million. This team has now spent more than 10 years designing chips for AWS. The unit has retained its Annapurna roots and name; its logo is everywhere in the office.
This chip lab is located in a shiny, chrome-windowed building in Austin's upscale "The Domain" district, a walkable area filled with shops and restaurants that's sometimes called Austin's Silicon Valley.
The offices have your classic tech corporate vibe: desks in cubicles, gathering spots, and conference rooms. But tucked away at the back of a high floor in the building is the actual lab, with sweeping views of the city.
The shelving-filled lab, about the size of two large conference rooms, is a noisy industrial space thanks to the fans on the equipment. It looks like a cross between a high school shop class and a Hollywood set for a high-end lab, except the engineers are dressed in jeans, not white lab coats.


Note that this is not where the chips are manufactured, so no white hazmat suits were necessary. The Trainium3 is a state-of-the-art 3-nanometer chip, produced by TSMC, arguably the leader in 3-nanometer manufacturing, with other chips produced by Marvell.
But this is the room where the magic of the "bring-up" occurs.
"A silicon bring-up is when you get the chip for the first time, and it's like a big overnight party. You stay here, like a lock-in," King explained. After 18 months of work, the chip is activated for the first time to verify it works as designed. The team even filmed some of the Trainium3 bring-up and posted it on YouTube.
Spoiler alert: It's never problem-free.
For Trainium3, the prototype chip was originally air-cooled, like previous versions. The current chip is now liquid-cooled, which offers energy advantages and was quite an engineering feat.
During the bring-up, the dimensions for how the chip attached to the air-cooling heat sink were off, so the chip couldn't be activated.
Unfazed, the team "immediately got a grinder and just started grinding off the metal," King said. Because they didn't want the noise disrupting the bring-up pizza party atmosphere, they snuck off and did the grinding in a conference room.
Staying up all night and solving problems "is what silicon bring-up is all about," King said.
The lab even has a welding station, where hardware lab engineer and master welder Isaac Guevara demonstrated welding tiny integrated circuit components through a microscope. This is such insanely difficult work that senior leader Carroll openly admitted he couldn't do it, to the guffaws of Guevara and the rest of the engineers in the room.

The lab also contains both custom-made and commercial tools for testing and analyzing issues with chips. Here's signal engineer Arvind Srinivasan demonstrating how the lab tests each tiny component on the chip:

Sleds are the star of the lab
But the star of the lab is an entire row showcasing each generation of the "sleds" the team designed.

Sleds are the trays that house the Trainium AI chips, Graviton CPU chips, and supporting boards and components. Stack them together on a rack with the networking component, also custom-designed by this team, and you get the systems that are at the heart of Anthropic Claude's success.
Here's the sled that was shown off during the AWS re:Invent conference in December:

Proven by Anthropic and OpenAI
I expected my guides to crow about the OpenAI deal during the tour. But they didn't.
The reticence could have been related to the aforementioned potential legal haze that might hang over the deal. But the sense I got was that these boots-on-the-ground engineers (who are currently designing the next version, Trainium4) haven't had much chance to work with OpenAI yet. Their day-to-day work has so far been focused on Anthropic's and Amazon's needs.
Currently, the biggest chunk of Trainium2 chips is deployed in Project Rainier, one of the world's largest AI compute clusters, which went live in late 2025 with 500,000 chips. It's used by Anthropic.
But there was a wall monitor in the main office displaying a quote about how OpenAI will be using Trainium. The pride was there, if subtle.
In addition to this lab, the team also has its own private data center, a short drive away, for quality and testing purposes. Because it doesn't run customer workloads, it's housed at a co-location facility, not an AWS data center.
Security is tight: There are strict protocols to enter the building and to access Amazon's area within.
The data center's cooling system is so loud that earplugs are mandatory, and the air is thick with the acrid smell of heated metal. It's not a pleasant place for the average person to hang out.

At this data center, there are rows and rows of servers filled with sleds that integrate all of Amazon's newest custom chips: Graviton CPUs, liquid-cooled Trainium3, and Amazon Nitro, all happily computing away. The liquid runs in a closed system, meaning it is reused, which should also help reduce the environmental impact, the engineers said.
Here's what a current Trn3 UltraServer looks like: Multiple sleds are on top and bottom, with the Neuron switches in the middle. Hardware development engineer David Martinez-Darrow is seen here performing maintenance on a sled:

While attention on the team has always been high, the scrutiny has really ratcheted up of late.
Amazon CEO Andy Jassy keeps a close eye on this lab, publicly bragging about its products like a proud dad. In December, he said Trainium was already a multibillion-dollar business for AWS and called it one of the pieces of AWS tech he's most excited about. He also gave the chip a shout-out when announcing the OpenAI agreement.
The team feels the pressure, too. Engineers will work 24/7 for three to four weeks around each bring-up event to fix any issues so the chips can be mass-produced and put into data centers.
"It's very important that we get as fast as possible to prove that it's actually going to work," Carroll said. "So far, we've been doing really well."
*Disclosure: Amazon provided airfare and covered the cost of one night at a local hotel. Honoring its Leadership Principle of Frugality, this was a back-of-the-plane middle seat and a modest room. TechCrunch picked up the other associated travel costs like Ubers and luggage fees. (Yes, I checked a bag for an overnight trip. I'm high maintenance that way.)