https://www.cnbc.com/2025/10/03/cerebras-withdraws-ipo-ai.html
OpenAI will spend $10B on custom chips from Broadcom
https://www.ft.com/content/e8cc6d99-d06e-4e9b-a54f-29317fa68d6f
Cerebras production capacity is probably fully booked for the next 3 years by big customers. If OpenAI thinks it is cost efficient, I don't see how Gemini, Grok, or Meta can disagree. The thing is, chip production capacity is limited by the number of TSMC fabs, and demand for AI is booming; the Cerebras WSE-3 is surely already sold out. Meta and Google are going to build data centers: Meta plans to spend at least $64 billion, Google $75 billion, and Microsoft $80 billion on facilities and equipment in 2025.
Amazon's "Project Rainier" is an example of massive AI-focused data center development. Chile, New Zealand, Saudi Arabia, and Taiwan will host new data center construction projects this year as the cloud giant races to meet the capacity demands of AI. To say nothing of China, the biggest AI chip market outside the US.
Do you have a source on "OpenAI thinks it is cost efficient"?
I am curious, because I actually think the opposite is true.
Any source for the claim that the WSE-3 is sold out and Cerebras production capacity is fully booked?
Curious where the claim that 4 WSEs are needed to reach 2,100 tokens per second comes from? Any source? And why do you think stringing together enough slow H100s could reach that inference speed? You may be conflating throughput with single-user inference speed. Yes, you can string together enough slower GPUs to get whatever throughput you desire, but that's an apples-to-oranges comparison, since Cerebras' performance number is single-user speed.
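To make the distinction concrete, a rough back-of-the-envelope (all numbers here are made-up assumptions, just to illustrate the difference):

    # Illustrative only: aggregate throughput vs. single-user speed.
    per_user_gpu = 50            # tokens/s one user sees on a GPU setup (assumed)
    concurrent_users = 100       # simultaneous streams in the batch (assumed)
    aggregate_gpu = per_user_gpu * concurrent_users   # 5,000 tokens/s total

    per_user_wse = 2100          # the single-user figure Cerebras quotes

    # Adding more GPUs scales aggregate_gpu almost linearly, but per_user_gpu
    # is set by per-token latency (memory bandwidth, interconnect hops), so it
    # doesn't move toward per_user_wse just by adding hardware.
    print(aggregate_gpu, per_user_wse)

So matching 2,100 tokens/s of throughput with H100s is easy; matching 2,100 tokens/s for one user is a different problem entirely.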
You can't fit a large model on a single WSE-3. That's why you need 4 of them.
Where did you get the source saying they can't fit? If you're just going by the SRAM they have, it's easy to draw the conclusion that it can't fit. But I don't think that's how it works: not all 400B parameters need to be on the wafer at the same time. They stream in weights as the transformer works layer by layer, so they only need enough SRAM to hold one layer's parameters. No?
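Rough numbers for what I mean (layer count and precision are assumed round figures, not anything Cerebras has published):

    # Back-of-the-envelope for the "one layer at a time" argument.
    params_total = 400e9                    # 400B-parameter model
    bytes_per_param = 2                     # assume FP16/BF16
    n_layers = 120                          # assumed layer count

    full_model_gb = params_total * bytes_per_param / 1e9   # ~800 GB
    per_layer_gb = full_model_gb / n_layers                # ~6.7 GB

    sram_per_wafer_gb = 44    # WSE-3 on-wafer SRAM, per Cerebras' published spec

    # Holding everything resident needs ~800 GB of SRAM (many wafers);
    # holding one layer at a time needs only a few GB per wafer, and the
    # question becomes how fast you can stream the next layer in.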
Maybe this will clarify: https://www.nextplatform.com/2024/10/25/cerebras-trains-llama-models-to-leap-over-gpus/
Thanks for the link! However, it doesn't really clarify much. It makes the same assumption that you have to load the entire model's parameters into SRAM, without citing sources such as someone actually trying it or Cerebras mentioning it explicitly. In fact, in the same post, Wang from Cerebras mentions that they split the model's layers across multiple wafers for larger models. Note that each layer is just a fraction of the total parameter count. Even so, that's a very simplistic view that doesn't consider any optimization techniques. It also betrays the fact that the author of the article doesn't know how transformer models work, nor how Cerebras loads the model weights. Anyway, still searching for a more technical explanation from someone who actually understands it and explains it well. No luck yet. Thanks.
Ah, I understand your confusion now. Sure, you can do what you're suggesting and load weights from HBM, but you won't get Cerebras speeds then. It will be bottlenecked by the 150 Gb/s bandwidth they haven't improved in 5 years.
The CS-3 system I/O bandwidth is 1.2 Tb/s (from their website). Where is the 150 Gb/s number from?
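For what it's worth, 1.2 Tb/s works out to 150 GB/s, so the two figures might be the same number with the units mixed up (Gb/s vs GB/s). Either way, a quick sketch of what that link speed means if weights had to be streamed in for every token (the model size is an assumed round number):

    # If weights cross the system I/O link once per token, single-user speed
    # is capped at link bandwidth / bytes moved per token.
    io_bytes_per_s = 1.2e12 / 8     # 1.2 Tb/s ~= 150 GB/s (CS-3 figure above)
    model_bytes = 400e9 * 2         # assumed dense 400B model at 2 bytes/param, ~800 GB

    max_single_user_tps = io_bytes_per_s / model_bytes   # ~0.19 tokens/s
    print(max_single_user_tps)

    # Orders of magnitude below 2,100 tokens/s, which is why fast single-user
    # numbers imply the weights are sitting in on-wafer SRAM (possibly split
    # across wafers) rather than being streamed in per token.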