https://www.anandtech.com/show/15838/cerebras-wafer-scale-engine-scores-a-sale-5m-buys-two-for-the-pittsburgh-supercomputing-center
The TU102 GPU in Nvidia's GeForce RTX 2080 Ti is huge, at about 754 mm^2. It has about 18.6 billion transistors. That makes the cards expensive to build, which is why they cost a fortune.
So Cerebras decided to build a chip that is 46225 mm^2 and has about 1.2 trillion transistors. It also uses 20 kW of power. Take that, Nvidia.
One of the reasons why enormous chips are so expensive and difficult to build is that you get yield problems. At an atomic scale, foundries aren't able to put every single atom exactly where they intended. There is some amount of tolerance where it's okay to be off by a nanometer here and there. But defects outside of your tolerance range will still happen. If you average one defect per 100 mm^2 of silicon, small chips can work fine, but big chips are almost all going to be defective.
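To put rough numbers on that, here is a minimal sketch using the simple Poisson yield model, which assumes defects land independently and uniformly across the silicon. The model and the function name `poisson_yield` are my own choices for illustration, not anything a foundry has published; the defect density is the one-per-100 mm^2 figure from the paragraph above.

```python
import math

def poisson_yield(area_mm2, defects_per_mm2):
    """Probability that a die of the given area has zero defects,
    assuming defects follow a Poisson distribution (an assumption)."""
    return math.exp(-area_mm2 * defects_per_mm2)

# One defect per 100 mm^2 on average, the illustrative figure used above.
DEFECT_DENSITY = 1 / 100

dies = [
    ("small die, 100 mm^2", 100),
    ("TU102, 754 mm^2", 754),
    ("Cerebras wafer, 46225 mm^2", 46225),
]

for name, area in dies:
    print(f"{name}: fraction with zero defects = {poisson_yield(area, DEFECT_DENSITY):.3g}")
```

Under this model about 37% of the 100 mm^2 dies come out with no defects at all, well under 0.1% of the 754 mm^2 dies do, and for the full wafer the number is effectively zero. Real defect densities are lower than one per 100 mm^2, but the shape of the curve is the point.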
You can work around defects to some degree with redundancy. Wherever you need one of something in your chip, build two instead, with the ability for the chip to use either one. So long as at least one of the two small components works, you're fine. AMD had to do this with vias on TSMC's 40 nm process node. Nvidia also needed to do it, but didn't realize it soon enough, which is why the GeForce 400 series GPUs had such awful yields. The problem with such redundancy is that it adds area and hence cost.
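A quick sketch of why doubling up helps so much, assuming each small component fails independently with the same probability. The component count and failure rate below are made-up numbers for illustration, and `chip_yield` is a hypothetical helper, not a model of any real chip.

```python
def chip_yield(n_components, p_fail, duplicated):
    """Probability that every one of n_components positions on the chip works.

    If duplicated is True, each position has two copies and only fails
    when both copies fail (failures assumed independent).
    """
    p_position_fails = p_fail ** 2 if duplicated else p_fail
    return (1 - p_position_fails) ** n_components

# Hypothetical numbers: a million components, each bad one time in a million.
N = 1_000_000
P_FAIL = 1e-6

print("single copies: ", chip_yield(N, P_FAIL, duplicated=False))
print("doubled copies:", chip_yield(N, P_FAIL, duplicated=True))
```

With these numbers, single copies give a working chip roughly 37% of the time, while doubled copies give one more than 99.999% of the time. The cost is that every doubled component takes up twice the area.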
You can add redundancy at a larger scale, too. Think of the PS3 having a seven core Cell processor that actually had eight physical cores, one of which was disabled. Or the PS4 having a GPU with 20 physical compute units, 2 of which were disabled to leave a GPU that had 18 working compute units. Which ones did they disable? If one or two were defective, those would be the ones to disable, which allowed them to still sell chips that had some defects.
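The effect of a spare core can be sketched with a binomial calculation, assuming each core is good independently with the same probability. The 90% per-core figure is an invented number for illustration, not a real yield for Cell or the PS4 chip, and `p_at_least` is a hypothetical helper.

```python
from math import comb

def p_at_least(k, n, p_good):
    """Probability that at least k of n cores are good,
    assuming each core is good independently with probability p_good."""
    return sum(
        comb(n, i) * p_good ** i * (1 - p_good) ** (n - i)
        for i in range(k, n + 1)
    )

P_GOOD = 0.9  # hypothetical per-core yield, for illustration only

# Eight physical cores, all eight required versus seven required.
print("need 8 of 8:  ", p_at_least(8, 8, P_GOOD))
print("need 7 of 8:  ", p_at_least(7, 8, P_GOOD))

# Twenty physical compute units, all twenty required versus eighteen required.
print("need 20 of 20:", p_at_least(20, 20, P_GOOD))
print("need 18 of 20:", p_at_least(18, 20, P_GOOD))
```

At 90% per core, requiring all eight cores gives a sellable chip about 43% of the time, while requiring only seven pushes that to about 81%.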
If you try to build a 46225 mm^2 chip, you're going to have a whole lot of defects. So that means you need a whole lot of redundancy. One defect that you don't have the redundancy to manage means throwing away the entire wafer, and those cost several thousand dollars each.
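For a sense of how much redundancy that is, here is a rough sketch under some loud assumptions: the defect count is Poisson distributed, the density is the illustrative one defect per 100 mm^2 from earlier, and each defect knocks out exactly one core that a spare can replace. None of that describes how Cerebras actually handles defects; `spares_needed` is a hypothetical helper.

```python
import math

def spares_needed(expected_defects, target):
    """Smallest number of spares k such that P(defects <= k) >= target,
    assuming the defect count is Poisson with the given mean."""
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")
    cumulative = 0.0
    # Generous upper bound on k so the loop always terminates.
    for k in range(int(expected_defects * 10) + 100):
        # Poisson pmf computed in log space to avoid overflow for large means.
        log_pmf = k * math.log(expected_defects) - expected_defects - math.lgamma(k + 1)
        cumulative += math.exp(log_pmf)
        if cumulative >= target:
            return k
    raise RuntimeError("target not reached; it is probably too close to 1")

WAFER_AREA_MM2 = 46225
DEFECTS_PER_MM2 = 1 / 100  # illustrative density from earlier in the post

mean_defects = WAFER_AREA_MM2 * DEFECTS_PER_MM2
print("expected defects per wafer:", mean_defects)
print("spares for 99% of wafers to survive:", spares_needed(mean_defects, 0.99))
```

The expected count is about 462 defects per wafer, and the number of spares needed to save 99% of wafers comes out somewhat above that, since you have to cover the unlucky wafers and not just the average one.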
One question that raises is how? How is this even possible? On TSMC's 28 nm process node, the reticle limit (and hence largest possible chips) was a little over 600 mm^2. Nvidia reportedly paid TSMC a bunch of money to increase that on 12 nm, which allowed chips a little over 800 mm^2. So how do you get from there to 46225 mm^2?
One answer to that is that the wafer has a bunch of "separate" chips that are linked together by some fancy cross-reticle patterning. Normally, you make a wafer with 100 chips in it or whatever, then cut the wafer apart to have the hundred different chips that will go into a hundred different video cards or whatever. Or maybe eighty working chips that go into eighty different video cards, with the other twenty being thrown in the garbage as defective. Cerebras instead leaves the whole wafer intact, and the various different chips can send data to their neighbors, albeit less efficiently than transferring data within a single chip.
Another question is why? Why would anyone do this? As with a lot of the prominent, large ASICs being made these days, the answer is machine learning. There are some problems where you'd like to have a bunch of chips with a ton of bandwidth connecting them. The bandwidth you can get from cross-reticle patterning may be massively less than you can get from the same wafer space doing stuff within a single reticle area, but it's still a whole lot more than you can get from motherboard traces. So for problems where you need ridiculous amounts of bandwidth connecting your chip, this makes sense, at least if you can get it to work properly soon enough and at a reasonable cost.
Ah yes, the cost. Cerebras insists that there is no minimum order size for their machines. If you want just one, you can buy just one. But they cost a few million dollars each. Makes that RTX 2080 Ti for only $1200 look like a relative bargain, doesn't it?
Comments
A turtle doesn't move when it sticks its neck out
Rehoboam, ftw.
Westworld's giant computer.
EQ1, EQ2, SWG, SWTOR, GW, GW2, CoH, CoV, FFXI, WoW, CO, War, TSW and a slew of free trials and beta tests
“Microtransactions? In a single player role-playing game? Are you nuts?”
― CD PROJEKT RED