How much does AMD's packaging expertise matter?

Quizzical Member Legendary Posts: 25,483
One of the problems with manufacturing computer chips is that, while a bunch of chips might be intended to be identical, at an atomic level, there's quite a bit of variation.  If the difference between what a chip is intended to be and what actually exists is small enough, it works fine.  But if something is wrong by a large enough margin, then it doesn't work properly.

Done naively, one defect means that a chip is useless and has to be thrown into the garbage.  Some chips can be partially salvaged.  For example, if you make a chip with four CPU cores and one of them doesn't work, you can sell it as a three-core CPU.  But whether a chip is salvageable depends on what is defective:  if you don't have redundancy that allows you to work around the defect, then the whole chip is garbage.  Putting in enough redundancy that you could work around just about anything would make the chip much larger and more expensive than necessary.

The difficulty of making computer chips work properly scales with the area.  More area means that you'll tend to have more defects.  You have to be able to work around every single defect that appears in a chip or else the whole chip is garbage.

A typical memory module has eight or so DRAM chips, though the number can vary.  Memory modules aren't built by taking eight random chips, sticking them on a module, and then testing to see if the whole thing works.  If they were, one defective chip would mean throwing away the whole module, including the other chips that are good.  Instead, the memory chips are tested first.  If you test memory chips before attaching them to a module and throw away the bad ones before they get used, then you have a much higher rate of assembled memory modules working properly.  Instead of one bad DRAM chip forcing you to throw away an entire module, including several good chips, you only have to throw away that one bad chip.

So why not do this with CPUs and GPUs?  Instead of making one huge GPU chip, make eight small chips that are each 1/8 of a huge GPU.  Test each chip, throw away the bad ones, and make a completed GPU out of the good ones.  That way, you waste a lot less silicon on defects.  It also means that you don't need as much redundancy inside of a chip to work around defects, as you've made it much cheaper to just throw away an entire chip.
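To put rough numbers on that, here's a minimal sketch using the classic Poisson yield model.  The die sizes and the defect density are made-up illustrative values, since real foundry defect rates aren't public:

    import math

    def poisson_yield(area_mm2, defects_per_cm2):
        # Fraction of dies that come out with zero defects under a Poisson model.
        expected_defects = defects_per_cm2 * area_mm2 / 100.0  # convert mm^2 to cm^2
        return math.exp(-expected_defects)

    D0 = 0.1  # hypothetical defect density per cm^2

    monolithic = poisson_yield(600, D0)   # one big 600 mm^2 die
    chiplet = poisson_yield(75, D0)       # one of eight 75 mm^2 chiplets

    print(f"600 mm^2 monolithic die yield: {monolithic:.1%}")      # ~54.9%
    print(f"75 mm^2 chiplet yield:         {chiplet:.1%}")         # ~92.8%
    print(f"All 8 chiplets good if untested: {chiplet ** 8:.1%}")  # ~54.9%, no better
    # Testing each chiplet before assembly is what captures the benefit: you only
    # discard the ~7% of chiplets that are bad instead of ~45% of full-sized dies.

The last line is the point: assembling untested chiplets would land you right back at monolithic-die yields, so it's the test-then-assemble step that does the work.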

The key difference is that the various DRAM chips on a memory module don't need to communicate with each other at all.  Whether you make one giant DRAM chip or a bunch of small ones, the total amount of I/O connecting the memory chips to other things (eventually the CPU's memory controller, though there are some other hops along the way) doesn't change.  As such, it's easy to make small chips for the sake of good yields, and that's what memory fabs do.

The various parts of a CPU do need to communicate with each other, however.  That's easy if they're part of the same chip, but harder if they're separate chips.  So long as you don't need very much bandwidth connecting components, you can do it by something akin to how different chips on a motherboard communicate.  Intel has built a big CPU out of two smaller ones several times over the decades, most famously in the Core 2 Quad, which had two chips with two cores each.

The downside of having a CPU split between multiple chips is that those chips probably can't communicate with each other very quickly.  It has long been noted that code that scales well to many CPU cores doesn't always scale well to multiple CPU sockets.  If several threads want to operate on the same memory space, but run on cores that are connected to different physical memory, then you end up copying data back and forth a lot, which kills your performance.

Depending on the particular architecture, having a CPU that is split into multiple chips can create similar problems.  AMD's first generation Threadripper and EPYC Naples had these problems in a single socket, due to different chips inside the CPU being connected to different physical memory modules.  Newer generations of AMD parts have fixed this issue by giving all CPU cores quicker access to all of the memory.

But that doesn't necessarily work well for a unified, last-level cache.  Most CPUs that had multiple cores but only one physical chip have had a unified L3 cache (or earlier, L2 cache) that all of the cores could see.  Many of AMD's CPUs with multiple chiplets couldn't do that.  Each chiplet could have a unified L3 cache shared by all of the cores on that chiplet (though some earlier AMD chiplets didn't even do that), but that means that switching threads between chiplets would often require reading back data from system memory rather than L3 cache.

That caused some programs to not scale to many cores on AMD CPUs nearly as well as they did on Intel CPUs that had a unified L3 cache available to all cores.  In the consumer space, this usually wasn't that big of a problem, though it sometimes meant that an 8-core CPU performed better than a 16-core CPU.  In the server market, where a single EPYC processor could have eight CPU chiplets in a single socket, it was a much bigger problem.
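This is also why software sometimes pins cooperating threads to one chiplet's cores rather than letting the scheduler spread them around.  A minimal Linux-only sketch; the assumption that cores 0-7 all sit on the same chiplet is hypothetical, so check the real topology (e.g., with lscpu -e) before relying on it:

    import os

    # Keep threads that share data on one chiplet's cores so they share an L3
    # cache instead of bouncing data through system memory.
    CHIPLET0_CORES = set(range(8))  # hypothetical: assumes cores 0-7 are one chiplet

    def pin_to_chiplet0():
        # PID 0 means "the calling process"
        os.sched_setaffinity(0, CHIPLET0_CORES)

    if __name__ == "__main__":
        pin_to_chiplet0()
        print("Restricted to cores:", sorted(os.sched_getaffinity(0)))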

For a GPU, it's much harder.  Having various GPU chiplets that can't communicate very well means that you have a single GPU inherit all of the traditional problems of CrossFire or SLI.  That would be a disaster and completely defeat the point of GPU chiplets.

In order for a multi-chip GPU merely to be functional like a monolithic GPU, the bandwidth requirements are pretty extreme.  The bandwidth connecting the graphics/compute die in the recently announced Radeon RX 7900 XTX to the memory/cache dies is 5.3 TB/s.  For comparison, the bandwidth to a 3200 MHz DDR4 chip that is one of eight on a module is 3.2 GB/s.  Do note the difference between TB/s and GB/s.  The former is a much harder problem than the latter.  Yes, the 7900 XTX has six memory/cache dies, but that's still slightly below 1 TB/s per die, and the graphics/compute die has to have the full bandwidth to all of the memory/cache dies at once.
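For anyone who wants the arithmetic spelled out, here it is as a quick sketch.  The x8 DDR4 chip width is an assumption (it's the common configuration, not the only one); the 7900 XTX figures are the ones from AMD's announcement:

    transfers_per_sec = 3200e6                         # DDR4-3200 runs at 3200 MT/s
    module_bw_gb = transfers_per_sec * 64 / 8 / 1e9    # 64-bit bus -> 25.6 GB/s
    chip_bw_gb = module_bw_gb / 8                      # one of eight x8 chips -> 3.2 GB/s

    gcd_to_mcd_tb = 5.3                                # total graphics die <-> memory/cache dies
    per_mcd_gb = gcd_to_mcd_tb * 1000 / 6              # six memory/cache dies -> ~883 GB/s

    print(f"DDR4-3200 module: {module_bw_gb:.1f} GB/s, per DRAM chip: {chip_bw_gb:.1f} GB/s")
    print(f"Per memory/cache die on the 7900 XTX: ~{per_mcd_gb:.0f} GB/s")
    print(f"Per-link ratio: ~{per_mcd_gb / chip_bw_gb:.0f}x")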

Comments

  • Quizzical Member Legendary Posts: 25,483
    Furthermore, AMD had to alter their design to bring the required bandwidth connecting chiplets down to 5.3 TB/s.  For many years now, GPUs have had a unified L2 cache shared by the entire chip.  Perhaps it was banked such that each memory controller had its own chunk or some such, but all of the compute units on the entire chip had full and equal access to all of the L2 cache.

    When AMD introduced their 128 MB "Infinity Cache" in the Radeon RX 6900 XT, I had expected it to be a much larger L2 cache.  It wasn't.  Instead, the chip had a traditional 4 MB L2 cache, together with a 128 MB L3 cache.  In contrast, the GeForce RTX 4090 simply made its L2 cache much larger than in the past, at 72 MB.

    At the time, I thought it was very weird to have one cache shared by the entire GPU, backed by another, larger cache shared by the entire GPU.  CPUs might have separate cache levels for a single core like that, but that's done for latency reasons.  On a GPU, once you're going to L2 cache, latency scarcely matters.  In absolute terms, a GPU likely takes longer to retrieve data from its on-die L2 cache than a CPU takes to retrieve data from DDR4 (or now DDR5) memory.

    The announcement of the Radeon RX 7900 XTX explained why.  That card moves the larger L3 cache onto the memory/cache dies, while keeping a smaller 4 MB L2 cache on the main graphics/compute die.  Why have that L2 cache at all?  For the same reason that GPUs have traditionally had L2 caches:  so that some memory fetch requests can find their data on the chip without having to jump to another chip.  Traditionally, that was so that you could greatly reduce the bandwidth of jumping off the GPU chip to GDDR5 memory or whatever.  In the case of the 7900 XTX, it was to reduce the bandwidth needed to jump to memory/cache chiplets.  The L3 cache on those chiplets is to reduce the bandwidth needed to jump to GDDR6 memory.  Without the L2 cache, the chip would probably need a lot more than 5.3 TB/sec of bandwidth connecting chiplets in order to work properly.
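    As a toy illustration of that filtering effect, with every number below made up rather than taken from AMD:

        # If the shader array demands some amount of data and the on-die L2 catches
        # a fraction of it, only the misses have to cross to the memory/cache dies.
        demand_tb_s = 10.0   # hypothetical aggregate demand from the shaders
        for l2_hit_rate in (0.0, 0.25, 0.5, 0.75):
            off_die_tb_s = demand_tb_s * (1.0 - l2_hit_rate)
            print(f"L2 hit rate {l2_hit_rate:.0%}: ~{off_die_tb_s:.1f} TB/s must cross between chiplets")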

    So how does AMD get so much bandwidth connecting those chiplets?  The answer is the same way that they've gotten a lot of bandwidth connecting HBM stacks to a GPU chip in the same package:  a silicon interposer.  Modern, high end GPU chips have hundreds of TB/s of bandwidth connecting their main registers to shaders.  That's not all connecting point A to point B.  It's a very large number of smaller connections over very short distances in parallel all over the chip.  But you can readily get that sort of cumulative bandwidth inside of a logic chip if you need it.

    A silicon interposer is basically a very simple chip whose only purpose is to provide a bunch of bandwidth connecting one chiplet to another.  It will use an older process node and consist of nothing but metal interconnect layers, and not very many of those, even.  That makes it much cheaper for its area than a normal logic chip.  But all of the chiplets in the main GPU have to be stacked on top of that other piece of silicon, which has to be large enough to fit all of the chiplets on it.  Needing to also have an interposer adds to the cost, as well as the complexity of assembling a completed GPU package and making it actually work.
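    To get a feel for the wire counts involved, here's a rough sketch using a hypothetical per-wire signalling rate (AMD's actual links may run faster or slower):

        target_bw_bytes_per_s = 5.3e12   # 5.3 TB/s total, per the announcement
        per_wire_bits_per_s = 10e9       # assume ~10 Gb/s per data wire (illustrative)

        data_wires = target_bw_bytes_per_s * 8 / per_wire_bits_per_s
        print(f"Roughly {data_wires:,.0f} data wires needed, before clocks and control")
        # Thousands of traces running a few millimetres between dies are easy to
        # draw in the metal layers of a piece of silicon and very hard to route
        # through an ordinary package substrate.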

    In a sense, the high-bandwidth silicon interposer isn't completely new.  AMD has been using them since their Fiji chip in 2015, where a 1200 mm^2 interposer connected the GPU chip proper to four stacks of HBM memory in the same package.  Nvidia has done similarly for their various chips that use HBM2.  One big difference is that Nvidia reserves it for high end compute parts that cost several thousand (or sometimes tens of thousands) of dollars each, while AMD has repeatedly used the technology in consumer hardware.  Another is that even Nvidia's upcoming H100 SXM part tops out at 3.35 TB/s of bandwidth over the interposer, while AMD's consumer Radeon RX 7900 XTX needs a lot more bandwidth for a part that will surely carry less than 1/10 of the price tag.

    So if breaking a large, monolithic chip into smaller chiplets causes so many problems (higher latency, awkward memory hierarchies, extra cost for an interposer, etc.), why use it?  Is the benefit of improved yields and the ability to mix process nodes really such a big deal?

    The answer to that is:  sometimes.  For a small chip, you don't need chiplets.  AMD still builds monolithic chips for their smaller parts.  If a monolithic chip to do what you need would only be 100 mm^2, then there's no need to add all of the extra cost and complexity of chiplets.  As the total amount of silicon that you want to squeeze into a package increases, however, the benefits of using chiplets also increase.

    One advantage of chiplets is that they allow you to make functional parts with more silicon in the package sooner on a given process node.  For example, on Intel 10 nm, they pretended to launch Cannon Lake dual core parts in May 2018, actually launched Ice Lake dual core parts in September 2019, and didn't get to 40-core Ice Lake-SP server parts until April 2021.  It's easier to make smaller chips than larger ones with acceptable yields on a given process node.

    If you're using chiplets, then you can make huge packages as soon as you can make small chiplets with acceptable yields, without having to wait until you can manage larger chips.  AMD's first CPUs on TSMC 5 nm were the Ryzen 7000 series that launched on September 27.  By November 10, they were ready to launch EPYC Genoa, with over 1000 mm^2 of silicon in a package, most of it on TSMC 5 nm.  Once the 70 mm^2 CPU complex die was ready to go for the consumer Ryzen parts, the same silicon was also ready to be used in server parts.

    AMD's chiplet approach has allowed them to make much larger server CPUs, much sooner than Intel.  Back when Intel's Cascade Lake parts topped out at 28 CPU cores per socket, AMD's EPYC Rome was able to pack 64 cores into a socket.  Intel's Ice Lake-SP got them a little closer at 40 CPU cores, but now AMD's EPYC Genoa can do 96 cores in a socket.  That allows AMD to offer high end server parts that Intel simply can't compete with.
  • Quizzical Member Legendary Posts: 25,483
    AMD hasn't yet done something analogous with GPUs, but they could.  AMD could build a GPU with an enormous graphics/compute die, ten or so memory/cache dies, about 900 mm^2 of total silicon, a 600 W TDP, and offer a much higher level of performance than Nvidia will be able to offer before they move to 3 nm in two years or so.  I don't know if AMD will do that, but they could.

    Whether they could is, of course, a very different question from whether they should.  AMD charges $11,805 for a 96-core EPYC 9654 server processor.  If they could charge similar prices for similar amounts of silicon in a GPU and have a large market for it, they surely would.  But the gamers who are complaining about $1600 for a GPU surely wouldn't jump at the chance to pay AMD more than seven times that.  AMD might have to charge $2000 or $3000 to make good money on such an enormous Radeon card, and the market for gamers willing to pay that may not be large enough to make such a part profitable.

    Making giant GPU compute parts is perhaps a more interesting market, and Nvidia likely makes more money doing that than selling GeForce cards.  The problem is that there isn't a big market for silicon without good drivers, and AMD's GPU compute drivers are a disaster.  If you want to be generous, you could say that AMD's ROCm drivers offer half-baked support for three different APIs, and good support for exactly nothing.  They don't even offer as good of compute support as what they used to offer in the old amdgpu-pro drivers that have long since dropped support for OpenCL.  Maybe AMD will make a more serious effort at making working GPU compute drivers at some point in the future, but they've been pretending that this was important to them for about 13 or so years now, and the situation hasn't improved much.

    The title implies that AMD has far greater packaging expertise than Intel or Nvidia, and that is certainly the case.  Intel's efforts at packing multiple chips into a single CPU have mostly gone poorly.  The Core 2 Quad was a fine part, but that was an era when the memory controllers were in the chipset, not the CPU, so the two CPU dies only needed to communicate with the chipset and not each other.  Cascade Lake-AP was nearly as big of a joke as Cannon Lake.  Sapphire Rapids has been tremendously delayed.  Intel promises that they'll mimic AMD's chiplet approach with what Intel calls "tiles" in Meteor Lake, but that's much easier than what AMD is doing, and it remains to be seen if Intel can pull it off.

    Nvidia has yet to make a multi-chip CPU or GPU at all.  They have had several parts with HBM2 on package, so they do know how to use interposers for that.  But they haven't demonstrated the ability to do that cheaply, and mediocre yields don't necessarily kill profitability if you can charge $10,000 for each card you sell.

    So back to the question of the title:  does AMD's packaging expertise matter?  In the server CPU market, it certainly does.  Similarly for the HEDT market, though that's a small market.  In consumer CPUs, it matters less:  chiplets surely make AMD's 12- and 16-core CPUs easier to build, but their CPUs with 8 cores or fewer aren't any better than they would be without chiplets.  Their laptop parts generally don't use chiplets at all.  Packaging expertise did allow AMD to stack cache for the Ryzen 7 5800X3D, but that has a limited market.

    What about the GPU market?  It would certainly matter to the GPU compute market if AMD could ever get their drivers in order.  But it surely matters far less to the consumer market.  How to make monolithic chips with acceptable yields is pretty well understood by now, and even in the 7900 XTX, most of the silicon in the package is in the single graphics/compute die.  It could make AMD more competitive at the high end of the consumer market if they want to be, but it's not clear that they want to be.  AMD has had plenty of generations where they simply didn't bother to build a large die, leaving Nvidia to have the high end of the market all to themselves.

    Even so, there is one benefit to the chiplet approach that I've only obliquely mentioned so far:  AMD can use the same chiplets in several different parts.  Rather than having several completely independent physical dies that mostly copy considerable chunks of the design from one to another, AMD can just reuse the same chiplets.  They can even bin the chiplets so that even semi-defective parts don't need to be thrown away.

    For example, AMD has long used exactly the same 8-core CPU chiplets in their consumer CPUs with 6-8 cores, their CPUs with 12-16 cores (two chiplets), their Threadripper HEDT parts, and their EPYC server CPUs.  Less widely known is that AMD offers some EPYC server parts that have several chiplets, but only two or four CPU cores active per chiplet, allowing them to sell the chiplets in which more than two cores are unusable.

    They could do something analogous with their GPU chiplets.  I'm expecting AMD to use exactly the same memory/cache die throughout the Radeon RX 7000 series.  Higher end parts will have more copies of that die, while lower end parts will have fewer.  That means no need to integrate their GDDR6 memory controller into several different chips and hope that you don't make any mistakes when copying it around.  The memory controllers and L3 cache are done for the whole series, unless they want a refresh to use higher clocked GDDR6 in the future.

    Does that justify the extra cost and complexity from needing a silicon interposer?  Maybe, but not necessarily.  That AMD is taking this approach may be evidence that they think it does.  That Nvidia is not is surely evidence that they think it doesn't.

    Even so, some parts are partially test parts in preparation for future generations.  Intel's 28-core server CPUs were much better than the contemporary 32-core EPYC Naples that was AMD's first generation of chiplet server CPUs, but the latter heralded an era of AMD dominating the top end of the market once they were able to work out the issues.  AMD going chiplets for GPUs might pay off handsomely in the future.  Or it might not.  Predicting the future is hard.