One of the problems with manufacturing computer chips is that, while a batch of chips might be intended to be identical, there's quite a bit of variation at the atomic level. If the difference between what a chip is intended to be and what actually exists is small enough, the chip works fine. But if something is off by a large enough margin, it doesn't work properly.
Done naively, one defect means that a chip is useless and has to be thrown into the garbage. Some chips can be partially salvaged. For example, if you make a chip with four CPU cores and one of them doesn't work, you can sell it as a three-core CPU. But whether a chip is salvageable depends on what is defective: if you don't have redundancy that allows you to work around the defect, then the whole chip is garbage. And putting in enough redundancy to work around just about anything would make the chip much larger and more expensive than necessary.
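To make the salvage math concrete, here's a minimal sketch in Python, using a made-up per-core defect probability, comparing the yield of a four-core die when every core must work against the yield when a die with one bad core can still be sold as a three-core part.

```python
from math import comb

# Assumed probability that any given core comes out defective (illustrative only).
p_core_bad = 0.05

# Yield if all four cores must work.
all_four_good = (1 - p_core_bad) ** 4

# Yield if the die is sellable with at most one bad core
# (full 4-core SKU or salvaged 3-core SKU).
at_most_one_bad = sum(
    comb(4, k) * p_core_bad**k * (1 - p_core_bad) ** (4 - k) for k in range(2)
)

print(f"Sellable with 4/4 cores working:  {all_four_good:.1%}")
print(f"Sellable with >=3/4 cores working: {at_most_one_bad:.1%}")
```

This only counts defects that land inside a core; a defect anywhere without redundancy still kills the whole die, which is exactly the point above.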
The difficulty of making computer chips work properly scales with the area. More area means that you'll tend to have more defects. You have to be able to work around every single defect that appears in a chip or else the whole chip is garbage.
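A common way to reason about this is a simple Poisson-style yield model, where the chance of a die being defect-free falls off exponentially with its area. The sketch below uses an assumed defect density purely for illustration.

```python
from math import exp

# Assumed defect density in defects per square centimeter (illustrative only).
defects_per_cm2 = 0.1

def yield_for_area(area_cm2: float) -> float:
    """Poisson yield model: probability a die of a given area has zero defects."""
    return exp(-defects_per_cm2 * area_cm2)

for area in (0.5, 1.0, 2.0, 4.0, 6.0):  # die area in cm^2
    print(f"{area:>4.1f} cm^2 die: {yield_for_area(area):.1%} defect-free")
```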
A typical memory module has eight or so DRAM chips, though the number can vary. Memory modules aren't built by taking eight random chips, sticking them on a module, and then testing to see if the whole thing works. If they were, one defective chip would mean throwing away the whole module, including the other chips that are good. Instead, the memory chips are tested first, and the bad ones are thrown away before they ever get attached to a module. That gives a much higher rate of assembled memory modules working properly: instead of one bad DRAM chip costing you an entire module, including several good chips, it only costs you that one bad chip.
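As a rough illustration, assuming an arbitrary per-chip yield and eight chips per module, here's the difference between assembling modules from untested chips and testing the chips first.

```python
# Assumed fraction of DRAM chips that work (illustrative only).
chip_yield = 0.95
chips_per_module = 8

# Untested chips: the module only works if all eight chips happen to be good,
# and a bad module takes seven good chips to the garbage with it.
module_yield_untested = chip_yield ** chips_per_module

# Tested chips: bad chips are discarded before assembly, so every assembled
# module works and the only waste is the bad chips themselves.
print(f"Module yield from untested chips:  {module_yield_untested:.1%}")
print("Module yield from pre-tested chips: ~100%")
print(f"Chips wasted per 100 when testing first: {100 * (1 - chip_yield):.0f}")
```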
So why not do this with CPUs and GPUs? Instead of making one huge GPU chip, make eight small chips that are each 1/8 of a huge GPU. Test each chip, throw away the bad ones, and make a completed GPU out of the good ones. That way, far fewer chips get wasted to defects. It also means that you don't need as much redundancy inside of a chip to work around defects, since you've made it much cheaper to just throw away an entire chip.
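Plugging the same Poisson-style yield model from above into the GPU example, again with an assumed defect density and die size and ignoring the area cost of any extra interconnect, the sketch below compares one big die against eight dies that are each 1/8 the size.

```python
from math import exp

defects_per_cm2 = 0.2   # assumed defect density (illustrative only)
big_die_cm2 = 6.0       # assumed area of the monolithic GPU die

# The single big die must be entirely defect-free (no redundancy assumed).
big_die_yield = exp(-defects_per_cm2 * big_die_cm2)

# One chiplet that is 1/8 of the area.
chiplet_yield = exp(-defects_per_cm2 * big_die_cm2 / 8)

print(f"Monolithic die yield: {big_die_yield:.1%}")
print(f"Per-chiplet yield:    {chiplet_yield:.1%}")

# With pre-tested chiplets, the wasted silicon is just the bad chiplets,
# rather than the whole area of every die that contains any defect.
print(f"Silicon wasted (chiplets): {1 - chiplet_yield:.1%}")
print(f"Silicon wasted (monolith): {1 - big_die_yield:.1%}")
```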
The key difference is that the various DRAM chips on a memory module don't need to communicate with each other at all. Whether you make one giant DRAM chip or a bunch of small ones, the total amount of I/O connecting the memory chips to other things (eventually the CPU's memory controller, though there are some other hops along the way) doesn't change. As such, it's easy to make small chips for the sake of good yields, and that's what memory fabs do.
The various parts of a CPU do need to communicate with each other, however. That's easy if they're part of the same chip, but harder if they're separate chips. So long as you don't need very much bandwidth connecting components, you can do it with something akin to how different chips on a motherboard communicate. Intel has built a big CPU out of two smaller ones several times over the decades, most famously in the Core 2 Quad, which had two chips with two cores each.
The downside of having a CPU split between multiple chips is that those chips can't communicate with each other very quickly. It has long been noted that code that scales well to many CPU cores doesn't always scale well to multiple CPU sockets. If several threads want to operate on the same memory, but run on cores that are connected to different physical memory, then you end up copying data back and forth a lot, which kills your performance.
Depending on the particular architecture, having a CPU that is split into multiple chips can create similar problems. AMD's first generation Threadripper and EPYC Naples had these problems in a single socket, due to different chips inside the CPU being connected to different physical memory modules. Newer generations of AMD parts have fixed this issue by giving all CPU cores quicker access to all of the memory.
But that doesn't necessarily work well for a unified last-level cache. Most CPUs with multiple cores on one physical chip have had a unified L3 cache (or, earlier, L2 cache) that all of the cores could see. Many of AMD's CPUs with multiple chiplets couldn't do that. Each chiplet could have a unified L3 cache shared by all of the cores on that chiplet (though some earlier AMD chiplets didn't even do that), but that means that moving a thread between chiplets often requires reading data back from system memory rather than from L3 cache.
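As a back-of-the-envelope illustration of why that hurts (all numbers below are assumed orders of magnitude, not measurements of any particular CPU), here's roughly what it costs a thread to repopulate its working set from DRAM after migrating to a different chiplet, versus rereading it from a shared L3.

```python
# All numbers are assumed for illustration, not measurements of any particular CPU.
working_set_mb = 16          # data the migrated thread needs again
l3_bandwidth_gb_s = 500      # rough order of magnitude for reading from L3
dram_bandwidth_gb_s = 50     # rough order of magnitude for reading from DRAM

def refill_time_us(size_mb: float, bandwidth_gb_s: float) -> float:
    """Time in microseconds to stream size_mb at the given bandwidth."""
    return size_mb / 1024 / bandwidth_gb_s * 1e6

print(f"Refill from shared L3:   {refill_time_us(working_set_mb, l3_bandwidth_gb_s):.0f} us")
print(f"Refill from system DRAM: {refill_time_us(working_set_mb, dram_bandwidth_gb_s):.0f} us")
```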
That caused some programs not to scale to many cores nearly as well on AMD CPUs as they did on Intel CPUs with a unified L3 cache available to all cores. In the consumer space, this usually wasn't that big of a problem, though it sometimes meant that an 8-core CPU performed better than a 16-core CPU. In the server market, where a single EPYC processor could have eight CPU chiplets in a single socket, it was a much bigger problem.
For a GPU, it's much harder. Having various GPU chiplets that can't communicate very well means that a single GPU inherits all of the traditional problems of CrossFire or SLI. That would be a disaster and completely defeat the point of GPU chiplets.
For a multi-chip GPU merely to function like a monolithic GPU, the bandwidth requirements are pretty extreme. The bandwidth connecting the graphics/compute die in the recently announced Radeon RX 7900 XTX to its memory/cache dies is 5.3 TB/s. For comparison, the bandwidth to a DDR4-3200 chip that is one of eight on a module is 3.2 GB/s. Do note the difference between TB/s and GB/s: the former is a much harder problem than the latter. Yes, the 7900 XTX has six memory/cache dies, but that's still slightly below 1 TB/s per die, and the graphics/compute die has to have the full bandwidth to all of the memory/cache dies at once.
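The arithmetic behind those numbers, as a quick sketch (the per-chip DDR4 figure assumes a standard 64-bit DDR4-3200 module split evenly across eight x8 chips):

```python
# Radeon RX 7900 XTX figures from the paragraph above.
total_mcd_bandwidth_tb_s = 5.3   # graphics/compute die <-> memory/cache dies
memory_cache_dies = 6
print(f"Per memory/cache die: {total_mcd_bandwidth_tb_s / memory_cache_dies:.2f} TB/s")

# DDR4-3200 module: 3200 MT/s on a 64-bit (8-byte) bus, split across eight chips.
module_bandwidth_gb_s = 3200e6 * 8 / 1e9   # 25.6 GB/s for the whole module
per_chip_gb_s = module_bandwidth_gb_s / 8  # each chip drives 8 of the 64 bits
print(f"Per DRAM chip: {per_chip_gb_s:.1f} GB/s")

# The gap between the two, per die/chip.
print(f"Ratio: {total_mcd_bandwidth_tb_s * 1000 / memory_cache_dies / per_chip_gb_s:.0f}x")
```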
Comments
Even so, some parts are partly test vehicles in preparation for future generations. Intel's 28-core server CPUs were much better than the contemporary 32-core EPYC Naples, AMD's first generation of chiplet server CPUs, but the latter heralded an era of AMD dominating the top end of the market once they were able to work out the issues. AMD going with chiplets for GPUs might pay off handsomely in the future. Or it might not. Predicting the future is hard.