https://www.anandtech.com/show/15801/nvidia-announces-ampere-architecture-and-a100-products
https://videocardz.com/press-release/nvidia-announces-ampere-ga100-gpu

If you think of it as primarily being a GPU, then the paper specs are thoroughly unimpressive, in the sense that you'd hope Nvidia would get a lot more out of a die shrink than that. As compared to a Tesla V100, the clock speed goes down. The number of shaders goes up, so net performance goes up with it. But power consumption is way up, so FLOPS/watt goes down. That Nvidia could pack more than 54 billion transistors into a single die that actually works is impressive in some esoteric sense, but they didn't get a whole lot of GPU benefit out of it.
But I hesitate to call it a GPU. For that matter, I'd hesitate to even call it a compute card. Calling it a tensor computation ASIC isn't quite right, either, though it certainly has a whole lot of that. As best as I can tell, they used about half of the die area for tensor cores.
So what are tensor cores anyway? The idea of tensor cores in Volta/Turing is that you can do a 4x4 matrix multiply-add of half-precision floats. That is, compute AB + C as matrices (where AB is a matrix multiply), where A, B, and C are all matrices of half-precision (16-bit) floats. Instead of reading data from registers, doing a computation, and writing the output back to registers, you read data once, then do a whole chain of computations where it is hard-wired that the output of one is the input to the next, and then you only write the final output back to registers. That's awesome if the hard-wired chain of computations is exactly what you need, and completely useless otherwise.
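Here's roughly what that looks like from the programmer's side. This is a minimal CUDA sketch using the warp-level wmma API, which exposes the operation as (for one supported shape) a 16x16x16 fused multiply-accumulate over fragments that the hardware carries out as those hard-wired 4x4 steps; the kernel name and tile layout are just illustrative.

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // Launch with one full warp, e.g. tile_mma<<<1, 32>>>(A, B, C, D);
    __global__ void tile_mma(const half *A, const half *B, const float *C, float *D) {
        // Warp-level fragments for a 16x16x16 tile of D = A*B + C.
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

        // One read of the operands...
        wmma::load_matrix_sync(a_frag, A, 16);
        wmma::load_matrix_sync(b_frag, B, 16);
        wmma::load_matrix_sync(acc_frag, C, 16, wmma::mem_row_major);

        // ...one hard-wired chain of multiplies and adds...
        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);

        // ...and one write of the final result back out.
        wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
    }

The whole point is the middle line: everything between the loads and the store is one fused operation, rather than a long series of individual multiplies and adds each writing its result back to registers.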
Ampere seems to have extended that in several ways. First, based on throughput numbers, it appears they're now using 8x8 matrices rather than 4x4. Second, it can do internal computations as "tensor float32", which basically means this weird 19-bit data type that they made up, using an 8-bit exponent (same as a 32-bit float) with a 10-bit mantissa (same as a 16-bit half). That offers more precision than standard half-precision multiplication, but less than a real 32-bit float. Third, they also offer tensor cores with 8-bit and 4-bit integers, for even higher throughput if very low precision is acceptable. Adding all that extra logic for tensor cores takes a ton of die space.
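To make the format concrete: TF32 keeps fp32's sign bit and 8-bit exponent but only the top 10 of its 23 mantissa bits. Here's a rough host-side sketch of what that does to a value (this just truncates the mantissa, where real hardware would round; the helper name and test value are made up for illustration):

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    // Approximate a float at TF32 precision: keep the sign, the 8-bit exponent,
    // and the top 10 mantissa bits; zero out the low 13 mantissa bits.
    static float to_tf32_approx(float x) {
        uint32_t bits;
        std::memcpy(&bits, &x, sizeof bits);
        bits &= 0xFFFFE000u;   // drop 13 of fp32's 23 mantissa bits
        std::memcpy(&x, &bits, sizeof bits);
        return x;
    }

    int main() {
        float v = 1.001f;
        // Prints something like: fp32 1.001000047  tf32-ish 1.000976562
        std::printf("fp32 %.9f  tf32-ish %.9f\n", v, to_tf32_approx(v));
        return 0;
    }

So you keep fp32's range, but the spacing between representable values is about 2^13 (roughly 8000) times coarser.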
The general principle that you can't beat an ASIC applies in full force here. If doing matrix multiplication with extremely large matrices of low precision data is your workload, then Nvidia's A100 card is going to be awesome for you. And if that's not your workload, then there's no reason for you to care about the card at all. I'm not aware of any consumer workloads where that's useful at all. It's also completely useless for nearly all compute workloads. Some types of machine learning consist mostly of doing things that fit the new tensor cores very nicely, however, and the card is a near-ASIC for those particular workloads.
The card makes a ton of sense to build if the people doing those particular types of machine learning are willing to spend billions of dollars on dedicated hardware for their particular workloads, and not much sense otherwise. Bitmain made a whole lot of money selling bitcoin-mining ASICs, after all, and those can only handle one highly specialized workload.
Several years ago, Nvidia CEO Jensen Huang said something to the effect that, if you want to build a GPU that can do 20 TFLOPS single precision, it's not very hard to do. You just lay out a bunch of shaders and there you go, 20 TFLOPS. But that's a dumb thing to do, because the hard part is moving data around so that you can actually put all of that compute to good use. When he said that, there wasn't any GPU on the market that offered even 10 TFLOPS. Now Nvidia is offering 156 TFLOPS of single precision (via the TF32 tensor cores), at the cost of it being very, very hard to use. It's as if Nvidia decided to actually build the card that he talked about.
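As a back-of-envelope check on where that 156 comes from, using Nvidia's published A100 figures (108 SMs, four tensor cores per SM, a roughly 1.41 GHz boost clock) and assuming 128 TF32 fused multiply-adds per tensor core per clock:

    #include <cstdio>

    int main() {
        // Published A100 figures (assumed here, not taken from this article).
        const double sms         = 108;      // active SMs
        const double tc_per_sm   = 4;        // tensor cores per SM
        const double fma_per_clk = 128;      // TF32 FMAs per tensor core per clock
        const double clock_hz    = 1.41e9;   // boost clock
        // 2 FLOPs per fused multiply-add.
        const double tflops = sms * tc_per_sm * fma_per_clk * 2.0 * clock_hz / 1e12;
        std::printf("~%.0f TFLOPS TF32, dense\n", tflops);   // ~156
        return 0;
    }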
It's also worth pointing out the L2 cache. It's a 40 MB L2 cache. For comparison, the Tesla V100 had 6 MB, which I think was the largest any GPU had ever used before the new A100. GPUs have traditionally kept their L2 caches relatively small compared to their register files: the A100 has 27 MB of registers, and the V100 had 20 MB. For another comparison, one of AMD's 74 mm^2 Zen 2 chiplets has 32 MB of L3 cache. But still, that 40 MB of L2 cache takes a significant amount of die space. It's interesting that Nvidia went that route now after not doing so in the past.
What gamers surely want to know is, what does this tell us about Nvidia's upcoming Ampere-based GeForce cards? The answer to that is probably not a whole lot. Having tensor cores eating up a large fraction of the die size and power budget means that we can't really glean much about efficiency. If the GeForce cards go as heavy on tensor cores as the A100 does, then Ampere won't be much of an advance over Turing and will probably be slaughtered by AMD's upcoming Navi 2X. I don't think Nvidia would be dumb enough to do that, but then, I didn't think they'd be dumb enough to put tensor cores in Turing, either. But they were, and it meant learning the hard way that there just aren't very many gamers willing to pay $1200 for a video card that is only 50% faster than a $400 card. We'll see if they double down on that blunder.
Comments
NVidia's Volta and Turing cards had somewhere between 23 and 26 million transistors per mm^2.
AMD's current Navi cards have about 41 million transistors per mm^2.
Now A100 manages to pack 65.6 million transistors per mm^2, while keeping the die size just as large as NVidia's Volta and Turing dies.
If they can replicate that with their next gen consumer GPUs, we could see huge performance increases just from how many transistors those GPUs would pack, though I'm afraid to think what they would cost.
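For what it's worth, that density figure checks out from the headline numbers, assuming the reported ~54.2 billion transistors and ~826 mm^2 die area for GA100 (neither figure is from this thread):

    #include <cstdio>

    int main() {
        const double transistors = 54.2e9;   // reported GA100 transistor count
        const double die_mm2     = 826.0;    // reported GA100 die area
        // Prints roughly 65.6
        std::printf("%.1f million transistors per mm^2\n", transistors / die_mm2 / 1e6);
        return 0;
    }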
But that leaves the question of, what about traditional purposes of GPUs? You wouldn't want to use this monstrosity for graphics. It probably would outperform anything else on the market today at gaming, but just by taking out the tensor cores, you could get that same gaming performance at something like half the die size, half the power consumption, and a quarter of the cost.
Now, Nvidia has commonly put some compute stuff into their top end GPUs that the rest of the lineup didn't get. But what about the GPU compute market that needs ECC memory or double precision or HBM2? The rest of the traditional GPU lineup isn't going to work for them, and this will be a dumb part for their use. Does Nvidia make a second huge compute GPU in the same generation for non-machine learning problems? Has Nvidia concluded that the GPU compute market other than for machine learning isn't large enough to care about anymore?
And even for those who do want machine learning, does it really have to be a 400 W chip? Yikes! Trying to cool that on air is a bad idea. That's going to need a robust liquid cooling system. And forget about trying to buy one and stick it into a server yourself. Needing weird, custom cooling like that causes all sorts of problems. Look at how popular Intel's 400 W Cascade Lake-AP chips aren't.
Of course, building one weird chip doesn't mean that you can't also build other, more normal ones. I'm hoping that GA100 is just a weird oddball, and that Ampere will also have a more traditional full lineup--including a more normal high end compute GPU. But Nvidia talked about how Ampere is going to be a more unified architecture than Volta/Turing, and every GPU should be able to run everything. It's trivial to emulate tensor cores by using your standard fma, and I'm hoping that's what they meant.
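For what it's worth, here's a minimal sketch of what "emulate tensor cores with standard fma" means for a single 4x4 half-precision tile; this is illustrative per-thread CUDA, not whatever fallback path Nvidia would actually ship:

    #include <cuda_fp16.h>

    // Compute C = A*B + C for one 4x4 tile using ordinary FMAs
    // instead of a tensor core instruction.
    __device__ void mma4x4_with_fma(const half A[4][4], const half B[4][4],
                                    float C[4][4]) {
        for (int i = 0; i < 4; ++i) {
            for (int j = 0; j < 4; ++j) {
                float acc = C[i][j];
                for (int k = 0; k < 4; ++k) {
                    acc = fmaf(__half2float(A[i][k]), __half2float(B[k][j]), acc);
                }
                C[i][j] = acc;
            }
        }
    }

The math is trivially expressible with existing instructions; what the dedicated hardware buys you is doing all of it in far fewer clocks, without the per-FMA register traffic.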
If Nvidia tries to put that much tensor junk into their Ampere GPUs all up and down the lineup, then Ampere won't even be much of an advance over Pascal, and will surely get smoked by Navi 2X. That would basically be a rerun of how AMD dominated the market for a few years with their Radeon HD 4000-6000 series cards. This A100 would probably lose to a GeForce GTX 1080 from about four years ago on gaming performance per mm^2, in spite of being a die shrink with massively higher transistor density.