In a sense, they already know how to handle a lot more heat than current video cards put out: put a water block on the card. What they're trying to do here is to get better cooling for a lot less money than that. If they come up with something really good, it can be used in millions of video cards across many years. So whatever innovations Nvidia comes up with don't necessarily add very much to the cost of the video card.
That said, higher power consumption does tend to add to the cost of a video card. It's not just the cost of a bigger, heavier cooler, though that is part of it. There's also the power delivery needed to get that power to the GPU. But perhaps most important, in order to burn that much power without being wasteful, you end up with enormous GPU chips, and those cost a lot of money to build.
There have been video cards in the past that had three 8-pin PCI-E connectors. That can deliver 525 W to a GPU while remaining in spec (or at least keeping each of the power connectors in spec), though it was done largely to allow for overclocking. Apparently Nvidia is now saying that isn't good enough and they're creating a new, 12-pin connector.
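For reference, a quick sketch of the in-spec arithmetic behind that 525 W figure, using the per-connector limits from the PCI-E spec (the connector counts here are just examples):

```python
# Power available to a graphics card while keeping every connector within the
# PCI Express spec: the x16 slot itself plus each auxiliary power connector.
PCIE_SLOT_W = 75     # maximum the x16 slot is allowed to supply
EIGHT_PIN_W = 150    # per 8-pin auxiliary connector
SIX_PIN_W = 75       # per 6-pin auxiliary connector

def in_spec_budget(num_8pin: int, num_6pin: int = 0) -> int:
    """Total board power, in watts, that keeps each connector within its rating."""
    return PCIE_SLOT_W + num_8pin * EIGHT_PIN_W + num_6pin * SIX_PIN_W

print(in_spec_budget(3))  # 525 W with three 8-pin connectors, as described above
print(in_spec_budget(2))  # 375 W, the more typical high-end configuration
```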
The reliability of extremely high power video cards historically isn't very good. The GeForce GTX 590 was perhaps the most infamous case, but the Radeon HD 4870 X2 was quite a problem, too. Toward the end of the dual-GPU card era, AMD completely gave up on air cooling them and just liquid cooled the cards.
One concern with trying to dissipate so much heat is that it's easy to miss something. People who check GPU temperatures are generally checking the GPU chips proper, but those aren't the only things that need to be cooled. The GTX 590 focused too narrowly on keeping the GPU chips cool, and managed to cool them adequately, but had some other parts of the card reach 120 C. The Radeon HD 5970 also focused too narrowly on the GPU chips, and had the VRMs constantly overheat and force the whole card to throttle back. So if Nvidia does put out a 400 W card, there's a pretty considerable risk that they'll botch the cooling on it.
The PCI Express specification allows a maximum power consumption of 300 W. Past attempts at going over that have largely confirmed that the limit was a good idea. We'll see how this round goes, but Nvidia is certainly aware that it's hard to air cool more than 300 W in the usual PCI Express form factor without things going awry.
Historically, high-power GPUs have been driven by inefficient architectures. Nvidia's Fermi architecture is perhaps the most notorious example. AMD's Polaris and Vega had a case of this, too, though without giant chips, it was less obvious. The same is also true in CPUs: consider Intel's NetBurst, AMD's Bulldozer and Piledriver, or more recently (and less egregiously), Intel's Comet Lake.
The problem is that if your architecture is less efficient than your competitor's, you know that you're going to look bad in performance comparisons. But you can squeeze more performance out of the product by clocking it higher than you intended to, at the expense of burning a lot more power. When a company is losing and they know it, they often go that route to make it look like they're not losing as badly.
So in that sense, Nvidia's focus on creating an exotic air cooling system is an ill omen for Ampere. It could easily be a result of Nvidia looking at what they had and realizing that Ampere was going to be less efficient than Navi 2X.
But that might not be the case. A prototypical full node die shrink allows a chip of a given size to have twice as many transistors as before, each of which burns about 70% as much power as the old transistors did. Some quick arithmetic (2 × 0.7 = 1.4) shows that you'd expect the new chip to burn about 40% more power than the old one. Having power consumption go up by 40% or so every two years did happen for a while (the original Pentium was a notoriously hot-running chip in its day, at 5.5 W), but hardware vendors eventually decided that had to stop, and high-end GPUs generally settled into the 225-300 W range starting around 2008.
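Spelling that arithmetic out as a toy model (the 2x density and ~0.7x per-transistor power figures are the idealized full-node-shrink numbers from the paragraph above, not measurements of any real process):

```python
# Idealized full node shrink: twice the transistors in the same area, each one
# burning about 70% of the power it did on the old node.
density_gain = 2.0          # relative transistor count for the same die size
power_per_transistor = 0.7  # relative power per transistor

relative_power = density_gain * power_per_transistor
print(relative_power)       # 1.4, i.e. about 40% more power per shrink

# Compounded over several generations, starting from the original Pentium's 5.5 W:
watts = 5.5
for _ in range(8):
    watts *= relative_power
print(round(watts, 1))      # roughly 81 W after eight such shrinks
```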
But that pressure to increase power consumption hasn't gone away. There were a lot of gains to be had by focusing on making GPUs more power efficient, but the low-hanging fruit there has long since been plucked. It's possible that Nvidia has now decided that in order to get the gains that a die shrink should deliver, the power consumption just has to go up like this. And if so, then it's going to stay high forever.
On the power consumption: NVidia's data center sales just exceeded their total gaming sales for the first time ever.
NVidia may be more focused on developing architecture for data centers. If they then adapt that architecture to create consumer cards as well, a lot of their data center design decisions would likely lend themselves more to huge, powerful consumer GPUs than to small, cheap ones.
Also, they are looking to purchase ARM. That points to data centers as where Nvidia really wants to be.
The whole thing just reeks of “more expensive” for something that is already too expensive, imo. Everyone wanted prices to come down, and from the leaked numbers, not only are they not going down, they are going up. Plus you have to consider that once Navi drops, they will release their Ti or Super versions, so buying the new 3000 series at launch is a tough buy for me. Even though I want it, I don't NEED it. I don't chase games with cutting-edge graphics and ray tracing; Control and similar don't appeal to me. I play Soulsborne games, World of Tanks, Nier: Automata, and lots of Japanese PC ports, so why do I need Ampere?
There won't be any new versions once Navi drops. They're launching so close together that there won't be time for any update versions.
I'm not certain that a data center really needs advanced GPU solutions. NVidia may want to sell to data centers, but there doesn't appear to be a 'killer app' demanding a beefy server-side graphics card to my knowledge. If I'm missing something (I've not been in a data center in a few years), please feel free to correct me.
Logic, my dear, merely enables one to be wrong with great authority.
Machine learning is all the rage these days, with some customers spending millions of dollars on hardware for it in data centers. A bunch of companies are working on ASICs for it, but for now, Nvidia GPUs dominate that market. For that matter, the "tensor cores" that are a considerable chunk of the die space in the GV100, TU102, TU104, and TU106 dies and a large chunk of the die space in the A100 die are best thought of as a machine learning ASIC portion of the core.
Machine learning is largely what has driven Nvidia's meteoric rise in the data center space. They used to push GPUs as the solution for a bunch of compute problems, but now, as far as compute goes, they scarcely seem to care about anything other than machine learning. That's most of their data center revenue, so that's what they focus on.
Machine learning is a new enough market that it's still very volatile, and I've long said that once people are building ASICs for a problem, the ASICs are eventually going to win. But Nvidia devoting increasing amounts of die space to what is effectively a machine learning ASIC basically represents their determination to turn their GPUs into the ASIC that wins that battle.
The problem with turning a large chunk of your die into a machine learning ASIC is that that's a ton of wasted die space for everything else besides machine learning, and that's a huge problem if you were hoping to use GPUs for graphics. That's what bloated the die size and cost of the higher end Turing cards. It's probably the primary reason why the A100 die looks on paper like it improved so little over GV100 in spite of having 2.5x the transistor count. And if Nvidia doesn't chop it out of GeForce Ampere cards, wasting a large chunk of the die on something useless to consumers will probably mean that Navi 2X routs Ampere in the consumer graphics space. If you need 600 mm^2 to match your competitor's 400 mm^2 product on the same process node, you lost that generation.
That depends a lot on whether Nvidia chops the tensor cores out of consumer Ampere or not. You'd think that they'd pretty much have to, but I thought that with Turing, too, and they didn't. And Nvidia has said that in the previous generation they split the lineup, with Volta for data centers and the very similar Turing architecture for consumer graphics, but that with Ampere it's going to be more unified, with all GPUs able to do everything.
In case you're wondering what "tensor cores" are: in Volta/Turing, they could do a matrix multiply-add of 4x4 matrices of half-precision (16-bit) floating-point numbers. That is completely useless for graphics (where matrix multiply is common, but half precision isn't enough), as well as for literally every compute algorithm I've ever had a serious look at. In Ampere, they doubled down on it, bumping up its capability to handle 8x8 matrices, which effectively means that it needs twice as large a fraction of the die space.
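If it helps to see the operation concretely, here is roughly the math that one Volta/Turing-style tensor core op performs, sketched in NumPy (real tensor cores are invoked warp-wide through WMMA instructions and accumulate in registers, so this shows only the arithmetic, not the mechanism):

```python
import numpy as np

# One tensor core operation: D = A @ B + C on 4x4 tiles, half-precision inputs,
# with the accumulator typically kept at higher precision.
A = np.random.rand(4, 4).astype(np.float16)
B = np.random.rand(4, 4).astype(np.float16)
C = np.random.rand(4, 4).astype(np.float32)

D = A.astype(np.float32) @ B.astype(np.float32) + C
print(D.shape)  # (4, 4): 64 fused multiply-adds folded into a single hardware op
```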
So, the killer app that justifies this is relying on this card's matrix math operations? Sounds to me like these data centers want a dedicated math co-processor more than a graphics card. Does no one manufacture math co-processors anymore?
Logic, my dear, merely enables one to be wrong with great authority.
"NVIDIA Tensor Cores offer a full range of precisions—TF32, bfloat16, FP16, INT8, and INT4—to provide unmatched versatility and performance"
Other logic in the chip besides tensor cores offers 64-bit precision, and for that matter, 32-bit precision. The "TF32" is a 19-bit floating-point data type that Nvidia made up, with a mantissa the same size as a 16-bit half and an exponent the same size as a 32-bit float.
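To make the format concrete: TF32 keeps float32's 8-bit exponent but only a 10-bit mantissa (plus a sign bit, 19 bits total), so a rough way to see what it costs in precision is to chop the low 13 mantissa bits off a float32. A truncation sketch (the hardware's actual rounding behavior may differ):

```python
import numpy as np

def to_tf32_ish(x: np.ndarray) -> np.ndarray:
    """Truncate float32 values to TF32-like precision: 8-bit exponent, 10-bit mantissa."""
    bits = x.astype(np.float32).view(np.uint32)
    bits &= np.uint32(0xFFFFE000)  # zero the low 13 of float32's 23 mantissa bits
    return bits.view(np.float32)

x = np.array([3.14159265, 1e-6, 12345.678], dtype=np.float32)
print(x)               # full float32 values
print(to_tf32_ish(x))  # same range as float32, but only ~3 decimal digits of precision
```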
Graphics uses 32-bit floats very heavily, and can use 16-bit in certain places, mainly in the pixel/fragment shaders. That's why GPUs have offered a lot of 32-bit floating-point throughput at least since the dawn of the GPU compute era in 2006, and probably well before that, and various GPUs have offered double-speed 16-bit half-precision computation off and on over the years.
64-bit precision in GPUs isn't new, and isn't tied to tensor cores. Pretty much all GPUs of the last decade could do 64-bit double precision computations, albeit very slowly on many of them. Many high end GPUs in that time offered much faster double precision computations, and the GF100, GF110, Hawaii, GP100, Vega 10, GV100, and Vega 20 GPUs all offered it at half the throughput of floats, just like Nvidia's new A100 chip does.
"Tensor Cores in NVIDIA GPUs provide an order-of-magnitude higher performance with reduced precisions like TF32 and FP16."
Yeah, but support for 64-bit isn't the same thing as acceleration for 64-bit.
"2.5X boosts for high-performance computing with floating point 64 (FP64)"
That's for NVidia's A100 professional card compared to some previous professional card.
Often only the professional cards offer any proper speed for FP64. Consumer GPUs can do it, but with a very heavy slowdown. For example, the RTX 2080 can do FP64, but only at 1/32 the speed of FP32 operations, whereas the Titan V, which is based on a pro chip, can do FP64 at 1/2 the speed of FP32 operations.
So, the killer app that justifies this is relying on this card's matrix math operations? Sounds to me like these data centers want a dedicated math co-processor more than a graphics card. Does no one manufacture math co-processors anymore?
Yes and no. A chip that had Ampere's tensor cores but nothing else wouldn't be useful. Even in heavy machine learning applications where those tensor cores can handle 99% of the computational load, you need to have other stuff to handle the other 1%, or else you'll have a huge bottleneck passing data back and forth between the accelerator and the host, and the tensor cores will sit there mostly idle.
You also need various amounts of memory and caches in order to make everything work. Nvidia's A100 accelerator has 40 GB of memory with about 1.5 TB/sec of bandwidth to it, a 40 MB unified L2 cache with several TB/sec of bandwidth, 27 MB of register space with about 160 TB/sec of bandwidth to it, and some other stuff. The caches and bandwidth are mostly needed for graphics, too, but even a pure machine learning ASIC is going to need a lot of cache to work. For comparison, a PCI Express 4.0 x16 slot offers a theoretical 32 GB/sec of bandwidth, so nearly all of the data that the GPU uses needs to come from local memory or cache on the GPU, rather than being sent over from the host, if you want any semblance of performance.
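The ratios are what make that point; here's a quick comparison using the rough figures above (rounded, order-of-magnitude numbers only):

```python
# Rough bandwidth figures from the post above, in GB/s.
pcie4_x16 = 32        # host <-> GPU over a PCI Express 4.0 x16 link
hbm = 1500            # A100 on-package memory, ~1.5 TB/s
l2_cache = 5000       # "several TB/s": order of magnitude only
registers = 160_000   # ~160 TB/s of register bandwidth

for name, bw in [("HBM", hbm), ("L2 cache", l2_cache), ("registers", registers)]:
    print(f"{name}: ~{bw / pcie4_x16:.0f}x the PCIe link")
# HBM alone is ~47x the PCIe link, which is why the working set has to live on the GPU.
```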
There are a number of other companies making ASICs for machine learning without any pretense of also being able to do graphics. Nvidia got much of the machine learning market early on because they could just write software to run a lot of algorithms on existing GPUs without having to wait several years for the ASIC to be done. We'll see what happens as a lot of ASICs show up, but I expect that the ASICs will win the bulk of the machine learning market eventually. Of course, Nvidia's compute GPUs devoting more and more die space to what is effectively a machine learning ASIC signals their intent to be the company that builds the ASIC that wins that battle.
The problem comes if Nvidia also uses a lot of die space for machine learning junk in the GPUs that they try to sell for graphics, as that hugely bloats the cost. That was the biggest problem with the higher end Turing cards: take out the tensor cores and you probably take at least 10% off of the price tag of the whole GeForce RTX lineup. If they put the full tensor cores into the GeForce versions of Ampere, it's going to be much larger than a 10% premium, and probably enough to get Ampere crushed by Navi 2X.
The guy is anti-Nvidia and anti-RTX/DLSS 2.0... so... whatever you say.
I'm not against ray-tracing. Putting real-time ray-tracing logic into consumer GPUs is an important advance. It's still immature, but it has to start somewhere. That's a totally different issue from tensor cores. You can have either one without the other--and Nvidia's GV100 chip did have tensor cores without ray-tracing.
Even if you think that DLSS is beneficial, you also have to consider that all that die space devoted to the tensor cores that enable it is hardly free. If you had a choice between two otherwise identical GPUs, except that:
1) GPU A can do DLSS 2.0 in the handful of games that support it, or
2) GPU B has 20% higher performance across the board in everything else,
which would you prefer? This is not a trick question. Die space used for tensor cores could have been used for something else, such as more compute units or caches or whatever else you want more of. Or it could have been discarded to just make the die smaller, cheaper, and lower power with otherwise equivalent performance.
And you should also consider that DLSS was created only because putting tensor cores into consumer GPUs was a solution looking for a problem. It's decently likely that someone will create a pure software upscaling method that rivals DLSS 2.0 in performance hit and image quality without needing tensor cores at all. While deep neural networks are notorious for hiding things behind such massive mountains of computations that you have no idea what happened, I'm extremely skeptical that Nvidia's implementation does meaningfully better than what they could have done using the simple packed half math of AMD's Vega or Navi or Nvidia's own lower end Turing GPUs.
The trick question would be: how come Nvidia still crushes AMD performance-wise if they supposedly waste so much space?
You're asking why a 754 mm^2 die is faster than a 251 mm^2 die? Do you also find it mysterious that AMD's top end CPUs crush Intel's top end CPUs in code that scales well to use many cores? The better question is, why isn't the die that is three times as big also three times as fast, rather than only being about 50% faster?
By die size, the Radeon RX 5700 XT lands somewhere between a GeForce GTX 1650 and a GeForce GTX 1660 Ti. By performance, it usually beats a GeForce RTX 2070 that has a die about 80% larger. Take out the tensor cores and those GeForce RTX cards could have dies that are a lot smaller and cheaper.
Nvidia has dominated the top end, if you care only about raw performance and not efficiency, for most of the last 14 years simply because they're willing to build larger dies than AMD. Largest GPU dies in the unified shader era (since 2006):
Nvidia A100: 826 mm^2
Nvidia GV100: 815 mm^2
Nvidia TU102: 754 mm^2
Nvidia GP100: 610 mm^2
Nvidia GM200: 601 mm^2
AMD Fiji: 596 mm^2
Nvidia GT200: 576 mm^2
Nvidia GK110: 561 mm^2
Nvidia TU104: 545 mm^2
Nvidia GF100: 529 mm^2
Nvidia GF110: 520 mm^2
AMD Vega 10: 486 mm^2
Nvidia G80: 484 mm^2
Nvidia GP102: 471 mm^2
Nvidia GT200B: 470 mm^2
Nvidia TU106: 445 mm^2
If you just want the highest performance and don't care about the price tag, then you're usually going to want Nvidia, just because they're willing to build the huge dies you need. If you're not terribly interested in paying $500 for a video card, then few of the GPU dies listed above were ever accessible to you.
By die size, the Radeon RX 5700 XT lands somewhere between a GeForce GTX 1650 and a GeForce GTX 1660 Ti. By performance, it usually beats a GeForce RTX 2070 that has a die about 80% larger. Take out the tensor cores and those GeForce RTX cards could have dies that are a lot smaller and cheaper.
Now you're being a bit unfair to NVidia.
The RX 5700 XT has 24% fewer transistors than the 2070 Super, and it loses to the 2070 Super by maybe 10-15%. AMD is getting a bit more performance out of each transistor (without ray-tracing), but the difference isn't anywhere near as big as you make it seem.
NVidia's dies are at the moment larger than AMD's also because NVidia is buying an older process, and presumably getting a same-sized die cheaper than AMD, who bought the newer process that packs the transistors into a smaller die.
The 2070 Super is a different and much larger die than the 2070.
Yes, it's partially a process node difference. But my point remains that Nvidia frequently builds enormous dies and AMD does not.
I took one of the Super models because when you compare how effective AMD is, you should compare them to NVidia's newest models instead of making NVidia look bad by picking some old NVidia model that's still for sale.
But you're right, I didn't notice they had changed the die. Sorry. In that case the comparison should be to the top product made using that die:
The RX 5700 XT has 24% fewer transistors than the 2080 Super, and it loses to the 2080 Super by maybe 20%. So AMD is getting a bit more performance out of each transistor (without ray-tracing), but the difference is marginal.
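Working that out explicitly (this just restates the figures in the comment above, nothing independently measured):

```python
# RX 5700 XT relative to the RTX 2080 Super, using the figures quoted above.
relative_transistors = 1.0 - 0.24   # "24% fewer transistors"
relative_performance = 1.0 - 0.20   # "loses by maybe 20%"

perf_per_transistor = relative_performance / relative_transistors
print(round(perf_per_transistor, 2))  # ~1.05, i.e. roughly 5% more performance per transistor
```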
Turing's tensor cores use 4x4 matrices at half precision. Ampere's do 8x8 matrices at half precision. That roughly doubles the amount of die size per compute unit, even before you consider that Ampere has more precision options available.
In the A100 that is going to be used heavily for machine learning, putting the tensor cores in makes a ton of sense. It's putting them into consumer cards that is the problem.
This means a price drop for the 2k series. I have been waiting to buy that tier.
Not necessarily. You might see some liquidation sales as retailers who goofed and bought too many cards try to get rid of them, but that will be brief, and won't happen at all unless there is excess inventory to unload. The GeForce RTX 2000 series cards are very expensive to build, so they're never going to be cheap at retail. Ampere will probably enable Nvidia to build something with equivalent performance for considerably less.
The RX 5700 XT has 24% fewer transistors than the 2080 Super, and it loses to the 2080 Super by maybe 20%. So AMD is getting a bit more performance out of each transistor (without ray-tracing), but the difference is marginal.
I would agree if the AMD drivers weren't such crap. Nvidia dropped the ball on the 20XX cards also but they fixed it. AMD really hasn't.
https://videocardz.com/newz/gainward-geforce-rtx-3090-and-rtx-3080-phoenix-leaked-specs-confirmed
As with any other leaks, I don't know if it's real. But if it's a forgery, then it's at least a pretty good forgery.
The specs are brutal, though. With boost clocks listed and some older cards for comparison:
Card                     Shaders   Boost (MHz)   TFLOPS   TDP (W)   GFLOPS/W
GeForce RTX 3090         5248      1725          18.1     350       51.7
GeForce RTX 3080         4352      1750          15.2     320       47.6
GeForce RTX 2080 Ti      4352      1545          13.4     250       53.8
GeForce RTX 2080 Super   3072      1815          11.1     250       44.6
Radeon RX 5700 XT        2560      1950           9.8     225       43.3
The precise meaning of TDP is fuzzy, so it's possible that some cards are playing worse shenanigans there than others. But if the leak is accurate, they're claiming that theoretical performance per watt actually went down as compared to some Turing cards, in spite of the die shrink. That would explain the need to blow out the power budget in order to offer increased performance.
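For what it's worth, the TFLOPS and GFLOPS/W columns in that table fall straight out of the shader counts and boost clocks, assuming the usual 2 FLOPs (one fused multiply-add) per shader per clock:

```python
# Theoretical throughput from the leaked specs: shaders x boost clock x 2 (FMA).
cards = {
    "GeForce RTX 3090":    (5248, 1725, 350),
    "GeForce RTX 3080":    (4352, 1750, 320),
    "GeForce RTX 2080 Ti": (4352, 1545, 250),
}

for name, (shaders, boost_mhz, tdp_w) in cards.items():
    tflops = shaders * boost_mhz * 2 / 1e6   # shaders x MHz x 2 gives MFLOPS; /1e6 gives TFLOPS
    print(f"{name}: {tflops:.1f} TFLOPS, {tflops * 1000 / tdp_w:.1f} GFLOPS/W")
# RTX 3090: 18.1 TFLOPS at 51.7 GFLOPS/W; RTX 2080 Ti: 13.4 TFLOPS at 53.8 GFLOPS/W.
```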
AMD is claiming that Navi 2X will offer a 50% increase in performance per watt as compared to Navi. That would put its paper performance per watt well ahead of an RTX 3090 and give it a decent chance at beating an RTX 3090 outright, depending on how big of a die AMD is willing to build.
Of course, there is a lot more to GPU performance than just raw TFLOPS. But when comparing Turing and Navi, performance was pretty well correlated with theoretical TFLOPS. Older GCN-based cards, including Polaris and Vega, offered considerably less performance for a given TFLOPS than Maxwell, Pascal, Turing, or Navi, so this can vary quite a bit by architecture. I'm skeptical that Ampere will offer enormous advances in whatever the GPU version of IPC is, but we'll see what they come up with.
It also claims GDDR6X memory, in line with previous rumors. I'll say once again, if that's true, it's going to be a very soft launch, if not completely a paper launch for quite some time.
Not counting the whole new PC you have to buy to house it.