Discussion in 'Article Discussion' started by bit-tech, 5 Dec 2019.
Intel needs Rocket Lake in 2020. Comet Lake doesn't look like it does enough to really take the fight to AMD and Zen3. 2021 will see the launch of DDR5 and PCIe 5.0, so the idea of fewer cores and even higher clock speeds while still at 14nm with Rocket Lake could produce very fast systems, but they will be competing with 5nm Zen4. Can Intel really get a 14nm chip to compete with a 5nm chip? If they can, it would be an amazing feat of engineering, but physics makes it hard and unlikely. Typically a process shrink does reduce clock speeds, so AMD could struggle to reach 4GHz at 5nm, and 5nm itself might not be very performant. But overall, if everything plays out reasonably normally, I'd expect Rocket Lake to be really struggling in 2021.
The opposite. While power/transistor has been dropping since 22nm, that's not true for other measures. Cost/transistor has been rising since then, and maximum switching speed dropping. The longer you can stick to any given process node, the cheaper and faster-clocked your chips will be compared to the same chip on a newer node. This is why Intel have been moving mobile chips to 10nm already (because the power/transistor metric is king there), but not desktop chips.
That's what I said? Intel will have very high-clocked chips on its mature 14nm vs AMD, which could potentially be at less than 4GHz on 5nm.
However, the 'but physics makes it hard and unlikely' quote was about making a 14nm chip faster overall than a 5nm chip, considering the vastly increased number of transistors that can fit on a 5nm die. That makes more space available for cache etc., and the quote was taking those factors into consideration, rather than clockspeed alone.
Again, we hit physical limits a while ago here. While you could move to GAAFETs or similar here (like the move to finFETs) these can be applied equally to larger process scales. Transistor size hit the gate oxide limit, metal separation (leading to transistor packing) is hitting the Cu barrier limit (hence everyone trying but not yet achieving a move to Co metal layers).
And even then, adding Moar Transistors does not do all that much good unless you can use them. Desktop applications (or most things outside HPC) are firmly in Amdahl's Law scaling territory, so you can't just use those transistors to stick more and more cores in and expect performance to scale with them (the old quote "bringing a baby to term requires 9 months regardless of the number of women assigned to the task" applies). Bigger more complex cores are one option, but it gets more and more complex to try and wring additional performance with the same instruction set (and performance 'hacks' like speculative execution can have unintended consequences), and getting everyone to use your new instruction set is hard enough (e.g. AVX512) let alone switch to an actually new architecture (the crash and burn of Itanium and HSA).
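The Amdahl's Law ceiling is easy to put numbers on. A quick Python sketch (the 90%-parallel workload is an illustrative assumption, not a measurement of any real application):

```python
def amdahl_speedup(p: float, n_cores: int) -> float:
    """Amdahl's Law: overall speedup from n_cores when only a
    fraction p of the workload can run in parallel."""
    return 1.0 / ((1.0 - p) + p / n_cores)

# Even a 90%-parallel workload tops out at 10x, no matter how many
# cores those extra transistors buy you.
for n in (2, 4, 8, 64, 1_000_000):
    print(f"{n} cores: {amdahl_speedup(0.9, n):.2f}x")
```

Infinite cores give at most 1/(1-p), i.e. 10x here, which is why throwing transistors at core count alone runs out of road on desktop workloads.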
Then there's cost: x billion transistors simply cost more as you move to smaller processes, and that cost is going up and up with every process shrink. Unless you're hitting the reticle limit (and CoWoS / EMIB / Foveros / etc are looking to alleviate that soon anyway), the same number of transistors will cost less and clock higher on a larger process scale than a smaller one. That cost premium may be worthwhile when you are power-limited as with mobile devices.
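The cost trade-off can be sketched with a simple Poisson yield model; all the wafer prices, die sizes and defect densities below are made-up illustrative figures, not real foundry numbers:

```python
import math

def cost_per_good_die(wafer_cost: float, die_area_mm2: float,
                      defects_per_mm2: float,
                      wafer_area_mm2: float = 70_000) -> float:
    """Rough per-good-die cost: dies per wafer scaled by a simple
    Poisson yield model, yield = exp(-D0 * area)."""
    dies_per_wafer = wafer_area_mm2 / die_area_mm2
    yield_frac = math.exp(-defects_per_mm2 * die_area_mm2)
    return wafer_cost / (dies_per_wafer * yield_frac)

# Same transistor budget: big die on a cheap mature node vs small
# die on an expensive new node (with a worse defect density).
mature = cost_per_good_die(wafer_cost=4_000, die_area_mm2=300,
                           defects_per_mm2=0.001)
leading = cost_per_good_die(wafer_cost=17_000, die_area_mm2=100,
                            defects_per_mm2=0.002)
print(f"mature: ${mature:.0f}, leading edge: ${leading:.0f}")
```

With these hypothetical inputs the mature node comes out cheaper per good die, even though its larger die yields worse per wafer.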
I don't understand the point of your posts.
Are you disagreeing with my statement that 14nm Rocket Lake will struggle to compete with 5nm Zen4? You haven't explicitly stated you disagree, yet you seem to be picking at the points I made.
I also take it you didn't read the entire line in my post about what increased transistor counts offer. I didn't mention Moar Cores; I specifically said -
The point I am making is that I believe the advantages of AMD designs on the 5nm node (increased L1 cache, among other things) will outweigh the pure clock speed advantage of 14nm and Rocket Lake will not be able to compete because it is a year too late.
Could you explain the point you are trying to make?
You are just repeating what I was implying in the first post: a mature process is cheaper and clocks higher. I don't get it. What are you saying? That we should never increase the number of transistors, and never use a smaller process for fabrication?
Have you considered how big a chip with 9 billion transistors is at 14nm (TSMC) compared with at 5nm (TSMC)?
This is why transistors are added at die shrinks: they can't be added before without the chip size getting too big. Transistors on a bigger process node also generate more heat than on smaller nodes, so the only time it really becomes feasible to add many more transistors is at a process shrink.
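To put rough numbers on it (TSMC's '14nm-class' node is 16nm; the densities below are commonly quoted peak logic-density estimates, and real chips with lots of SRAM land well below them):

```python
TRANSISTORS = 9e9  # a 9-billion-transistor design

# Approximate peak logic density, transistors per mm^2 (estimates).
density = {"TSMC 16nm": 28.9e6, "TSMC 5nm": 171.3e6}

for node, d in density.items():
    print(f"{node}: ~{TRANSISTORS / d:.0f} mm^2")
```

That's roughly 311 mm^2 versus 53 mm^2 for the same transistor count: about a 6x area gap between the two nodes.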
Unless you're hitting the reticle limit, you can 'add more cache' on a larger node the same as you can on a smaller node. Unless you're extremely power-constrained a given circuit will perform better on a larger node.
On top of that, performance boosts are not as easy as 'just add more cache'. Increasing implicit cache size, if not done very carefully (with large-scale changes to the pipeline and architecture that boil down to proportional cache size not changing after all is said and done), can decrease performance (larger cache = higher access latency = greater impact of a cache miss = overall latency increase and effective memory bandwidth reduction). Adding explicitly addressed isolated caches (e.g. the XB1's ESRAM) gets into the pain-in-the-arse of platform-specific coding. 'Just' adding support for improved vector instructions you were performing anyway (e.g. AVX512) is hard enough; rearchitecting your code to use an explicit cache while still maintaining a codepath for all the chips that do not have that cache is not going to fly, for the same reasons HSA and Itanium failed to gain traction: you need a truly vast (orders of magnitude, not single- or double-digit percentage) performance increase to justify that workload.
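The 'bigger cache can be slower' point falls out of the standard average-memory-access-time formula; the cycle counts below are illustrative, not any real core's numbers:

```python
def amat(hit_cycles: float, miss_rate: float, miss_penalty_cycles: float) -> float:
    """Average memory access time: hit time plus the expected
    cost of going to the next level on a miss."""
    return hit_cycles + miss_rate * miss_penalty_cycles

# Doubling the cache trims the miss rate but adds a cycle of hit
# latency -- and the bigger cache loses overall in this example.
small = amat(hit_cycles=4, miss_rate=0.05, miss_penalty_cycles=40)  # 6.0 cycles
large = amat(hit_cycles=5, miss_rate=0.04, miss_penalty_cycles=40)  # 6.6 cycles
print(small, large)
```

Unless the miss-rate reduction outweighs the extra hit latency, every access pays for the bigger cache and only misses benefit.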
For most of the time, yes: moving to a smaller node is not a gain in cost or performance, it's a regression. The days of perf-boosts from die-shrinks died close to a decade ago, since then performance gains have been architecture-based in spite of smaller nodes, not because of. "Smaller number means more better" does not apply in practice.
As for why development didn't just stop at the 22nm inflection point? The need for mobile dies driving demand for low-power improvement (where process shrinks make sense) and the use of monolithic dies, along with the costs of running two parallel process nodes, meant moving to a new node and 'taking the hit' of a small performance regression was preferable. Multiple effects have now stacked up to change the balance: the cost of the node itself is increasing faster and faster, the cost of building fabs for the new node has significantly increased (due to a combination of DUV SAQP and EUV machines being expensive, and the need for new metal processes), and there are multiple viable options for split-die layouts without the performance and power impacts of routing through non-silicon substrate (the power penalty of through-substrate IF is why all Ryzen mobile chips are monolithic dies).
Power/transistor is dropping more slowly than transistor separation distance (the reduction close to halted once the gate oxide limit was reached, the transistors themselves are not getting any smaller) so shrinking nodes have increased power-per-unit-area. Packing in more transistors into a smaller space makes thermals worse, not better.
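That trade is one line of arithmetic: if a shrink halves the area per transistor but only cuts power per transistor by 30% (both factors made up for illustration), power per unit area goes up:

```python
area_per_transistor_scale = 0.5   # transistors pack into half the area
power_per_transistor_scale = 0.7  # power/transistor only falls 30%

# Power density scales by the power change divided by the area change.
power_density_scale = power_per_transistor_scale / area_per_transistor_scale
print(f"power density x{power_density_scale:.2f} per shrink")
```

So unless power/transistor falls at least as fast as area/transistor, each shrink concentrates more watts into every square millimetre.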
Please correct me if I am wrong, but this is how I understand what you are saying in simple terms:
5nm Zen4 will only be better than 14nm Rocket Lake if it is a vastly superior architecture? This is because there is no performance benefit from a process shrink to 5nm, so the higher clocks of Rocket Lake will make more of a difference to the final performance of the chip?
I'll have to bow to your superior knowledge on the subject, as you obviously know more about chip design and manufacture than I do. I knew we were seeing limited improvements from process shrinks, but I didn't realise we had hit a regression point, apart from understanding that new process nodes normally now come with clock speed penalties.