Discussion in 'Article Discussion' started by bit-tech, 6 Feb 2019.
I'd say AMD are being aggressive and taking a proactive approach; there's only so much they can do. We know AMD have been aware of the issue Level1Techs identified since at least the initial launch of (IIRC) the 2990WX, and the same goes for the Nvidia issue: was that something like 2-3 months after launch?
It seems to be more of a lead-time issue with engineering samples, but as I don't know the typical time between ES being sent out and a release, I may be talking rubbish.
Either way I've just made myself a big tub of popcorn.
They didn't reinvent the way glue works for Zen 2 just for fun.
I don't understand... explain, please?
With Zen 1, they have Infinity Fabric connecting the individual modules.
With Zen 2, they send all chiplet-to-chiplet as well as CPU-to-system data through a separate chip.
Why such a big change in just two years from a company that needs to squeeze every penny 7 times?
Logic dictates that they would never have bothered with the R&D effort to completely change the way the individual modules connect if they hadn't been aware of severe issues with the old approach.
Zen 1 and Zen 2 both use Infinity Fabric. IF is just the name given to a collection of two 'fabrics', and each of those can be further broken down into on-package and off-package communication.
With Zen 2, IF is still used between the chiplets and the I/O die; the only thing that really changes is which die contains what.
AMD are using the chiplet system (which uses Infinity Fabric between the dies) for better yields. The main I/O die is a (reliable) 14nm part, which they can produce on a separate wafer. The chiplets are 7nm (which is not currently as reliable) and considerably smaller, so they can discard unusable chiplets (or bin the lower-clocking ones) and still retain a much larger portion of the wafer, improving yields.
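To put rough numbers on the yield argument, here is a minimal sketch using the textbook Poisson defect model, Y = exp(-D·A), where D is defect density and A is die area. The defect density and both die areas are made-up assumptions for illustration, not AMD's real figures; the point is only how much harder a given defect rate hits a big die than a small one.

```python
import math

def poisson_yield(area_mm2: float, defects_per_mm2: float) -> float:
    """Fraction of defect-free dies under a simple Poisson model: Y = exp(-D * A)."""
    return math.exp(-defects_per_mm2 * area_mm2)

# Hypothetical numbers chosen only for illustration (not real 7nm data).
D = 0.002           # assumed defects per mm^2 on a young process
BIG_DIE = 600.0     # hypothetical monolithic die, mm^2
CHIPLET = 75.0      # hypothetical chiplet, mm^2 (8 of them = same silicon area)

big = poisson_yield(BIG_DIE, D)
small = poisson_yield(CHIPLET, D)

# Under this model the fraction of good dies is also the fraction of
# die area on the wafer that ends up usable.
print(f"monolithic die yield: {big:.1%}")
print(f"single chiplet yield: {small:.1%}")
```

With these assumed numbers, roughly 30% of the monolithic dies come out clean versus about 86% of the chiplets, and a defect only costs one 75 mm² chiplet instead of a whole 600 mm² die.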
It's quite possible the issue that NV identified and corrected is related to the one CorePrio is exposing. Either way, this is a huge flaw in the Windows kernel, something AMD has very little control over. Chances are they did know about it but MS wouldn't fix it ("couldn't reproduce"); now that L1T has done the groundwork and given us CorePrio, MS finally has no option but to actually fix their sh... er... stuff.
Think about it:
Zen 1 = issues with the Windows scheduler
What does Windows love above anything else? Providing compatibility with obsolete tech like the traditional northbridge
AMD decides to stick what is effectively a traditional northbridge between the chiplets and the system
Put 1 + 1 + 1 together and you might just get a fix for said issues.
They didn't do that because Windows throws a wobbly when it comes to ccNUMA nodes; they did it because some things, like I/O-related IP blocks, don't scale down to new process nodes as efficiently as logic blocks do.
This is a Windows issue, not an AMD issue.
The Linux kernel handled 2nd gen Threadrippers just fine from launch day...
It's a radical change in topology.
Zen 1 uses a distributed mesh topology: core and uncore parts are spread between dies, and access between any given core (or cache) and external memory is heterogeneous due to differing paths and numbers of hops between the external interfaces and any given core or cache. This requires hierarchical, nested NUMA domains to handle properly.
Zen 2 uses a centralised hub-and-spoke topology: the uncore is centralised, ALL core-to-core and core-to-external accesses go through it, and those accesses are homogeneous apart from inter-core exchanges. In most cases (where you aren't deliberately digging down into the inter-cache level of memory management) you can get away with not worrying about NUMA at all, as external memory access is now uniform across all cores.
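A toy way to see the difference is to count fabric hops from each compute die to each memory controller. This is a minimal sketch, not real latency data: the layout (4 dies with memory attached to dies 0 and 2 only, loosely like a 2990WX; 8 chiplets behind one I/O die for the hub case) and the "one hop per fabric link" rule are simplifying assumptions.

```python
from statistics import mean

def mesh_hops(num_dies: int, memory_dies: set[int]) -> dict[int, list[int]]:
    """Zen 1-style distributed mesh (toy model): dies are fully connected and only
    some own memory. A die reaches its own memory in 0 hops, any other in 1 hop."""
    return {
        die: [0 if die == mem else 1 for mem in sorted(memory_dies)]
        for die in range(num_dies)
    }

def hub_hops(num_chiplets: int, num_controllers: int) -> dict[int, list[int]]:
    """Zen 2-style hub-and-spoke (toy model): every chiplet reaches every memory
    controller through the central I/O die, so the hop count is identical."""
    return {chip: [1] * num_controllers for chip in range(num_chiplets)}

# Assumed layouts, for illustration only.
mesh = mesh_hops(4, {0, 2})
hub = hub_hops(8, 2)

for name, table in (("mesh", mesh), ("hub", hub)):
    # Average hops per die: uneven averages are what a NUMA-aware scheduler must track.
    print(name, {die: mean(hops) for die, hops in table.items()})
```

The absolute hop counts aren't comparable between the two models; what matters is the spread. In the mesh, dies 0 and 2 average 0.5 hops while dies 1 and 3 average 1, so where a thread lands matters, whereas every row of the hub table is the same.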
Sure, but why did AMD not even mention it until several months after release?
Standard MO would be to work with MS loooong before product release to iron out the bugs of sticking massive, convoluted, multi-core, multi-domain chips in a consumer OS (even before the first tape-out you have a good idea of your architecture model). And even if MS were for some reason willing to deliberately make Windows perform worse, why would AMD just sit and twiddle their thumbs in silence rather than outright stating "performance is below that on other operating systems because Windows does X"?
It doesn't matter whether AMD or Microsoft caused the issue, because whoever caused it, the victim is the performance (and sales) of AMD CPUs, so it is in AMD's interest to fix it (or work around it with a change in topology).
I know, that's what I said. I'm glad we agree.
To be fair, I've not had hands-on time with a 29x0 Threadripper, but I've had some interesting times with a 2700X recently in Linux, with PCI-E errors cropping up in one kernel version that didn't in others (Ubuntu 4.15.0-42 and -45 are fine; -43 is horribly broken). -43 exhibited the same issues with first-gen TR and an 1800X, with no issues on Intel systems of comparable generations. That said, Linux handled scheduling a hell of a lot better than Windows did...
AMD can contribute patches to open source projects such as the Linux Kernel.
And they have done so. Which is why the 2990WX performs well under Linux.
AMD cannot fix Microsoft's OS. Only Microsoft can.
AMD doesn't even have access to the Windows source code.