So, what's the deal with these things? I mean, how does a video card like the 8800 GTX only have 128, while these new cards are pushing hundreds (thousands)? Like the HD 4870 X2 - it has 1600?! And the newer GTX 260/280 cards are pushing much lower numbers (200+). Is Nvidia's vs ATI's stream processor implementation different? How can ATI be pushing hundreds of these things, and yet Nvidia is only a few above their older cards? To me, if they're calling these things "stream processors" on both sides, surely they're the same tech. So to me an ATI card, for horsepower, is kicking Nvidia's pants big time. Some clarity from my tech-geek-laden brethren?
Is Nvidia's vs ATI's stream processor implementation different? Short answer: yes. Think of them as cylinders in an engine - having more of them doesn't necessarily mean more power.
Afaik, nVidia and ATI use different implementations of stream processing: CUDA and Brook+ respectively. From AMD's site: "Stream computing (or stream processing) refers to a class of compute problems, applications or tasks that can be broken down into parallel, identical operations and run simultaneously on a single processor device. The benefit of stream computing stems from the highly parallel architecture of the GPU, whereby tens to hundreds of parallel operations are performed with each clock cycle, whereas the CPU can at best work on only a small handful of parallel operations per clock cycle."
AMD:
http://en.wikipedia.org/wiki/BrookGPU
http://ati.amd.com/technology/streamcomputing/faq.html
nVidia:
http://en.wikipedia.org/wiki/CUDA
http://www.nvidia.co.uk/object/cuda_what_is_uk.html
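No actual CUDA or Brook+ code is shown in the thread, but the "parallel, identical operations" idea AMD describes can be sketched in plain Python. The `kernel` function and the data here are made up for illustration; a real GPU would run the kernel on thousands of elements at once rather than mapping over them one at a time.

```python
# Stream computing in a nutshell: one kernel, applied identically to
# every element of a data stream. A GPU runs these applications in
# parallel across its stream processors; this sketch just maps them
# sequentially to show the programming model.

def kernel(x):
    # the same operation is applied to every element of the stream
    return x * 2.0 + 1.0

stream = [0.0, 1.0, 2.0, 3.0]
result = [kernel(x) for x in stream]
print(result)  # [1.0, 3.0, 5.0, 7.0]
```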
I hope this doesn't go too far over your head, but it's easier to explain by laying it all out on the table... The stream processors in Nvidia's and AMD's respective architectures are slightly different, but there's a reason for that. Nvidia's architecture is scalar, while AMD's architecture is superscalar. Basically, with Nvidia's architecture, you can throw any piece of code at it and it will run on as many of the stream processors as it needs - they all have the same functionality (bar the special function unit - there's one of those per eight stream processors, but it's not part of the count) and are pretty generalised. It's a brute force method, although Nvidia would crucify me for saying that - you just throw code at it and it works everything out itself. On the other hand, AMD's architecture has blocks of five (well, technically six) stream processors that have differing functionality. Four of them can handle FP MAD, FP MUL, FP/INT ADD and dot product calculations, while the fifth unit can't handle dot products, INT ADD or double precision calculations, but can handle INT MUL, INT DIV, bit shifting and transcendental calculations (SIN, COS, LOG, etc). It's a bit more complex, but if the code is optimised well, it can deliver much higher performance - that's why the FLOPS throughputs on the AMD chips are quite a bit higher, too: they only take FP MAD and FP MUL into account, and all of the units in the AMD chips can do those calculations (they're the most widely used). I'd say the AMD architecture is a lot cleverer in many respects, but it does require a bit more work from the developer to achieve peak performance.
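To make the scalar-vs-VLIW distinction above concrete, here's a toy throughput model in Python. The function names and the packing numbers are my own inventions for illustration, not vendor specs:

```python
# Toy model of the two designs. Scalar (Nvidia-style): any instruction
# can go to any stream processor. VLIW (AMD-style): units are grouped
# into blocks of `width`, and slots the compiler can't fill sit idle.

def scalar_ops_per_clock(sp_count, ready_ops):
    # every ready op can be assigned to any free unit
    return min(sp_count, ready_ops)

def vliw_ops_per_clock(blocks, width, avg_slots_filled):
    # each block issues one instruction word per clock; only the
    # slots the compiler managed to pack actually do useful work
    assert avg_slots_filled <= width
    return blocks * avg_slots_filled

print(scalar_ops_per_clock(240, 1000))  # 240 (GTX 280-style SP count)
print(vliw_ops_per_clock(160, 5, 5.0))  # 800.0 - perfectly packed code
print(vliw_ops_per_clock(160, 5, 1.0))  # 160.0 - worst-case packing
```

This is why the "800 vs 240" comparison in the thread is apples to oranges: the 800 figure assumes all five slots in each block are doing work every clock.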
Very nice explanation. But if ATI (aka AMD) requires more work from developers, then why do they only sponsor one engine (the Source engine)? Why not sponsor more engines, such as CryEngine 2 and Unreal Engine? That way they'd get better performance and could easily win over nVidia. Sooooo... ATI basically has 800/5 = 160 stream processors if the code isn't optimised for their pipelines?
ATI doesn't spend money on DevRel like Nvidia... I don't think it's part of their strategy (particularly from a financial POV). I've heard numbers from various sources relating to TWIMTBP expense (including marketing, devrel/devtech engineer wages/relocation packages/etc, running the Game Test Labs in Moscow and so on) hitting upwards of $200m a year. I don't know how true that is, but there are, after all, about 250-300 Nvidia engineers working with game developers across the world today... that's a pretty sizeable team. In a worst case scenario, yes, you could end up with just 160 SPs, but AMD does do some work in the driver to make sure the Ultra-Threaded Dispatch Processor (the scheduler inside the GPU) is feeding as many tasks as possible to as many units as possible. Often it's not easy to make code go that wide without it being designed that way, though, so there's only so much that can be done from AMD's end. They do work with developers pretty actively, from what I've been told, but there's just not as big a splash as there is from Nvidia. One of the biggest benefits for AMD, though, is the fact that many games lead on Xbox 360 these days... and guess what GPU is inside that? It's ATI's first USA (unified shader architecture) chip, and that's also a VLIW design (very long instruction word) - it's four wide, not five wide, though.
Nvidia: 1 in -> 1 "general" stream processor -> 1 out
ATI: 1 in -> 1 block [of 5/6 "specialised" stream processors] -> 1 out
I knew that it was the differences in architecture which caused the massive differences in the number of processors... and I think I understood enough of that. I mean, Christ knows what the INT MUL and INT DIV functions actually are... but I at least understand the basic differences in architecture now, I think. Thanks muchly, Tim. That's quelled some of my curiosity for now. I just have a related question, if you don't mind... I've heard the term ALU bandied about, but what is an ALU exactly?
Sorry bindi, but the damage is done - it's fried! lol! Basically, ATI are trying to be too particular in the way they process graphics data. It's a little bit like an F1 car: its speed comes down to a lot of things. Nvidia, meanwhile, just strapped on a massive engine and yanked the throttle wide open!
Exactly - it's Intel versus AMD all over again. It's the best possible way for things to work, because otherwise we'd all be pretty specialised in either getting the silicon on the chip or the best way to process the information, but have no clue about the other - although I think ATI/AMD picked the short straw in this case. A good analogy would be American engines versus Japanese engines.
MUL is a multiply, while DIV is a divide. INT is basically an integer, so you have a fixed range of values available (integers are talked about in bits... you'll probably hear 8-bit and 10-bit integers thrown around). FP is floating point, where there isn't a fixed range - the value can float around, and the only limitation is the accuracy (i.e. the precision - most floating point calculations are done with 32-bit precision these days). Most anti-aliasing is done using integers these days because it's fairly accurate, but really fast. ATI used floating point units (the stream processors, in simple terms) to resolve AA in the 2000/3000 series cards, and that meant the AA could be really high quality, but it was often much slower. Ideally, you want a combination of both, and that's what ATI is doing in the 4000 series hardware with its custom filters. An ALU is an arithmetic logic unit - basically a unit that executes maths and logic calculations. Think of it as something that takes data in (usually two pieces, but it can be more), completes an operation using that data and then returns another value.
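To put the ALU description into something runnable, here's a toy software ALU in Python. It's purely illustrative - a real ALU is combinational hardware, not code, and the opcode names here are made up - but it shows the "data in, operation, result out" shape:

```python
# A toy ALU: an operation code plus two operands in, one result out.

def alu(op, a, b):
    if op == "ADD":        # INT ADD
        return a + b
    if op == "MUL":        # INT MUL
        return a * b
    if op == "DIV":        # INT DIV (truncating integer divide)
        return a // b
    if op == "SHL":        # bit shift left
        return a << b
    raise ValueError("unsupported op: " + op)

print(alu("ADD", 1, 2))  # 3
print(alu("MUL", 6, 7))  # 42
print(alu("DIV", 9, 2))  # 4 (integer division truncates)
print(alu("SHL", 1, 3))  # 8
```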
Right, I get it. I think. Thanks, Tim. So, it might, say, take in 1 and 2 and output 3 (therefore adding them)? I get it. Thanks again.
Yep, it's anything purely integer arithmetic rather than floating point - integer ALUs are usually faster, iirc, because there's a certain amount of "fixed function" to them.
Mmhmm, thanks a lot Tim/Rich for that explanation. It does clear it up for the most part. So really, a game can take full advantage of a card's SPs if it has been optimised for that card's architecture. Like those games that show the 'Nvidia' logo at the beginning - those are going to run very well on an Nvidia chipset, correct? I've always heard this from gamers and users in general: 'Nvidia has great driver support for a game, but ATI/AMD generally has more horsepower with lesser driver support.' This is what has driven my own Nvidia purchases in the past - the support for games to fully take advantage of the card. So I guess a question for the bit-tech staff would be: what card would you use, based on the technicalities and implementation of said tech? Which company really hits the nail on the head? Again, thanks for the explanations!
Very interesting topic - can't wait to see what CUDA can do in 3D apps later on. I'll be sticking with Nvidia for work stuff in the near future, but this generation of ATI cards has made me consider one a lot. Any knowledge on when we might see some examples of the ATI cards using their processing power in this way? It would be good to see how much performance can be wrung out of each method using optimised code, etc.
Tim, reading this made me remember how much I enjoyed Ryan Leng's in-depth memory articles. In fact, the only reason I visit AnandTech at all is because of their detailed technology analyses. Is there any chance we can have moar, on differing subjects?
Nvidia GPU design = American engine design = increase the engine size (V12)
ATI GPU design = European/Asian engine design = same power from a smaller, better-designed engine
...but there's something about a V12 Hemi
Looking back at my purchase of the GTX 280 reminds me that X58 will support SLI, and if things pan out for the best I shall be the owner of a 30in TFT in the near future, so SLI GTX 280s sound really nice. Even though the performance/cost thing is a little painful!