This is going to be quite a technical question which probably has no known answer for anyone other than head engineers at Intel. However, I am putting faith in the plethora of intellect that browses these forums (I hope!). Simultaneous Multi-Threading (SMT), or Hyper-Threading as Intel named it, is quite easy to understand: each "core" has 2 logical processing units but a single execution engine, as opposed to a core with one of each (a non-hyperthreading core). My question is: with Intel, AMD and even ARM pushing through designs and chips with more and more cores, why did SMT stop at only 2 logical processors per core? Why could each core not feed 3 or 4 logical processors into a single execution engine? Are there certain design parameters that just make this impossible, or has it simply not been explored? Do multi-threaded applications have to use separate cores, or would logical cores, in most situations, present a decent performance/cost and performance/watt alternative?

So, anyone got a clue? Don't reply with stupid answers like "quads are just better - fact", because I will slap you. I want some kind of evidential proof of why CPUs are the way they are. Note: I'm pretty sure I've got all the basics correct in the above, but if I've got anything wrong please do tell me.
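For anyone who wants to see the logical-versus-physical split described above on their own machine, here's a minimal sketch. It assumes a Linux box, where the kernel exposes SMT topology under /sys (the path and the helper name are mine, purely illustrative); on other systems it just falls back to reporting no SMT info.

```python
# Hedged sketch: on Linux, each logical CPU lists its SMT siblings in sysfs.
# cpu0's sibling list shows how many logical processors share one physical
# core (2 on current Intel Hyper-Threading parts, 1 when SMT is off/absent).
from pathlib import Path

def smt_threads_per_core(cpu: int = 0) -> int:
    """Return how many logical CPUs share this cpu's physical core (1 = no SMT)."""
    sib = Path(f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list")
    if not sib.exists():
        return 1  # non-Linux or no sysfs: no SMT information available
    # The file holds entries like "0,4" or ranges like "0-1".
    count = 0
    for part in sib.read_text().strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            count += int(hi) - int(lo) + 1
        else:
            count += 1
    return count

print(smt_threads_per_core())
```

On a 4-core chip with HT you'd expect this to print 2 (8 logical processors over 4 physical cores); with HT disabled in the BIOS, 1.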
HT doesn't really provide better performance on its own, but anything that helps, well, helps. Also consider Moore's law (http://en.wikipedia.org/wiki/Moore's_law), which stated, roughly, that the number of transistors on a chip doubles every year. He was right, thanks to continual improvements in lithography and manufacturing processes (memory chips too). But every year? Well, things slowed down a bit, so it's now ~18 months, and each step costs more and more to do. So technologies were invented to try and squeeze more out of what was there, which is where HT from Intel came in. But today things are moving faster again, so that kind of technology is no longer the focus, and multi-core is what gets pushed.

Also consider the business angle of purposely slowing things down. Why release a 16-core CPU now when you can sell 6 cores, then 8, then 10, then 12, 14 and finally 16, especially when you can keep the profit margin the same or higher on every version? Example: remember the Pentium 4 era, where Intel didn't do anything for so many years, and it took AMD to release not only a better and faster CPU, with a much better chipset thanks to Nvidia, but also a proper dual core and 64-bit. That finally made Intel wake up, pull the Core 2 Duo series off their own shelf, and kick-start competition. With AMD lacking a competing model to the Core i7 and Sandy Bridge processors, I have a feeling we'll be back to the way it was in the Pentium 4 days.

Your GPU (the GTX 260) has 192 processing cores. As for your CPU... well, competition was slow, so we have what now, 6 cores? Oh boy, a long way to go. I could be wrong, but right now it may simply be that adding cores is cheaper than investing in optimization technology that only helps a bit. Windows 7 is ready either way: the 32-bit version supports 32 cores on one processor, and the 64-bit version 256.
Mmmmm... nVidia's SM 'cores' and whatever AMD call theirs (memory escapes me atm) are multi-threaded and work very well, within their limitations. I think the easiest way to look at it is this: a CPU is hugely versatile, has a large cache and a vast instruction set, which lends itself more to spreading tasks between a few fast cores, with SMT allowing free cycles to be used by other threads for some gain. A GPU, by contrast, has a very limited instruction set which lends itself to repeated, predictable tasks that more often than not have a huge degree of parallelism, and that gives a much greater advantage to a higher thread count.

The issue then is that, because a modern computer needs the versatility a CPU provides to run at a reasonable speed, it is more advantageous to increase the core count so it can better handle the almost impossible number of combinations of simultaneous tasks in a multi-tasking environment (a basic OS plus some software), because there isn't enough parallelism within them. A GPU would be pretty dreadful as a main CPU (with today's tech). It does mean, though, that a CPU is at a disadvantage when it tries to carry out massively parallel tasks, which is where the (historical) introduction of CUDA, Stream/OpenCL, DirectCompute, etc. comes in: it lets programmers offload the tasks/sub-processes that are better suited to parallelism onto the GPU.

Which is all a roundabout way of saying that some tasks are suited to heavy multi-threading and some aren't, and more and more (well, once CUDA dies a death and OpenCL takes over; since OpenCL is an open standard and runs on nVidia and AMD cards, and others, it 'should' be the better outcome) this will be the way to get the best of both worlds.
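To make the CPU-versus-GPU distinction concrete, here's a toy Python sketch (the function names are mine, purely illustrative): the first function is a data-parallel "kernel" where every element is independent, which is exactly the shape of work that maps well onto thousands of GPU threads; the second is a dependency chain where each step must wait for the previous one, which a few fast, versatile CPU cores handle better.

```python
# Data-parallel work: the same operation applied to every element
# independently. Each element could be computed by a separate GPU thread.
def vector_add(a, b):
    return [x + y for x, y in zip(a, b)]

# Serially dependent work: every step needs the previous step's result,
# so there is no element-level parallelism to exploit - a fast single
# core wins here.
def running_sum(xs):
    total, out = 0, []
    for x in xs:
        total += x
        out.append(total)
    return out

print(vector_add([1, 2, 3], [10, 20, 30]))  # [11, 22, 33]
print(running_sum([1, 2, 3, 4]))            # [1, 3, 6, 10]
```

The point is the *shape* of the work, not the language: a real GPU kernel would express `vector_add` once and launch it over thousands of elements, while `running_sum` gains nothing from that many threads.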
Am I wrong in thinking that hyper-threading was initially developed for, or at least first utilised in, servers? I'm pretty sure that Moore's law has actually stuck, something Intel's engineers are very proud of. I would like to know if they ever tried a prototype chip with 4 threads per core, just out of curiosity.
CUDA won't die. Every compute API (OpenCL, CUDA, DirectCompute...) has its strengths and weaknesses. That's like calling Java dead because it takes forever to cold-start a Java program, since it needs to load the Java environment, read the Java bytecode and then execute the program in question. Yeah, it sucks... but it has strengths many other languages can only dream of having. CUDA is widely used in research everywhere: it's easier to program, C can be used so algorithms don't need to be remapped, it's well documented, it has shared-memory support, it has other features the others don't, and it supports SLI (obviously).
Yes, on the Foster MP-based Xeon, which was out in 2002, and it later came to the Northwood Pentium 4 - so says wiki: http://en.wikipedia.org/wiki/Simultaneous_multithreading. The same article also explains why HT-like technologies aren't pushed harder.
There are several reasons:
1) Qualification, cost and time to market.
2) IP.
3) Workload and power efficiency.
4) Engineering.

Itanium cores and other server parts can have 4-8 threads per core (iirc), depending on who makes them. That has existed for years; however, the cost of qualification goes up exponentially the more you add: everything has to work and not crash the chip. Qualification takes time and money, which is why bespoke server parts that cost an arm and a leg get it. Their time to market is also a lot longer, and the workloads differ greatly: servers predominantly run a small amount of software for many users, or highly threaded software, while home PCs are single-user with greatly varying software. Caches, pre-fetchers and buffers have to be arranged differently; even between a Xeon and a normal consumer part that fit the same socket, Intel codes the pre-fetch firmware differently.

The very first P4s to market actually had hyperthreading in the silicon but not enabled. It took Intel a further few years and a manufacturing node shrink before they "launched" SMT. The same goes for the Core line of CPUs: the first generation didn't have it because Intel was still qualifying it for Nehalem, which was server-first (plus it didn't have the manufacturing node available to handle the extra transistor budget required).

There's also workload and power efficiency to take into account, as SMT isn't the only answer. SMT throws two decoded threads down the pipe, whereas Nvidia (and in fact the IBM PPC cores in your Xbox 360 and PS3) prefer the "dual-issue" approach, where two instructions are launched right at the start of the pipeline. Nvidia's shaders are in-order though, not OoO like your x86 chip, which is more complex and again takes time to qualify. AMD is going down yet another road, as it reckons the FPU can be used more efficiently if it's shared between two cores.
AMD claims the FPU logic is only actually in use about 40% of the time, and that it's more efficient to make it wider - so it can handle more complex math efficiently - while splitting it between two cores. This approach was actually taken to a greater extreme in the UltraSparc series of server CPUs: the first 8-core UltraSparc had 8 ALU cores but only a single (non-pipelined) FPU between them. AMD couldn't do hyperthreading because it simply doesn't have the engineering expertise and IP available, so it had to look for an alternative.
Goodbytes, thanks for the MIPS research, that is very interesting! I'll be reading up on that and keeping up to date with it where possible.
http://www.amazon.com/Race-New-Game...=sr_1_5?ie=UTF8&s=books&qid=1300003203&sr=1-5 This is a good (if unfortunately completely self-absorbed) book by one of the IBM team that designed the PPC cores in the 360 and PS3 consoles. It discusses some of the ideas of CPU design in a relatable manner. Otherwise, try and read a bit of realworldtech.com; David Kanter's insights are legendary and extensive, but VERY difficult to keep up with.
Firstly, both CUDA and Stream/OpenCL use a limited version of C; they're not directly compatible of course but, afaik, they have similar limitations in the instruction set and in the ways they can be utilised.

Secondly, I apologise, as that wasn't quite what I meant to say. What I meant was that, to move on from the comparatively limited adoption of GPU compute in mainstream software (comparatively few programs can even use more than a single core, though that is increasing; NB I am discounting things like PhysX and Stream in games), now that Stream/OpenCL has reached the maturity and robustness needed to make its use sensible, there would be more value in using it in mainstream apps. Whilst nVidia throw money at some companies to help them with CUDA, there are limits both in direct implementation (it's under a proprietary license agreement which prevents either modification of the code, which could improve speed/versatility/function for specific tasks, or reverse engineering for code optimisation) and in usage (since obviously only systems with nVidia GPUs benefit).

Okay, arguably, the number of companies/programmers who are going to tweak OpenCL would be limited. However, now that OpenCL is robust, if there is a choice between leveraging the parallelism of a smaller or a larger pool of GPUs within a product for the mainstream market, it will make sense to aim for the latter. All you need is one product within a segmented market to do so in a useful way and, provided it runs roughly as fast as any pre-existing CUDA implementation, it will have a competitive advantage, because it will also run on non-nVidia GPUs.
I guess, though, as an example, we'll see how CUDA and OpenCL compare semi-directly if Adobe do introduce OpenCL to the Mercury Engine in CS6 later this year. As most of Adobe's software is accelerated through OpenGL, and there seems to be no strong recommendation on choosing or avoiding one GPU manufacturer over the other for any bit of CS5 that doesn't use the Mercury Engine, it would be a reasonably fair test. That's not to say CUDA won't continue to be used in certain circumstances, and I should have been clearer about what I meant.
IBM Power chips have 4 threads per core, so yes, it has been explored.

You should think of hyperthreading as a means to hide memory latency. In a nutshell, it allows work to be done while data is being fetched. Out of the nutshell, it allows work to be done while waiting for other parts of the chip to respond or become available, as well as masking memory latency. Like a lot of good things, it cannot scale forever: there is only so much idling going on in the execution engines, and only so much that extra threads can do to reduce it. Past a certain point, adding more threads buys only a smattering of extra performance in exchange for extra area and power on the chip. Given that CPUs are general-purpose devices, they are endowed with features that benefit the general array of tasks they'll be given; if another thread per core only sometimes helps, and only when this, that and the other condition are met, it will hardly seem worth the effort.

IBM server chips have 4 threads per core, but they are also in-order designs. The potential for threads to stall is greater in such schemes, so having 4 threads helps keep the core busy doing something useful as much as possible. Four threads in an out-of-order design would not have the same benefit. Remember, too, that IBM machines run IBM-designed operating systems and often IBM-designed software; hardware and software are designed with each other in mind. There are lots of threads, and it's done in an (IBM) hardware-friendly way.

You're ignoring that the idea was not just to share the FPU but a lot of the areas around the integer cores; CMT is NOT just a sharing of the FPU. And are you sure AMD couldn't do SMT? I don't remember seeing that on realworldtech.com, and given that CMT is _a lot_ more involved than 'mere' SMT, it's an odd point to make. IP should not be an issue; IBM has been using SMT for years.
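The latency-hiding and diminishing-returns argument can be sketched with a toy model (my own simplification, not a model of any real chip): an in-order core where each thread does 1 compute cycle and then stalls for 3 cycles waiting on memory. With round-robin issue, utilization of the single execution engine climbs as hardware threads are added, then saturates once the stalls are fully covered, after which extra threads buy nothing.

```python
# Toy model of SMT as latency hiding on an in-order core.
# Each thread's pattern: 1 busy cycle, then `stall_cycles` waiting on memory.
# While one thread stalls, the core can issue from another ready thread.

def utilization(threads: int, stall_cycles: int = 3) -> float:
    """Fraction of cycles the single execution engine does useful work."""
    period = 1 + stall_cycles          # one thread's busy+stall cycle length
    # Each period offers `period` issue slots; each resident thread can fill
    # at most one of them, so utilization caps out once threads == period.
    return min(threads, period) / period

for n in (1, 2, 4, 8):
    print(n, utilization(n))  # 1 -> 0.25, 2 -> 0.5, 4 -> 1.0, 8 -> 1.0
```

Note how going from 1 to 2 threads doubles utilization, 4 threads saturates it (which is roughly why in-order server cores like POWER carry 4 threads), and 8 threads adds nothing - the "smattering of extra performance" point above. An out-of-order core starts from a much higher single-thread utilization, so the same extra threads buy correspondingly less.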
Pretty sure the concept isn't patented by any one party, either (you can't copyright an idea).