@SteveRichter:

AFAIK it's the memory / cache architecture bottleneck. A core will only run at maximum speed if the currently executing code AND the data it accesses fit in its level 1 (L1) caches.
Today's L1 caches are only about 32-64 KB! L2 is about 256 KB; L3 is a few MB, but shared by all cores!
Beyond these caches, all cores compete for the same, much slower main (flat) memory (typically 4 GB today).
Now compare these numbers with the RAM consumption of today's apps / OS, each needing many MB, adding up to GBs.
Another problem I see in software is the often inappropriate, aimless use of too many threads (explicit, or hidden inside overrated 'smart' libs): after each context switch, the incoming thread's code and data push the previous thread's contents out of the caches.
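
To make the working-set argument concrete, here is a toy simulation (not a benchmark). It assumes a single 32 KB, fully associative LRU cache with 64-byte lines; real L1 caches are set-associative, use other replacement policies and have prefetchers, so take the numbers as an illustration only. The names `LRUCache`, `sweep`, `single_task` and `two_tasks_interleaved` are made up for this sketch.

```python
from collections import OrderedDict

LINE = 64  # bytes per cache line (assumed; typical for x86)


class LRUCache:
    """Toy fully associative cache with least-recently-used eviction."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes // LINE  # number of lines that fit
        self.lines = OrderedDict()
        self.hits = 0
        self.misses = 0

    def access(self, line_id):
        if line_id in self.lines:
            self.lines.move_to_end(line_id)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            self.lines[line_id] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict least recently used

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def sweep(cache, owner, working_set_bytes):
    """Touch every line of one task's working set once, in order."""
    for i in range(working_set_bytes // LINE):
        cache.access((owner, i))


def single_task(working_set_bytes, passes=10, cache_bytes=32 * 1024):
    """One task repeatedly sweeping its working set."""
    cache = LRUCache(cache_bytes)
    for _ in range(passes):
        sweep(cache, "A", working_set_bytes)
    return cache.hit_rate()


def two_tasks_interleaved(working_set_bytes, passes=10, cache_bytes=32 * 1024):
    """Two tasks sharing one cache, switching after every sweep."""
    cache = LRUCache(cache_bytes)
    for _ in range(passes):
        sweep(cache, "A", working_set_bytes)  # time slice of task A
        sweep(cache, "B", working_set_bytes)  # context switch to task B
    return cache.hit_rate()


if __name__ == "__main__":
    print(single_task(16 * 1024))            # fits in the cache: 0.9
    print(single_task(64 * 1024))            # twice the cache:   0.0
    print(single_task(24 * 1024))            # fits when alone:   0.9
    print(two_tasks_interleaved(24 * 1024))  # two such tasks:    0.0
```

In this model a 24 KB working set runs almost entirely from cache when it has the core to itself, but two such tasks taking turns evict each other's lines completely. That is the worst case for LRU with a cyclic access pattern; real hardware degrades less abruptly, but in the same direction.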

To solve these problems, BOTH the CPU architecture AND especially the software design have to change fundamentally.
Until this happens, adding more and more cores will not provide that much more performance; it simply can't scale up that fast.
IMHO Intel knows very well how to solve the hardware side (they have prototypes, and others have done it before), but Microsoft does NOT know how to solve the SW / OS design side.
MS is wasting enormous resources on confusing, worthless Metro/WinRT concepts instead of a fresh, clean and lean OS restart.