Coffeehouse Thread

18 posts

Toshiba shows a video of a demo of Cell decoding 48 MPEG-2 video streams at once

  • Deactivated User

    Comment removed at user's request.

  • Tom Servo

    It's time to ditch x86 and start from scratch.

  • msemack

    Tom Servo wrote:
    It's time to ditch x86 and start from scratch.


    They tried that; it's called the Itanium (IA-64). Didn't go over too well. That's why we have x86-64 (EM64T, AMD64, whatever).

    Why do you think x86 needs to be scrapped?  What evidence do you have that the x86 instruction set is holding back the CPU world?

    Intel and AMD have both been able to make some radical improvements with little evidence of slowing down.  We'll probably hit the limits of Moore's law long before we hit the limits of x86 emulation.

    One important thing to realize:  The Cell was demonstrated doing a few specialized tasks really well.  That doesn't make it a good general purpose architecture.

    One of our systems has a 180MHz Nexperia DSP chip that can do real-time video processing faster than a 3GHz Pentium 4.  That doesn't mean the Nexperia would make a good general-purpose CPU.

    You can't hold this up as evidence that x86 needs to be ditched.

  • rhm

    Shmuelpro wrote:


    Posted yesterday.

  • Shining Arcanine

    msemack wrote:
    Why do you think x86 needs to be scrapped?  What evidence do you have that the x86 instruction set is holding back the CPU world?


    I've read developers say that Intel did amazing things with the Itanium and its instruction set, and that an Itanium at 1.6GHz outperforms a Pentium 4 at 3.8GHz by a factor of two with recompiled code. Given that, if the instruction set isn't what's holding things back, I don't know what is.

  • msemack

    Shining Arcanine wrote:

    I've read developers say that Intel did amazing things with the Itanium and its instruction set, and that an Itanium at 1.6GHz outperforms a Pentium 4 at 3.8GHz by a factor of two with recompiled code. Given that, if the instruction set isn't what's holding things back, I don't know what is.


    And what was that recompiled code doing? You can't draw a conclusion about the overall performance of the CPU or instruction set based on a single code sample.

    Your story is no more conclusive than mine about the 180MHz Nexperia.

    Furthermore, you can't conclude that the x86 INSTRUCTION SET was the limiting factor. Remember, the Itanium has a MASSIVE cache. Perhaps the application was really cache-friendly.

  • msemack

    Just to expand on what I said:

    The Cell processor, the Itanium, and the Nexperia DSP I was talking about are all derivatives of the same design philosophy: VLIW.

    VLIW works great where you have a fixed algorithm that is highly parallel.  That's great when iterating through a massive array of data (like a video stream).  That's perfect for a fixed-function device (MPEG compressor), or a scientific application.

    In those cases, you can carefully craft your algorithm to take advantage of the CPU's various execution units.

    However, these are small slices of the software pie.  In general-purpose applications, you have a hard time keeping all of the chip's units busy.
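
    To make that concrete, here is a minimal, purely illustrative C sketch (hypothetical code, not from any real codec) of the kind of loop a VLIW or SIMD design eats for breakfast: every iteration is independent, so the compiler can unroll it and keep several execution units busy at once.

        #include <stddef.h>

        /* Hypothetical example: scale and offset every sample in a video plane.
           Iterations are independent of each other, so a vectorizing or VLIW
           compiler is free to unroll the loop and schedule several of these
           multiply-add-clamp operations per cycle. */
        void scale_plane(unsigned char *dst, const unsigned char *src,
                         size_t n, int gain, int offset)
        {
            for (size_t i = 0; i < n; i++) {
                int v = ((src[i] * gain) >> 8) + offset;  /* fixed-point scale */
                if (v < 0)   v = 0;                       /* clamp to 0..255   */
                if (v > 255) v = 255;
                dst[i] = (unsigned char)v;
            }
        }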

  • Deactivated User

    Comment removed at user's request.

  • msemack

    Shmuelpro wrote:
    The Itanium didn't succeed for two reasons: one, it was and still is extremely expensive for the normal user because it was created for the "high-end" market, and two, it is not backward compatible, unlike the new 64-bit processors (although they are not 16-bit compatible).


    Intel's plans for a low-cost Itanium were scrapped because no one was interested (Dell wasn't going to sell it).

    You underscored another point. Backwards compatibility is king. Saying "we need to ditch x86" sounds good and all, but it's not going to happen.

  • Tom Servo

    With that attitude we'll never see an overhaul of the PC architecture. The newest P4 and Athlon X2 still support freakin' real mode. I mean, what the hell!

  • msemack

    Tom Servo wrote:
    With that attitude we'll never see an overhaul of the PC architecture. The newest P4 and Athlon X2 still support freakin' real mode. I mean, what the hell!


    Why should it be overhauled? 

    Because it's old?  Certainly not.  It's a proven architecture.  Just because it's not shiny and new doesn't mean there's anything wrong with it.

    Because it's ugly? Granted, there are a lot of "warts" in the design. However, that still isn't a compelling reason. Just because the PC offends your sense of elegance doesn't mean it should be redone. It works, and it works well. Its "inelegance" doesn't interfere with me using the PC.

    You're upset about real mode being there?  Why?  Is it hurting something?  Or do you just not like it?

  • Tekmaven

    Funny thing is, if you get back to the original source with the actual picture, it's a photograph of Windows Media Player 10 playing a video.

    I can do the same thing with a program like Adobe Premiere Pro and play it in WMP10.  Smiley

  • Tom Servo

    msemack wrote:
    Why should it be overhauled?  Because it's old?  Certainly not.  It's a proven architecture.  Just because it's not shiny and new doesn't mean there's anything wrong with it. Because it's ugly?  Granted, there are a lot of "warts" in the design.
    Overhaul means mostly fixing the processor in the short term. The rest is already being taken care of; that's why we have e.g. PCIe, DDR2 and SATA (coming).

    I'm no expert on the design of x86 CPUs, but I'm sure throwing out unnecessary legacy would do the processors good. Like using the transistors that have been wasted on real mode and 16-bit protected mode functionality to introduce new stuff akin to the Cell CPU's specialized coprocessors (or whatever you call them), because SSE et al doesn't seem to cut it for decoding 48 MPEG-2 streams in realtime, or even half of that.

    Seriously, it can't be that a console that costs less than a good recent processor kicks a PC in its rear royally.

    Maybe thanks to managed code, it'll be possible to introduce a better CPU architecture, without pulling a Dell.

    --edit: Thanks Channel9 forum, for stripping the paragraphing again!

  • msemack

    Tom Servo wrote:
    I'm sure throwing out unnecessary legacy would do the processors good. Like using the transistors that have been wasted on real mode and 16-bit protected mode functionality to introduce new stuff akin to the Cell CPU's specialized coprocessors (or whatever you call them)


    Implementing that legacy stuff is a tiny portion of the overall transistor budget (like 5-10%, not counting the L2 cache). How much do you think you can get out of that extra 5-10%? Then, you have to decide if that extra 5-10% is worth the loss of compatibility.

    Furthermore, those transistors are a fixed cost.  It doesn't require any more transistors on the latest P4 than it did on the P3.  As we manage to squeeze more and more transistors onto a die, that cost becomes less of an issue. 

    The idea that x86 support is somehow "holding PCs back" is a popular one, but it doesn't really hold up.

    Tom Servo wrote:
    because SSE et al doesn't seem to cut it for decoding 48 MPEG-2 streams in realtime, or even half of that. Seriously, it can't be that a console that costs less than a good recent processor kicks a PC in its rear royally.


    You're drawing some pretty major conclusions from a sample size of 1.  You can't plot a graph very well with a single point of data.

    48 MPEG streams is an impressive feat.  However, it is a single, very specialized task.  Doing one task well doesn't mean that it will do other tasks well.  Sure, a Pentium 4 can't do it very well, but this isn't exactly something people do in the "real world".

    The Cell (and other VLIWs), shift the burden of code optimization to the compiler.  If your code has a pattern the compiler can't detect and optimize, your performance is pitiful.

    VLIW cores thrive on SIMD algorithms (like an MPEG stream). However, they don't work very well when you aren't iterating through massive data sets. Things like browsing the web, running MS Word, typing e-mail, etc. won't benefit from such a design.
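
    For contrast, a sketch of the pointer-chasing, branchy code those everyday tasks are full of (again a made-up example): each load depends on the previous one and each branch depends on the data, so there is very little for a VLIW compiler or the SIMD units to schedule in parallel.

        /* Hypothetical example: walk a linked list of UI widgets and dispatch
           an event. Every pointer load depends on the one before it, and the
           branch and indirect call depend on the data, so the work is almost
           entirely serial. */
        struct widget {
            int            wants_event;                       /* data-dependent flag  */
            void         (*on_event)(struct widget *, int);   /* indirect call target */
            struct widget *next;
        };

        void dispatch_event(struct widget *list, int event)
        {
            for (struct widget *w = list; w != NULL; w = w->next) {
                if (w->wants_event)
                    w->on_event(w, event);
            }
        }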

  • Mike Dimmick

    I don't know exactly how modern x86s are constructed, but I'd take a guess that the legacy processor instruction sets are implemented in slow microcode rather than in direct translation. That's a small amount of (flash) ROM area, which I would expect to be dwarfed by the branch prediction and scheduling units. And especially by the cache.

    Modern x86s don't really implement the x86 instruction set internally. They have internal microinstructions which actually execute on the core. To get from x86 instructions to microoperations (uops) [Intel terminology] either requires a microcode program (slow) or circuitry which converts the x86 instruction into a small number of uops (much faster). The resulting uops are sent to the instruction scheduling circuitry which decides which operations will execute on which functional units for the next cycle.
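
    As a toy illustration of that cracking step (an invented encoding, nothing like Intel's real uop format), a memory-operand x86 add can be thought of as splitting into a load, an ALU op, and a store:

        /* Toy sketch: one complex x86 instruction, "add [ebx], eax", expressed
           as three simpler micro-ops. The register numbers and uop layout are
           made up for illustration only. */
        enum uop_kind { UOP_LOAD, UOP_ADD, UOP_STORE };

        struct uop {
            enum uop_kind kind;
            int dst;     /* destination register or temporary */
            int src1;    /* first source                      */
            int src2;    /* second source (if any)            */
        };

        static const struct uop add_mem_eax[] = {
            { UOP_LOAD,  100, 3,   0 },   /* tmp100 <- memory at [ebx]  */
            { UOP_ADD,   100, 100, 0 },   /* tmp100 <- tmp100 + eax     */
            { UOP_STORE, 3,   100, 0 },   /* memory at [ebx] <- tmp100  */
        };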

    If you ditched the older modes you would be unable to use existing BIOS code (which is 16-bit) and would lose the ability to run legacy operating systems (OS boot code is 16-bit). That includes the recently-released Windows XP x64!

    These days ICs are generally designed using hardware description languages, which are a lot like programming languages. I suspect the behaviour is largely the same for hardware developers as software developers - don't change code unless you have to.

  • msemack

    You are right. We won't be able to ditch real mode support until the PC world replaces the BIOS with EFI.

    I wish that would happen... but it probably won't for many more years.

  • TwoTailedFox

    Shining Arcanine wrote:


    I've read developers say that Intel did amazing things with the Itanium and its instruction set, and that an Itanium at 1.6GHz outperforms a Pentium 4 at 3.8GHz by a factor of two with recompiled code. Given that, if the instruction set isn't what's holding things back, I don't know what is.


    There is a reason the Itanium isn't selling well.

    It's all down to how well the 32-bit IA-32 code runs on the Itanium. It runs *horribly* slowly... last I read, it ran at a 10x speed penalty. And since the majority of customers run 32-bit software... no, the Itanium would be a very poor investment.

    Hence why AMD came up with x86-64 (and what Intel ripped to create EM64T): simultaneous 32-bit and 64-bit code execution, using the native x86 architecture to run IA-32 code at full speed.

    I'm interested to see how long the Itanium series will last with the Opteron and Xeon chips on the market.

  • Mike Dimmick

    TwoTailedFox wrote:
    It's all down to how well the 32-bit IA-32 code runs on the Itanium. It runs *horribly* slowly... last I read, it ran at a 10x speed penalty. And since the majority of customers run 32-bit software... no, the Itanium would be a very poor investment.


    The hardware IA32 emulation is simply a direct translation, using a microcode ROM, into core instructions, I believe. It does not perform out-of-order execution, and it does not do advanced instruction scheduling. In other words, it runs a lot like a 1.6GHz 486. I'm not even sure it manages to do much parallelism.

    Note that Itanium processors don't do any out-of-order execution or rescheduling on the native instruction set either. Itanium code is explicit about which instructions can be executed in parallel and where some operations must wait until previous operations have all completed.

    To speed things up a bit, Intel and Microsoft have released the IA-32 Execution Layer. This is a piece of software which, instead of switching the processor into IA-32 mode, translates the instructions to Itanium Processor Family instructions. It's basically a JIT compiler, but I believe it has the ability to recompile 'hot' blocks of code to improve performance. Because it has the ability to look ahead further, it can do a better job of the translation.
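
    That "recompile the hot blocks" idea is the standard dynamic-translation trick. A bare-bones sketch of the dispatch loop (a generic pattern in C, not the actual IA-32 Execution Layer code) looks something like this:

        #include <stdint.h>

        #define HOT_THRESHOLD 50   /* arbitrary cut-off for this sketch */

        typedef void (*native_block)(void);

        struct block {
            uint32_t     guest_pc;     /* address of the x86 basic block          */
            unsigned     exec_count;   /* times we've run it the slow way         */
            native_block translated;   /* cached native code, NULL until compiled */
        };

        /* Stand-ins for the slow path and the optimizing translator. */
        extern void         run_block_slowly(uint32_t guest_pc);
        extern native_block translate_block(uint32_t guest_pc);

        void run_block(struct block *b)
        {
            if (b->translated) {                 /* already compiled: run native code */
                b->translated();
            } else if (++b->exec_count >= HOT_THRESHOLD) {
                b->translated = translate_block(b->guest_pc);   /* block proved hot */
                b->translated();
            } else {
                run_block_slowly(b->guest_pc);   /* cold code: cheap, unoptimized path */
            }
        }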

    So why buy an Itanium at all? While it may stink on x86 code, on its native instruction set it's better at integer calculations than most x86s (clock-for-clock). Only recently are x86s beating Itanium, and only because the clock rates - both CPU and front-side bus - are far higher. For example, on SPEC CINT2000, an HP Integrity rx4640-8 1.6GHz/9MB cache Itanium 2 computer scores 1590 base, 1590 peak. A ProLiant DL360 G4p (3.6GHz, Intel Xeon) scores 1675 base, 1712 peak (the difference in base and peak scores can be explained by the hardware instruction scheduling). Not currently looking great - Intel need to improve the Itanium clock speeds.

    On floating point it's better. The Itanium scores 2712 while the Xeon has 1800, 1825 peak. That's with compiler options set that should make use of SSE3 rather than x87 floating point. I can't currently find any Opteron or Xeon results on www.spec.org that use x64 instructions on Windows - there's a Fujitsu Opteron 252 running Linux that scores 1579/1759 integer, but they don't seem to have submitted FP results.
