AMD Fusion Developer Summit 11

AFDS Keynote: Herb Sutter - Heterogeneous Computing and C++ AMP


Herb Sutter introduces the AMD Fusion Developer Summit 11 crowd (and the world!) to Microsoft's view on heterogeneous computing in the concurrency age and introduces one of Microsoft's upcoming technologies for democratizing GPGPU/APU/Multi-Core/Many-Core programming for native developers: C++ Accelerated Massive Parallelism or C++ AMP. Look for C++ AMP and associated tooling in the next version of Visual C++.

Big thanks to AMD for generously providing Channel 9 with this outstanding content!

Herb and the C++ AMP team state: C++ AMP will lower the barrier to entry for heterogeneous hardware programmability, bringing performance to the mainstream. Developers will get an STL-like library as part of the existing concurrency namespace (whose Parallel Patterns Library (PPL) and Concurrency Runtime (ConcRT) are also being enhanced in the next version of Visual C++), so they won't need to learn a different syntax or use a different compiler.

C++ AMP is an open specification.

You learned in the C++ Renaissance conversation with Mohsen Agsen and Craig Symonds that the C++ team was on a path of innovation. C++ AMP is a concrete example of what Mohsen and Craig were talking about.











    The Discussion

    • Fabio Galuppo

      Great! Congratulations to Herb and WinC++ team...

    • Michael Lewis

      Is this much different from OpenCL, CUDA, or rather the Thrust library that ships with CUDA now? Probably not.

    • Charles

      @Michael Lewis:  This is a C++ advancement. OpenCL is great, but has nothing to do with adding GPGPU support to C++.

      Cuda is nVidia-only. C++ AMP works for any DX compatible hardware and is an open specification, which means you are not required (as other implementations arise) to use Microsoft's C++ toolchain, but you will be required to use C++.


    • R J

      PPL and AMP sound promising, especially since AMP is an open specification and has an STL-like interface, but in my opinion it is going to be adopted by the community only if it has implementations for platforms other than Windows/DirectX.

    • new2STL

      Brilliant, surely the first step in integrating C++ across multiple CPUs and GPUs.

      @Charles: I'm looking further in this direction. Currently DC, OCL and CUDA all use a C-like language, and all three have shown a desire to move to something C++-like or to offer a C++-friendly API. Now, with Unified Memory Addressing in CUDA, I think the other two can also develop this feature and take the next step toward a fully integrated massive-parallelism library/API. With AMP being an open standard, no doubt it can be ported to other platform combinations and influence OCL/CUDA.

      Again, great job. My fingers are itching to use it.

    • PhrostByte

      I'll be interested to see if GCC and ICC implement this extension.  Lots of work but possibly well worth it.  For now, it sounds like a great way to quickly add some GPU support to apps intended for Windows Vista and up.  For anything truly portable, OpenCL will remain king.

      Does AMP support all the GPU-specific stuff, like math function builtins and texture units?  How about 2D/3D data?  Can I specify that a buffer should stay in GPU memory (or to be generic: in the memory closest to the execution unit), so it's not passed back to system RAM in a multi-pass function?  Exciting!

      Downloading the video right now.  Been waiting for this all day.

    • Charles

      Herb says:

      The main reason we decided to build a new model is that we believe there needs to be a single model that has all of the following attributes:

      C++, not C: It should leverage C++'s power for strong abstraction without sacrificing performance, not just be a dialect of C.

      Mainstream: It should be programmable by millions of developers, not just by a priesthood. Litmus test: Is the Hello World parallel GPU program a page and a half, or a couple of lines?

      Minimal: It adds just one general-purpose language extension that addresses not only the immediate problem (dealing with cores that can't support full C++) but many others. With the right general-purpose extension, the rest can be done as just a library.

      Portable: It allows shipping a single EXE that can use any combination of GPU vendors' hardware. The initial implementation uses DirectCompute and supports all devices that are DX11 capable; DirectCompute is just an implementation detail of the first release, and the model can (and I expect will) be implemented to directly talk to any interesting hardware.

      General and future-proof: The initial release will focus on GPU computing, but it's intended to enable people to write code for the GPU in a way that in the future we can recompile with few or no changes to spread across any and all accessible compute cores, including ones in the cloud.

      Open: I mentioned that Microsoft intends to make the C++ AMP specification open, and encourages its implementation on other C++ compilers for any hardware or OS target. AMD announced that they will implement C++ AMP in their FSA reference compiler. NVidia also announced support.


      Please ask questions.


    • new2STL

      @Charles: we have two distinct posts for the same video. Wouldn't it be better (if there is a way) to have the two merged before people ask questions?

      * (channel9::Events)AMD-Fusion-Developer-Summit/AMD-Fusion-Developer-Summit-11/KEYNOTE
      * (channel9::posts)AFDS-Keynote-Herb-Sutter-Heterogeneous-Computing-and-C-AMP

    • Michael Lewis

      Thrust library gives the C++ interface to CUDA, i.e. no low level CUDA within Thrust. AMP will become the C++ equivalent of the Microsoft 'Accelerator' library; for me it looks like the MSFT equivalent of Thrust for DirectX.

    • Mike Gibson

      This is awesome. Given that the restrict keyword and writeonly additions are totally new, how willing are you to allow the legendary C++ language community to figure out how to “do it better”? I can foresee that many folks won’t like how MS has done this stuff.

      I think that the writeonly idea is even more broadly applicable than the restrict stuff.

      What you’re doing with restrict is just adding a way to add more qualifiers, like const and volatile. I see the need to qualify functions/lambdas/etc. in a number of ways, but just adding more and more keywords isn’t the way to go. Perhaps you extend restrict to not just this parallel stuff, but to include *all* the restrictive qualifiers that one can use, like const and volatile. Then the writeonly stuff can just be included directly along with the rest of the restrictions.

      I can envision a system whereby you can create new restriction types based on other restriction types, all built up from a core set of restrictions like const, volatile, writeonly, pure, direct3d, etc.

    • Steve Miller

      Love It! Just a short question:
      Will this only work on DX11 hardware? I was thinking about Windows Phone, which currently only supports DX9. Personally, I think that the more resource-constrained the hardware is, the more you have to go to the metal.

    • Charles

      [snip] we have two distinct posts for the same video. Wouldn't it be better (if there is a way) to have the two merged before people ask questions?

      Yes. Event posts do not end up in our default RSS feed, and I wanted to give this maximum exposure. What I should have done, in retrospect, is add these to an Event for archival purposes and ease of discovery in the Events section. I've asked that we merge these two threads to keep comments in one place.


    • Charles

      @Steve Miller: DX11

    • Charles

      [snip] for me it [AMP] looks like the MSFT equivalent of Thrust for DirectX.

      C++ AMP isn't just an STL-style wrapper on top of DirectX... While it's great to see that Thrust provides a C++ interface to CUDA, you shouldn't assume that this alone makes it the equivalent of C++ AMP, since C++ AMP is much more than a C++ wrapper for DirectX...

      One obvious difference is that Thrust (since it's a CUDA wrapper) only targets NVIDIA, whereas C++ AMP targets any DX-compatible hardware. The language extension in C++ AMP also adds more value to C++ for general-purpose programming outside the context of GPGPU, as Herb stated in his keynote address. The bullet points of Herb's that I posted here last night should make this clear.

      There are several other differences that I can't share with you since I'm not the one who can. But this is C9 and we like questions to at least get a reply...

      You'll find out many more technical details when the time is right and you'll hear from the right people...



    • Speed8ump

      I'm not that familiar with existing GPGPU systems (I've used CUDA once in a demo). The ability to compile once and execute anywhere with a single binary reminds me of Java's promises. You're implying that the host CPU (running instruction set A) can load and execute code onto some other processor (running instruction set B, B != A). Doesn't that imply there's a compiler for instruction set B on the host CPU? Will this compiler be provided by directX?

      When you do this won't you have everything in place to also (finally) support the C++ 'export' keyword?

    • Tomas

      @Speed8ump: Such compilers already exist and are shipped with current graphics drivers. For example, OpenGL drivers that support the GLSL shading language expose functions that accept GLSL source code as a string and compile it to the graphics card's instruction set.

    • DeadMG

      @Speed8ump: That's already exactly what happens with HLSL. There's nothing new here from that perspective.

      Firstly, instruction set B is known in advance, so you can compile it off-line, there's no need to compile on the target, but secondly, Direct3D already ships HLSL compilers that you can call at run-time if you like.

      Oh, and thirdly, export is gone in C++11, so Microsoft have no reason to support it.

    • new2STL

      @Speed8ump: To add my $0.02 here: the code generated by GPU compilers is close to the metal, but it is still an intermediate language that follows a strict ISA and is converted on the fly. Take the AMD Radeon HD 5xxx and 6xxx as an example: some models have a SIMD width of 5 and others a width of 4, with a variable number of cores. When you generate the compiled shader (or now GPGPU code), you target the ISA and the driver applies the hardware-specific adjustments.

      HLSL/DirectCompute generates code for a kind of ISA too, which guarantees that the code runs on a variety of hardware that follows that interface: in this case, shader model 5 (with backward compatibility to 4 and 3).

      CUDA uses PTX, as does NVIDIA's OpenCL implementation.

      The interesting thing about such an ISA is that it can be converted not only to GPU assembly but to x86 or any other instruction set too.

    • new2STL

      Having written about the virtual ISA, I think I now better understand the part about it being minimal and portable. The restrict() keyword can be used to target the ISA, but in that case I think it would be better to name the restriction something like sm_5 or dc_5 instead of "directx".

      Unless the idea is to think ahead and name the next restrict() targets "opencl" and "cuda". But again, which version of the DirectX, OpenCL and CUDA ISA?

      (DirectX shader models: 5, 4, 3 ?)

      (OpenCL: 1.0, 1.1 ?)

      (CUDA compute capability: 1.0, 1.1, 1.2, 1.3, 2.0, 3.0 ?)

    • Herb Sutter

      @new2STL: One of the design features of restrict() that I didn't mention was versioning, mostly because it doesn't matter so much in the initial release. But it's already been designed, in particular that restrict(direct3d) implies restrict(direct3d:11) and that version numbers can grow as language support improves (e.g., restrict(xyzzy:2) is a strict superset of restrict(xyzzy:1)). With each release we can bump the implicit default since it's backward-compatible, so you can just ignore the version number most of the time, unless you want to overload on the version, which also just works naturally.
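
As a rough illustration of the versioning rule described here, the sketch below models "version 2 is a strict superset of version 1" in standard C++ using tag types and inheritance. This is only an analogy: it is not restrict() syntax, and the names `xyzzy_v1`, `xyzzy_v2`, `xyzzy_default`, `legacy_kernel` and `kernel` are hypothetical.

```cpp
#include <cassert>
#include <string>

// Hypothetical version tags. v2 derives from v1 to model
// "v2 is a strict superset of v1".
struct xyzzy_v1 {};
struct xyzzy_v2 : xyzzy_v1 {};

// Code written against version 1 only: still callable with a v2 tag,
// because a v2 tag converts to its v1 base.
inline std::string legacy_kernel(xyzzy_v1) { return "v1 code"; }

// Code overloaded on the version: the most specific match is selected.
inline std::string kernel(xyzzy_v1) { return "v1 path"; }
inline std::string kernel(xyzzy_v2) { return "v2 path"; }

// Stand-in for the "implicit default" that a later release could bump
// without breaking code written against v1.
using xyzzy_default = xyzzy_v2;
```

Calling `legacy_kernel(xyzzy_default{})` selects the v1 code unchanged, while `kernel(xyzzy_default{})` picks the v2 overload, which mirrors "ignore the version most of the time, overload on it when you need to".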

      @Michael: Thrust is fine for CUDA, and we think it's great that there's a CUDA wrapper for C++. We think there are additional advantages to be had by targeting C++ directly instead of as a wrapper for something underneath that was designed for C. For example, I mentioned in the talk that C++ AMP is language-integrated with one very general feature that works with everything that's already there (templates, overloading, ...) rather than multiple special-purpose features. It also doesn't force the user to write explicit .copy()/.sync() because array_view provides a stronger abstraction that can abstract away both today's and tomorrow's memory models.
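
To make the point about avoiding explicit .copy()/.sync() calls more concrete, here is a minimal sketch in standard C++ of a view that tracks where the freshest data lives and copies only when needed. It is not the C++ AMP array_view API: `lazy_view`, `device_data` and `double_then_increment` are hypothetical names, and the "device" is just a second std::vector. Host-side writes after the first transfer are ignored in this simplified version.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an array_view-like wrapper.
class lazy_view {
public:
    explicit lazy_view(std::vector<int>& host) : host_(host) {}

    // Called by a "kernel": copies host data to the device on first use only.
    std::vector<int>& device_data() {
        if (!on_device_) {
            device_ = host_;
            on_device_ = true;
            ++copies_in_;
        }
        device_dirty_ = true;  // assume the kernel may write
        return device_;
    }

    // Called by host code: copies results back only if the device wrote.
    int operator[](std::size_t i) {
        if (device_dirty_) {
            host_ = device_;
            device_dirty_ = false;
            ++copies_out_;
        }
        return host_[i];
    }

    int copies_in() const { return copies_in_; }
    int copies_out() const { return copies_out_; }

private:
    std::vector<int>& host_;
    std::vector<int> device_;
    bool on_device_ = false;
    bool device_dirty_ = false;
    int copies_in_ = 0;
    int copies_out_ = 0;
};

// A two-pass "kernel": the data stays on the device between passes.
inline void double_then_increment(lazy_view& v) {
    for (int& x : v.device_data()) x *= 2;
    for (int& x : v.device_data()) x += 1;
}
```

In this sketch the two passes cause one transfer in and, once the host reads a result, one transfer out. The caller never writes a copy or sync call.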

    • new2STL

      @Herb Sutter: " But it's already been designed, in particular that restrict(direct3d) implies restrict(direct3d:11) and that version numbers can grow as language support improves"

      Thank you, nice to know. I see it now in the slides from Daniel Moth.

      I always think this sort of stuff is important because I like to think ahead and prepare myself (you never know when a program source will return to you like a boomerang!)

    • Spongman

      ok, i love the fact that I can write 'portable' C++ code in my pixel shaders. but the problem is that as soon as I write "restrict(direct3d)", my code is, by definition, non-portable. how are you guys going to address portability between 'restrictions' implemented by different compilers? wouldn't it have been better to use non-platform-specific restriction identifiers?

      who writes/distributes the AMP runtime components? are they a standard OS component (.so/.dll), or are they a .lib baked in at compile-time? how will these be serviced (apropos GDI+ nightmare)?

      will non-Microsoft, direct3d-targeting compilers use Microsoft-supplied direct3d runtimes, or will they have to write their own?

    • Charles
    • Spongman

      @Charles: errr, i don't think herb's comment there answers any of my questions.

    • Herb Sutter

      @piersh: The use of DirectCompute under the covers is a "version 1" implementation detail that lets us get wide hardware reach right out of the gate, but the programming model doesn't require or assume that implementation. Like the existing PPL parallel_for_each, the high-level algorithm could be implemented under the covers on top of anything from an old-style thread pool (though you wouldn't really want that because of the performance), to a work-stealing runtime like ConcRT (much better performance and what we use for multicore in C++ PPL today), to DirectCompute (what we'll be using in C++ AMP v1 for GPUs), to native access to particular GPUs, to other underpinnings. The idea is to have a common programming model that is flexible and abstracted enough to let us and other implementers keep delivering engineering improvements in the underlying runtime without disturbing the developer's code.

      As for naming, we did consider (and are still considering) whether "direct3d" is the right name for this particular restriction qualifier; there are a number of reasons for and against when considering how this would evolve in the future as restrictions relax over time.
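
The idea of one programming model over interchangeable runtimes can be sketched in standard C++ as follows. This is only an illustration under stated assumptions: `inline_backend`, `queued_backend`, `for_each_index` and `squares` are hypothetical names, and neither backend is ConcRT or DirectCompute. Both run on a single thread and merely stand in for "different implementations under the covers".

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <vector>

// Backend 1: run every iteration immediately, in index order.
struct inline_backend {
    template <class F>
    void run(std::size_t n, F body) {
        for (std::size_t i = 0; i < n; ++i) body(i);
    }
};

// Backend 2: enqueue one task per index, then drain the queue from the
// back, to show that the algorithm promises no particular order.
struct queued_backend {
    template <class F>
    void run(std::size_t n, F body) {
        std::deque<std::function<void()>> tasks;
        for (std::size_t i = 0; i < n; ++i)
            tasks.push_back([body, i] { body(i); });
        while (!tasks.empty()) {
            tasks.back()();
            tasks.pop_back();
        }
    }
};

// The developer-facing algorithm: user code calls this, and the backend
// can be swapped without touching the loop body.
template <class Backend, class F>
void for_each_index(Backend backend, std::size_t n, F body) {
    backend.run(n, body);
}

// User code, written once against the common model.
template <class Backend>
std::vector<int> squares(std::size_t n) {
    std::vector<int> out(n);
    for_each_index(Backend{}, n, [&out](std::size_t i) {
        out[i] = static_cast<int>(i * i);
    });
    return out;
}
```

`squares<inline_backend>(4)` and `squares<queued_backend>(4)` produce the same result even though the iterations run in a different order, because each iteration writes only its own element.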

    • primeMover

      Microsoft again late to the party?

      C++AMP only addresses a very small niche within the community of C++ developers: Those that have only customers with Windows machines, customers that do not run XP anymore, customers with DX11 gfx cards and developers that don't utilize the GPU yet.

      It's again Windows only, because it's an "open" standard based on some obscure Microsoft technology.

      I've used the PPL in VS2010 but I must admit that I was shocked to see that it performed way worse than Intel's TBB, OpenMP and even C# in several performance critical parts of my code.

      Maybe Microsoft should have a look or two at some "old" technology (PPL) your customers rely on.

      What strikes me most, though, is the fact that Herb Sutter's mantra was "forget about the GPU when it comes to parallelism". What led to that change?


    • Ian

      Herb, why was the restrict keyword not implemented as a C++11 attribute? Was that design considered? If so, why was it rejected? C++ AMP looks awesome, regardless.

    • Charles

      @primeMover: It's an open specification. It is therefore not platform-specific, in principle (and in practice). Did you not watch or listen to Herb's keynote? Fast forward to the 54 minute mark and listen to what he says.


    • Herb Sutter

      @Ian: Thanks! Alas, attributes aren’t the answer because they’re just decorations, not part of the language. For example, you can’t overload on them.
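
The difference between a qualifier and an attribute can be seen in standard C++ today. The sketch below uses const as the example qualifier (restrict() itself is not standard C++), and `Buffer`, `answer`, `pick_mutable` and `pick_const` are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical type used only for illustration.
struct Buffer {
    // const is part of the function's type, so these are two distinct
    // overloads and the compiler picks one based on the object.
    std::string describe()       { return "mutable"; }
    std::string describe() const { return "const"; }
};

// An attribute is not part of the signature. Uncommenting the second
// definition below would be a redefinition error, not a new overload.
[[nodiscard]] inline int answer() { return 42; }
// inline int answer() { return 43; }   // error: redefinition of 'answer'

inline std::string pick_mutable() {
    Buffer b;
    return b.describe();   // selects the non-const overload
}

inline std::string pick_const() {
    const Buffer b{};
    return b.describe();   // selects the const overload
}
```

Because the qualifier takes part in overload resolution, the same call expression `b.describe()` resolves to different functions. An attribute cannot do that.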

    • primeMover

      @charles: Nope. I've already heard that too often to believe it. Is there any reference? ISO, ECMA? Or is there just the claim?

      I also witnessed the claim that the PPL would be super fast, which it isn't in most cases. Where's the performance improvement for the combinable class? When are you going to invest in a better implementation of the PPL? Or is C++AMP the new PPL? Will you introduce an EASY migration path?

      As long as there are fewer than 3 implementers of this "Open Spec", this is again a huge step in the wrong direction.
