Daniel Moth: Blazing-fast code using GPUs and more, with C++ AMP

46 minutes, 22 seconds



Herb Sutter recently announced C++ AMP at the AMD Fusion Developer Summit as part of his keynote. Here, Daniel Moth, a program manager on Microsoft's Parallel Computing Platform Team, digs deeper into C++ AMP with code samples and more. Please download the slides from the link below as the recording of this session doesn't do them justice.

Big thanks to AMD for providing Channel 9 with this excellent content!

To get full performance out of mainstream hardware, high-performance code needs to harness not only multi-core CPUs but also GPUs (whether discrete cards or integrated in the processor) and other compute accelerators to achieve orders-of-magnitude speed-ups for data-parallel algorithms. How can you, as a C++ developer, fully utilize all that heterogeneous hardware from your Visual Studio environment? How can your code benefit from this tremendous performance boost without sacrificing your developer productivity or the portability of your solution? This session answers those questions by introducing a new technology from Microsoft.

Get the slides to this presentation here.

Learn more about C++ AMP:


Follow the discussion

  • DeadMG

    The projector is way too bright- can't see any of the code. Good job posting up the slides.

  • Charles (Welcome Change)

    @DeadMG: It was filmed from the back of the room and the lighting wasn't ideal as you've noted. Best to download the slides and follow along. At least we have a recording of this!


  • Bass (Knows the way the wind is flowing.)

    Very nice stuff.

  • DeadMG

    @Charles: Do you know if we can use arrays for indices? Like, imagine that I have an array of input and an array of output, and when the input changes, I want to mark some elements as "dirty" and then re-process them, without re-doing the whole array.

  • @DeadMG:

    Not sure I totally get your scenario and without you having the bits to try, not sure we can explore it much further.

    Just remember that you cannot copy a host array over to the GPU and expect a modification on the host side to affect it (without recopying back and forth). You can leave an array on the GPU and run a different kernel on it, but that different kernel would need its own algorithm to determine what data to operate on and what not – here remember that if you introduce a lot of branching in your kernel you may see a perf hit larger than the perf gain you are attempting to achieve...

    If you want to explore this further, feel free to share your CPU algorithm and we can discuss it; otherwise, I suggest waiting for the bits so you can explore it hands-on ;)




  • how are you guys going to handle versioning the restrict() restrictions?

    will we be able to include different versions of our restricted functions for different versions of direct3d? how will this work, since the override would have to happen at runtime?

  • erik

    I tried CUDA a few weeks ago and saw that during heavy GPU computations the screen does not update. This renders the use of GPU computations in regular desktop software (like hashing, encryption, etc) totally useless.

    How does DirectCompute (the core of C++ AMP) handle multiple tasks?
    E.g. doing some heavy calculation on 300 GPU cores and allowing Windows to update the screen on the remaining cores?

    I hope I am making sense here. Maybe I am looking at it the wrong way but I would really like to see GPGPU take flight. But it won't if people can't use their computer for anything else while the calculation is running.

    Doing one single task when you have over 400 cores seems highly inefficient.

  • DeadMG

    @Daniel: Have a look at this code- I think it's much easier to understand than what I actually said.


    The second algorithm which would run on the GPU undoubtedly doesn't output the correct results because I didn't communicate the changes to the input vector across to the GPU. I presume that this facility must exist, and we just haven't been shown the syntax/semantics.

  • Gordon

    It's not portable, not even close.
    Where's the OpenGL support?

    This will only run on Windows. No Linux or Mac support.

    You call this portable? Did you hit your head or something?

    Bloody hell, the nerve of calling this portable.

    It's not portable if Linux can't run it.

  • Will AMP come to .NET? I love the parallels :)

  • DeadMG

    @Gordon: "Portable" does not mean "I implemented it personally for every single platform", and especially not for the first release, which is explicitly not supporting half the things they want to support. There's no reason that it can't support OpenCL, and just because the first version doesn't, doesn't make it not portable.

  • Mm! Good video, until he says that we will never use just the GPU. I beg to differ: it is only a short while until we see the first full-GPU system. The differences needed are so minimal that we'll come to a point where everything is done on the GPU (yes, that means no more CPU), and the first to do it will probably be ATI.

  • Nice work, looking forward to hearing more about this. 
    A few questions regarding how restrict(direct3d) is intended to work:

    • Are functions marked with restrict(direct3d) still callable from CPU functions?
    • If a machine does not have a DX 11 capable device is there an implicit multi-core SSE fall back?
    • How will versioning of the direct3d restrictions be handled when DirectX 12 comes out?
  • @Fredrikkarlsson: We have nothing to announce with regards to C++ AMP technology coming directly to .NET. However, you can write a C++ AMP dll and use that from your .NET code... I'll put a sample of that on my blog at some point...

  • @drbaltazar: The comment you are referring to was on a slide that included the word "today". My following future-looking slide (which included the word "tomorrow") left the door open for whatever comes next, and our design is definitely future-proof.

  • @erik: I have not observed this with my tests. Please try DirectCompute/HLSL and see if you observe the same results. If you do, then you will with C++ AMP too, since this is a driver thing, not a programming model thing.

  • @Gordon: The word portable was within the context of hardware, which I mentioned every time I mentioned the word portable.

  • @piersh: When we release there will be no need for versioning. We have various design options for future releases where versioning may be required. Remember, the versioning would only help in relaxing restrictions and allowing you to "do more" in your kernel code, hence recompiling would be necessary regardless and you just need a way to declare what restrictions you want to adhere to.

  • @Londey: Please see my response to piersh on versioning. Yes, you can have a function be callable from both CPU and direct3d code by combining the restrictions, e.g. restrict(cpu, direct3d). This is covered in the talk. For an implicit fallback to SSE, we have nothing to announce today, but stay tuned ;)

  • @DeadMG: First let me say "wow!". I can't believe you wrote all that code without a compiler after seeing just one slidey talk. I haven't run it through the compiler, but it looks like it would compile. The only thing you need to add is a call to refresh on the input_view array_view so it can reflect the changes you made to the input vector. The other way to have done it is to use input_view directly on the CPU side to update it (and the changes would immediately propagate to input). There are more considerations (particularly around performance) depending on whether the data you access in the second kernel invocation are large/small, sparse/dense, but that will have to do for now...

    So the answer to your original question is that to use arrays as indices, you would have to do exactly what you did in your code... there are no other provisions... Feel free to contact me offline to talk about those.

  • @Daniel Moth: I think you missed the point of my question.

    in the future, suppose:

    - you have a C++ AMP compiler that targets direct3d v12, which has more capabilities than v11.

    - you want to write code that targets the version of direct3d that's available on the user's machine, but you want to take advantage of the new d3dv12 features if available.


    it would seem to me that there's definitely a need for versioning there, both at compile time AND at runtime.


    will "restrict(direct3d)" always mean v11, if so how will we restrict to v12? if not, won't that break existing code when switching compilers?

  • Charles (Welcome Change)

    @piersh: Herb answered the versioning question already: http://channel9.msdn.com/Events/AMD-Fusion-Developer-Summit/AMD-Fusion-Developer-Summit-11/KEYNOTE#c634440294490000000


  • @piersh: I don't think I missed your point. Yes, like I said, we have various design options for *future* releases where versioning will be required. Herb's reply that Charles pointed you to is one of those design options; it is not final, but shows an example (another would be compiler options, for example). I pointed out that you do not need to worry about that in our first release. HTH.

  • David

    Awesome stuff.
    Do you have any plans for making the compiler use SIMD units to optimize the parallel for_each statements too? I would find it very useful for a machine without a DX11 card to use SSE and the CPU threads as a fallback. It may also give a performance boost for situations where data copying to and from GPU is some kind of performance bottleneck.

  • @David: For SSE support, we have nothing to announce today, but stay tuned ;)

  • David

    Saving that for an Intel event focusing on AVX? :)
    Thanks for answering.

  • Charles (Welcome Change)

    One more point on the portability question. Obviously, Daniel is right. However, it's important to note (again) that an open specification means, well, an open specification.... As Herb points out at the end of his AFDS keynote:

    [54:05] -> Herb says "Microsoft intends to make C++ AMP an open specification that any compiler can implement.  And we're working with our hardware partners to help them to build C++ AMP into C++ compilers for any hardware target, for any operating system target they want.  We're helping them.  And we're also pleased to announce that one of those is AMD, that AMD will be implementing C++ AMP in their FSA reference compiler for Windows and non-Windows platforms."


  • Ben Chege

    For those looking for open source GPU libraries, http://gpgpu.org/ should provide something for you. It's heavily used by bitcoin miners.

  • Knowing me knowing you, a-haaasss C++ rules and rocks!

    @DeadMG: No, you're wrong. Portable means portable, which means that you can port it to more than one platform and/or hardware. If one cannot do that with your code, then your code isn't portable. Maybe it has the potential to be portable, but at the moment it isn't. God, and I'm the one whose first language isn't English.

  • Knowing me knowing you, a-haaasss C++ rules and rocks!

    @Daniel Moth: So basically what you're saying is that it is portable as long as Windows is installed on a machine. Great...

  • DanielMD (Indie Game Developer)

    @aasss: Portable as in WORKS on HARDWARE from DIFFERENT VENDORS. Unlike CUDA that only works on NVIDIA.

    On that note D3D is much cleaner than OGL, and you should know better than to post something like that on a MICROSOFT forum. So please do not insist unless you wanna be given the Troll Badge.

  • Very nice to have something like C++ AMP.

    I hope there will be a good .NET wrapper soon! :)

  • g227

    The biggest stop for me now is the fact that C++ Amp does not run (or does not run well) on servers. The point is that to run C++ Amp on the server one needs to have a screen attached??
    I understand that this will be fixed in the server version of Win8 but who knows when that will be released. And then it is the question of companies deciding to move to it, etc.
    Is there a chance that this will be fixed on Win2008 R2?


  • @g227: C++ AMP runs on servers, and we have early adopters doing exactly that. If you are the same GT227 that posted on the C++ AMP MSDN forum, may I suggest keeping the discussion there?
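
Going back to the exchange between DeadMG and Daniel Moth about using an array of indices to re-process only the "dirty" elements: the code DeadMG posted is not preserved here, so what follows is only a rough illustration in plain, CPU-only C++, not C++ AMP and not DeadMG's code. The names DeviceCopy, process and reprocess_dirty are hypothetical. DeviceCopy merely imitates data that has been copied to an accelerator, to show why a host-side change is invisible until the copy is refreshed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-element work; stands in for a kernel body.
inline int process(int x) { return x * x; }

// Assumption: imitates input that was copied to an accelerator. It is a
// separate copy, so later writes to the host vector are not seen by it
// until refresh() is called.
struct DeviceCopy {
    std::vector<int> data;
    explicit DeviceCopy(const std::vector<int>& host) : data(host) {}
    void refresh(const std::vector<int>& host) { data = host; }
};

// Recompute only the positions listed in `dirty`. The index array decides
// which elements are touched; on a GPU each iteration would be one thread.
inline void reprocess_dirty(const DeviceCopy& input,
                            const std::vector<std::size_t>& dirty,
                            std::vector<int>& output) {
    for (std::size_t k = 0; k < dirty.size(); ++k) {
        const std::size_t i = dirty[k];
        assert(i < input.data.size() && i < output.size());
        output[i] = process(input.data[i]);
    }
}
```

A caller would run reprocess_dirty once over all indices, change some host elements, call refresh, and then run it again with only the changed indices. Skipping the refresh leaves the old results in place, which is the problem DeadMG described. In C++ AMP itself, per Daniel's reply, the corresponding step is a call to refresh on the input_view array_view.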
