Programming in the Age of Concurrency: The Accelerator Project
- Posted: Aug 25, 2006 at 9:58 AM
- 52,457 Views
- 32 Comments
watching it now..
12:35 - hahaha I love Charles
Awesome! I'll have to download it and check it out.
I like it. I was doing something similar with this 2 years ago, except mine you had to write the PS code by hand so it wasn't nearly as cool.
I am troubled by how many times Charles is the one trying to introduce groups to what other people are working on. It would be great to improve the communication that should be going on while at the same time cutting down on the e-mail that is overwhelming you all there.
If you can improve communication and get all these ideas to work together, I see a real future for some of the things that may be possible in the next 10-15 years. Also, if Microsoft can learn from its mistakes and be more agile, release more CTPs (Community Technology Previews), and take the feedback from them, maybe Microsoft can get it great in the second version instead of the third.
I wonder if it's possible to take expression trees generated by LINQ and transform them into parallelisable computations. I suppose it really comes down to "map" and "reduce" functions in the end. Whilst you are kind of limited to pure arithmetic operations on the GPU, the future of multi-cores certainly could widen the scope.
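To make the "map" and "reduce" idea concrete, here is a minimal Python sketch. It is not LINQ and not Accelerator: the nested-tuple expression tree, the `OPS` table, and `parallel_map_reduce` are all hypothetical names invented for illustration. The tree is evaluated element by element over chunks of the input (the map), and the per-chunk results are then combined (the reduce).

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import operator

# Hypothetical expression tree: nested tuples of (op, left, right).
# The string "x" stands for the current input element.
OPS = {"add": operator.add, "mul": operator.mul}

def evaluate(node, x):
    """Evaluate the expression tree for a single element x."""
    if isinstance(node, tuple):
        op, left, right = node
        return OPS[op](evaluate(left, x), evaluate(right, x))
    if node == "x":
        return x
    return node  # a numeric constant

def map_chunk(tree, chunk):
    """The 'map' step for one chunk of the input."""
    return [evaluate(tree, x) for x in chunk]

def parallel_map_reduce(tree, data, combine, chunk_size=4, workers=4):
    """Map the tree over chunks in a thread pool, then reduce the results."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(lambda c: map_chunk(tree, c), chunks))
    partials = [reduce(combine, m) for m in mapped]  # one partial result per chunk
    return reduce(combine, partials)

tree = ("add", ("mul", "x", 2), 1)  # represents 2*x + 1
total = parallel_map_reduce(tree, list(range(10)), operator.add)
```

The chunks are independent, which is what makes the map step parallelisable. The combining function has to be associative for the chunked reduce to give the same answer as a sequential one. Threads in CPython will not speed up this arithmetic; the pool only shows the shape of the computation.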
Of course, I can't talk about abstract syntax trees without once again mentioning syntactic macros
Anyway... Awesome work and great video Charles.
BTW: Charles, you need to get a secondary job at MSR being "social glue"! We need to get all these academics down to the bar to mix their ideas.
How does programming with this library compare with other concurrent languages such as Occam-pi?
BTW Is that a pyramid of 14 patent cubes in the background??
- Jonathan
Thanks for the video. It should be noted, however, that one can create an array "computation" type library today. It is merely a matter of syntax and abstracting away the loops from the end user of your functions. Whether it would take advantage of a multi-core system setup is another thing entirely, so good work here.
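As a rough illustration of "abstracting away the loops", the Python sketch below wraps a list in a small class whose operators work element-wise. The `Arr` class is hypothetical and is not how Accelerator is implemented; it only shows that the caller never writes a loop.

```python
class Arr:
    """Hypothetical array wrapper: element-wise operators hide the loops."""

    def __init__(self, values):
        self.values = list(values)

    def _elementwise(self, other, fn):
        # Array-with-array: combine matching elements.
        if isinstance(other, Arr):
            if len(other.values) != len(self.values):
                raise ValueError("length mismatch")
            return Arr(fn(a, b) for a, b in zip(self.values, other.values))
        # Array-with-scalar: apply the scalar to every element.
        return Arr(fn(a, other) for a in self.values)

    def __add__(self, other):
        return self._elementwise(other, lambda a, b: a + b)

    def __mul__(self, other):
        return self._elementwise(other, lambda a, b: a * b)

    def sum(self):
        return sum(self.values)

a = Arr([1, 2, 3])
b = Arr([10, 20, 30])
c = (a + b) * 2   # no explicit loop in user code
```

This version still runs every loop eagerly on one core, which is the gap the comment points out: the syntax is easy, and deciding where and how the loops execute is the hard part.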
The brief GPU discussion/explanation was also interesting. I'm curious, however, as to what the conversion routines between data-parallel arrays and regular arrays look like. Perhaps I should check out the SDK? Is much code shared in the SDK or is it pretty much a "black box" type approach?
Play with it!
C
I like the idea of having a library that makes general purpose computation on GPU really easy.
However, the example used in the video of the 5x5 convolve is both illusory and hints at the core problem with the approach to making parallelism easy to code in the wider world.
If you load a big (1600x1200) image into Photoshop and do a radius 5 Gaussian blur, you're looking at about 1 second of processing even on my relatively old PC (AMD Athlon XP 2000). And that's because Adobe have hand-optimized their filter routines using the most efficient approach running on a CPU. That is, it performs the whole matrix operation on a small area of the image at a time, it 'touches' areas of memory before it needs them so they'll be in the cache when it does need them, and of course it uses MMX/SSE to exploit the small amount of SIMD power current CPUs have.
The routine shown to us in the video takes a different approach. It composits the whole image repeatedly, offset by a given number of pixels each time. That's the definitive way to perform that operation on a GPU, but it's devastatingly inefficient to do that on a CPU compared to the conventional way of doing it (as shown by Adobe).
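The two strategies being contrasted can be sketched in plain Python as follows. This is not the code from the video or from Photoshop; the function names, the zero padding at the border, and the test image are assumptions made for illustration. The kernel is applied without flipping, which makes no difference for the symmetric box filter used here.

```python
def shift(img, dy, dx):
    """Return a copy where out[y][x] = img[y+dy][x+dx], zero outside the image."""
    h, w = len(img), len(img[0])
    return [[img[y + dy][x + dx] if 0 <= y + dy < h and 0 <= x + dx < w else 0
             for x in range(w)] for y in range(h)]

def convolve_by_compositing(img, kernel):
    """GPU-style: accumulate one whole shifted image per kernel tap."""
    h, w = len(img), len(img[0])
    r = len(kernel) // 2
    acc = [[0] * w for _ in range(h)]
    for ky in range(-r, r + 1):
        for kx in range(-r, r + 1):
            shifted = shift(img, ky, kx)  # a full-size intermediate image
            k = kernel[ky + r][kx + r]
            acc = [[acc[y][x] + k * shifted[y][x] for x in range(w)]
                   for y in range(h)]
    return acc

def convolve_per_pixel(img, kernel):
    """CPU-style: visit each output pixel once and read its neighbourhood."""
    h, w = len(img), len(img[0])
    r = len(kernel) // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0
            for ky in range(-r, r + 1):
                for kx in range(-r, r + 1):
                    yy, xx = y + ky, x + kx
                    if 0 <= yy < h and 0 <= xx < w:
                        total += kernel[ky + r][kx + r] * img[yy][xx]
            out[y][x] = total
    return out

image = [[(3 * y + 5 * x) % 7 for x in range(8)] for y in range(6)]
kernel = [[1] * 5 for _ in range(5)]  # 5x5 box filter; integer weights keep results exact

composited = convolve_by_compositing(image, kernel)
per_pixel = convolve_per_pixel(image, kernel)
```

Both functions produce the same image. The difference is in memory traffic: the compositing version reads and writes the whole image 25 times, while the per-pixel version touches a small neighbourhood at a time.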
Now it was kind-of glossed over in the video, but I believe the interviewees were saying that they are trying to come up with a way of making data-level parallelism easy to code for both GPUs and multi-core scenarios. They also touted their library approach as being a lot simpler than the conventional high-performance computing approach of having a special compiler pick apart the loops in the problem and work out what to run where. They state that by encoding higher-level operations in library calls, the intention of the program is encoded and the library then works out what to do.
The problem there is that the intention here - to perform a 5x5 convolve by repeatedly compositing an offset image - is right for the GPU, but wrong for the CPU.
Now I suppose you could be really clever with your deferred computation and 'unpick' the intention from the series of composition calls that the nested for loops in the example produce and then work out a more efficient way to execute them on a CPU. But that's likely to only work in limited situations, where the same operation is executed over and over. I think it would be better to admit that there's no way to succinctly encode the intention of a program to a computer (this is a problem that mathematicians have grappled with since before there were computers) and just concentrate on producing useful libraries for the two different scenarios. But hey, you're the researchers!
I think the exact same thing every time. A lot of projects with overlapping goals. I know Microsoft is a big company, but a keyword searchable database of current/past projects might do wonders.
I'm not sure I fully understand the problem here. Programming concurrent applications is hard and there is no single silver bullet to make it easy (it's a hard problem). Accelerator is but one approach to a specific subset of the problem, just as Software Transactional Memory (video to appear on C9 next week) and language-level solutions are. I am just investigating what's being done around the company to address this important programming topic.
Can you elaborate more on what you see as the problem with this approach? I'm open to suggestions, as always. In fact, I'd love some more feedback.
C
(No I don't have dual cards, I just like the idea!)
Sorry that must have come out wrong. I'm not complaining that there are videos covering the similar topics.
I'm just surprised that projects with overlapping goals are unaware of each other. It seems to me that the CCR team and the Accelerator team might be able to share some useful information with each other. I think it's great that you are able to drop in the recommendation that they take a look at each other's solutions. Again, I'm just surprised it doesn't happen automatically.
However, and let this be my disclaimer, I'm not a Microsoft insider and further have no clue how these things work.
Yes, staged computation is definitely an interesting way to go. As you point out, some of the work done by the library could be done at "compile-time" (or at least earlier than it currently is).
In general, this would also fit with the LINQ work that is going on. There is some interesting work by Don Syme on connecting F# (another MSR project) to Accelerator. See Leveraging .NET Meta-programming Components from F#: Integrated Queries and Interoperable Heterogeneous Execution, to be published at the ML Workshop, 2006, Portland, Oregon, available from Don's Web page at http://research.microsoft.com/~dsyme/publications.aspx, where he describes connecting F# to Accelerator.
Actually, if we computed all the intermediate arrays implied by the high-level code, performance would be disastrous on the GPU too, because you'd use way too much memory bandwidth and destroy the spatial locality.
All of the C# for-loops end up unrolled and you end up with one large expression graph being passed to the library. The graph would imply lots of intermediate arrays being computed.
We actually convert the graph to something of the following form:
1. For each output pixel of the convolution, execute a sequential piece of code.
2. The sequential piece of code fetches the neighboring pixels and adds them together.
The sequential piece of code corresponds to the body of the pixel shader. Now, if you want good performance, you need to traverse the output pixels in the correct order to preserve spatial locality. Fortunately, the GPU traverses the output pixels in a reasonable order (these are 2-D images, after all).
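The idea described above can be sketched in Python roughly as follows. This is not Accelerator's implementation: the `Node`, `Input`, `Shift`, and `Add` classes are hypothetical stand-ins for a deferred expression graph. The host-language loops only build the graph, and evaluation then runs one small sequential computation per output pixel without materialising any intermediate arrays.

```python
class Node:
    """Hypothetical deferred-expression node; nothing is computed on construction."""
    def __add__(self, other):
        return Add(self, other)

class Input(Node):
    def __init__(self, data):
        self.data = data
    def at(self, y, x):
        h, w = len(self.data), len(self.data[0])
        return self.data[y][x] if 0 <= y < h and 0 <= x < w else 0  # zero outside

class Shift(Node):
    def __init__(self, src, dy, dx):
        self.src, self.dy, self.dx = src, dy, dx
    def at(self, y, x):
        return self.src.at(y + self.dy, x + self.dx)

class Add(Node):
    def __init__(self, left, right):
        self.left, self.right = left, right
    def at(self, y, x):
        return self.left.at(y, x) + self.right.at(y, x)

def evaluate(graph, h, w):
    """One pass over the output pixels; each pixel runs the fused expression."""
    return [[graph.at(y, x) for x in range(w)] for y in range(h)]

img = Input([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
graph = None
for dy in (-1, 0, 1):          # these loops are unrolled into a single graph
    for dx in (-1, 0, 1):
        term = Shift(img, dy, dx)
        graph = term if graph is None else graph + term

result = evaluate(graph, 3, 3)  # 3x3 neighbourhood sums
```

Here `graph.at(y, x)` plays the role of the pixel shader body: it fetches the neighbouring pixels and adds them together for one output location.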
Details of how we do this are described in our technical report (accessible from the Accelerator Wiki). The TR will soon be superseded by a paper that will appear in ASPLOS '06 that we hope does a better job of describing the details.
You are correct that it is quite difficult to capture the "intention" of a programmer. Our point was simple, which is that a good start would be to avoid over-specifying the behavior of the program, which is what happens if you write the code in C/C++ using for loops that specify the exact order in which individual array elements are accessed. One must wonder why Adobe had to hand-code the blocking that you describe and why a compiler couldn't do that. The answer, as you allude to, is that in the conventional high-performance computing approach, the compiler has to do some pretty heroic stuff.
To argue the other side, you could say that our approach results in a program that is too underspecified ... the area in between overspecified and underspecified is the interesting area to investigate.
No, Accelerator can't use both in parallel. We wish we could.
It's a neat idea, but it's a harder problem because it changes the hierarchy of the memory. You have to figure out how to partition the program across that hierarchy. With a single GPU, you are accessing the local memory on the graphics card, which is very high-bandwidth (>50 GB/s). With multiple GPUs, unless you partition the problem just right, you may need to access memory on another card across the bus. PCI-Express is fast, but not nearly as fast as the memory on the graphics card.
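A back-of-the-envelope calculation shows the size of that gap. The 50 GB/s figure is the one quoted above and the 1600x1200 image size is borrowed from earlier in the thread; the 4 GB/s bus figure is an assumption (roughly the peak of a first-generation PCI-Express x16 link in one direction), as is the choice of four 32-bit float channels per pixel.

```python
def transfer_ms(megabytes, gigabytes_per_second):
    """Milliseconds to move `megabytes` at the given bandwidth (1 GB = 1000 MB).

    MB divided by GB/s is 1e6 bytes / (1e9 bytes/s) = 1e-3 s, i.e. milliseconds.
    """
    return megabytes / gigabytes_per_second

# 1600x1200 pixels, 4 float channels per pixel, 4 bytes per float (assumed layout).
image_mb = 1600 * 1200 * 4 * 4 / 1e6

local_ms = transfer_ms(image_mb, 50.0)  # on-card memory, figure from the comment
bus_ms = transfer_ms(image_mb, 4.0)     # assumed PCI-Express x16 peak

ratio = bus_ms / local_ms
```

Under these assumptions a single pass over the image costs well under a millisecond from local memory and several milliseconds across the bus, so a partitioning that forces frequent cross-card reads can easily give back the gain from the second GPU.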
(or perhaps you find the limited PS instruction set easier to start out with)
Parallel data and parallel instructions are two different beasts, I guess. Trying to operate on a single dataset from multiple processors causes all kinds of memory/cache issues. When you can split the data up and work independently, then it's fine. However, when you can't, the only performant way to operate is on one processor, in this case taking advantage of the data parallelism inside a single GPU.
Of course, I'm not an expert by any means in this area... hopefully the boffins at MSR are finding clever solutions to these tricky problems.
You're quite right, rhm, that different target platforms have different issues, and that you have to adapt the structure of your program to your processor if you want the comparison to be meaningful. I can assure you that in our convolution benchmark, the CPU version we compare against is quite clever about how it iterates.
For our multi-core backend, we are indeed being as ambitious as you suggest. Our goal is to tailor the loop ordering to suit the machine. There have been decades of research into automatic loop transformations (strip-mining, tiling, skewing, ...), so the idea of doing this in a compiler isn't novel. As David points out, the advantage we have is that the program is specified at a higher level, so we don't have to burn cycles trying to figure out which transformations we can legally apply without breaking a data dependency.
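Loop tiling, one of the transformations mentioned above, can be sketched in Python as below. The function names and the tile size are hypothetical. The point is that for an element-wise operation there is no data dependency between outputs, so the library is free to pick any traversal order that suits the cache.

```python
def row_major_order(h, w):
    """The order a plain nested for-loop would visit the indices in."""
    return [(y, x) for y in range(h) for x in range(w)]

def tiled_order(h, w, tile):
    """Visit the same index space tile by tile (loop tiling / blocking)."""
    order = []
    for ty in range(0, h, tile):
        for tx in range(0, w, tile):
            for y in range(ty, min(ty + tile, h)):
                for x in range(tx, min(tx + tile, w)):
                    order.append((y, x))
    return order

def run(order, src):
    """An element-wise operation: each output depends only on its own input."""
    h, w = len(src), len(src[0])
    out = [[None] * w for _ in range(h)]
    for y, x in order:
        out[y][x] = src[y][x] * 2 + 1
    return out

src = [[y * 10 + x for x in range(7)] for y in range(5)]
plain = run(row_major_order(5, 7), src)
tiled = run(tiled_order(5, 7, tile=4), src)
```

Both orders give the same result. With C-style loops a compiler has to prove that reordering is legal; with a data-parallel operation the legality is known from the way the program is written.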
Sidd
Hakime -
The libraries that you mention are pre-compiled functions that use short-vector instruction sets (such as SSE3 or Altivec). For example, they include a function that does convolution. In contrast, Accelerator provides you with primitive operations that are a level below a domain-specific library function. For example, you can do element-wise addition of 2 data-parallel arrays of 1 or 2-dimensions. These operations can be used to construct domain-specific library functions, such as the convolution function.
We have a paper available on our Wiki that describes in detail the kinds of primitives that Accelerator provides and the compilation approach that we use to generate reasonably efficient GPU code.
The point of Accelerator is to use data-parallelism to provide an easier way of programming GPUs and multi-cores, not to provide a set of domain-specific libraries.
You are correct that single-precision arithmetic will limit the use of GPUs for scientific computation. However, there are still lots of interesting things that you can do. You can look at http://www.gpgpu.org for more information (under "categories", look at "scientific computation"). There has also been some recent work on emulating double-precision floating point numbers using single-precision floating point numbers.
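One common way to emulate extra precision is to represent each value as an unevaluated sum of two singles (a "hi" and a "lo" word) and use error-free additions such as Knuth's TwoSum. The Python sketch below is only an illustration of that general technique, not the specific work referred to above: `f32` simulates single-precision rounding by packing through `struct`, and `ds_add` is a simple, hypothetical pair addition.

```python
import struct

def f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def two_sum(a, b):
    """Knuth's TwoSum in single precision: s + e equals a + b exactly."""
    s = f32(a + b)
    bb = f32(s - a)
    e = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, e

def quick_two_sum(a, b):
    """Renormalise a (hi, lo) pair; assumes |a| >= |b|."""
    s = f32(a + b)
    e = f32(b - f32(s - a))
    return s, e

def ds_add(x, y):
    """Add two (hi, lo) pairs of singles."""
    s, e = two_sum(x[0], y[0])
    e = f32(e + f32(x[1] + y[1]))
    return quick_two_sum(s, e)

tiny = f32(2.0 ** -30)  # far below the single-precision spacing near 1.0

plain = f32(1.0)
pair = (f32(1.0), f32(0.0))
for _ in range(1024):
    plain = f32(plain + tiny)         # each addition rounds straight back to 1.0
    pair = ds_add(pair, (tiny, 0.0))  # the low word keeps the bits that were lost
```

After the loop the plain single is still 1.0, while the pair has accumulated all 1024 increments. The cost is several single-precision operations per emulated addition, and the technique relies on the hardware rounding each operation predictably.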
Thanks!
And then a comment: my problem with offloading stuff to the GPU is that the numerical environment is a joke. You don't know anything about the radix, the range of fp values, whether +, -, *, /, and sqrt follow the sane rounding rules of IEEE, how to control any reorderings or fusions (e.g. a*b+c -> fma(a,b,c)) that are allowed to take place, NaNs, -0, Inf (if so, affine or projective?), what happens on overflow/underflow/etc., all the nice functions in the latest draft IEEE standard or C99, controlling directed rounding, etc., etc., etc.
It's fine for doing things like CoreImage or accelerating game physics, but not for some of the things I'd love to offload that require careful analysis. I hope you guys nag the DX people (at least) for some pragmas or mode that will tighten up the fp environment and/or the ability to set/query anything interesting (see limits.h or float.h from C99 as an example).
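To show why fusions and reorderings matter for careful numerical work, the Python sketch below compares a multiply followed by an add against an emulated fused multiply-add, and two groupings of the same sum. It runs in ordinary double precision on the CPU, not on a GPU, and `fused_multiply_add` is a hypothetical helper that uses exact rational arithmetic to round only once.

```python
from fractions import Fraction

def fused_multiply_add(a, b, c):
    """Emulate fma: compute a*b + c exactly, then round once to a double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

a = 1.0 + 2.0 ** -27
b = 1.0 - 2.0 ** -27
c = -1.0

unfused = a * b + c                  # the product is rounded before the add
fused = fused_multiply_add(a, b, c)  # a single rounding at the end

# Reordering: floating-point addition is not associative.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
```

The exact product is 1 - 2^-54, which rounds to 1.0 when stored as a double, so the unfused result is 0.0 while the fused result is -2^-54. Neither answer is wrong in isolation, but an analysis that assumes one behaviour breaks if the hardware or compiler silently picks the other.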
And the use of functional type stuff scares me. Do you guys automatically break data down into smaller tiles to keep the memory usage more manageable?
When I run your Life program then switch to the task manager the code errors out:
=====================================
See the end of this message for details on invoking
just-in-time (JIT) debugging instead of this dialog box.
************** Exception Text **************
Error in the application.
-2005530520 (D3DERR_DEVICELOST)
at Microsoft.DirectX.Direct3D.Device.GetRenderTargetData(Surface renderTarget, Surface destSurface)
at AcceleratorDX.DxMachine.ConvertStreamToArray(AcceleratorFloatStream s1, Single[,]& afl)
at AcceleratorDX.DxMachine.ConvertStreamToBitmap(AcceleratorStream s1, Bitmap& bmp)
at Microsoft.Research.DataParallelArrays.ParallelArrays.ToBitmap(FloatParallelArray a, Bitmap& bm)
at LifeDemo.Display(Graphics g, Rectangle rc) in C:\Program Files\Microsoft\Accelerator\samples\life.cs:line 61
at LifeWindowsForm.OnPaint(PaintEventArgs e) in C:\Program Files\Microsoft\Accelerator\samples\life.cs:line 152
at System.Windows.Forms.Control.PaintWithErrorHandling(PaintEventArgs e, Int16 layer, Boolean disposeEventArgs)
at System.Windows.Forms.Control.WmPaint(Message& m)
at System.Windows.Forms.Control.WndProc(Message& m)
at System.Windows.Forms.ScrollableControl.WndProc(Message& m)
at System.Windows.Forms.ContainerControl.WndProc(Message& m)
at System.Windows.Forms.Form.WndProc(Message& m)
at System.Windows.Forms.Control.ControlNativeWindow.OnMessage(Message& m)
at System.Windows.Forms.Control.ControlNativeWindow.WndProc(Message& m)
at System.Windows.Forms.NativeWindow.Callback(IntPtr hWnd, Int32 msg, IntPtr wparam, IntPtr lparam)
************** Loaded Assemblies **************
mscorlib
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.42 (RTM.050727-4200)
CodeBase: file:///C:/WINDOWS/Microsoft.NET/Framework/v2.0.50727/mscorlib.dll
----------------------------------------
Life
Assembly Version: 0.0.0.0
Win32 Version: 0.0.0.0
CodeBase: file:///C:/Program%20Files/Microsoft/Accelerator/samples/Life/bin/Debug/Life.exe
----------------------------------------
System.Windows.Forms
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.42 (RTM.050727-4200)
CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/System.Windows.Forms/2.0.0.0__b77a5c561934e089/System.Windows.Forms.dll
----------------------------------------
System
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.42 (RTM.050727-4200)
CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/System/2.0.0.0__b77a5c561934e089/System.dll
----------------------------------------
System.Drawing
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.42 (RTM.050727-4200)
CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/System.Drawing/2.0.0.0__b03f5f7f11d50a3a/System.Drawing.dll
----------------------------------------
System.Configuration
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.42 (RTM.050727-4200)
CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/System.Configuration/2.0.0.0__b03f5f7f11d50a3a/System.Configuration.dll
----------------------------------------
System.Xml
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.42 (RTM.050727-4200)
CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/System.Xml/2.0.0.0__b77a5c561934e089/System.Xml.dll
----------------------------------------
Accelerator
Assembly Version: 1.1.1.2
Win32 Version: 1.1.1.2
CodeBase: file:///C:/Program%20Files/Microsoft/Accelerator/samples/Life/bin/Debug/Accelerator.DLL
----------------------------------------
Microsoft.DirectX.Direct3D
Assembly Version: 1.0.2902.0
Win32 Version: 9.05.132.0000
CodeBase: file:///C:/WINDOWS/assembly/GAC/Microsoft.DirectX.Direct3D/1.0.2902.0__31bf3856ad364e35/Microsoft.DirectX.Direct3D.dll
----------------------------------------
Microsoft.DirectX
Assembly Version: 1.0.2902.0
Win32 Version: 5.04.00.2904
CodeBase: file:///C:/WINDOWS/assembly/GAC/Microsoft.DirectX/1.0.2902.0__31bf3856ad364e35/Microsoft.DirectX.dll
----------------------------------------
Microsoft.DirectX.Direct3DX
Assembly Version: 1.0.2906.0
Win32 Version: 9.07.239.0000
CodeBase: file:///C:/WINDOWS/assembly/GAC/Microsoft.DirectX.Direct3DX/1.0.2906.0__31bf3856ad364e35/Microsoft.DirectX.Direct3DX.dll
----------------------------------------
Microsoft.VisualC
Assembly Version: 8.0.0.0
Win32 Version: 8.00.50727.42
CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/Microsoft.VisualC/8.0.0.0__b03f5f7f11d50a3a/Microsoft.VisualC.dll
----------------------------------------
************** JIT Debugging **************
To enable just-in-time (JIT) debugging, the .config file for this
application or computer (machine.config) must have the
jitDebugging value set in the system.windows.forms section.
The application must also be compiled with debugging
enabled.
For example:
<configuration>
<system.windows.forms jitDebugging="true" />
</configuration>
When JIT debugging is enabled, any unhandled exception
will be sent to the JIT debugger registered on the computer
rather than be handled by this dialog box.
=====================================
Also, where is the code to monitor CPU and GPU performance, and how do I enable it? Are there any managed C++ samples yet?
Thanks,
Chuck
In light of projects like LINQ offering up expression trees that can now be interpreted and compiled into a completely different language and/or transferred off to be executed on a totally diff. piece of hardware, I'm kinda hoping this project picked up with that and basically implemented LINQ to GPUs.
I started wondering when we'd see this with specific respect to WPF and a true approach to writing custom shader effects when I realized that LINQ could enable this kind of capability. I finally got around to writing a blog post about it and somebody alerted me to this project.
Anyway, love to hear where the project stands!
Cheers,
Drew
super awesome

I'd just like to bump this thread and ask how this project is doing now.
How is this related to the shader stuff in WPF that's coming up?
Charles, a new interview with these guys would be so cool.
@aL_
+1
With LINQ one can write
var grayImage = GPU.Compute(image,(Float4 color)=> {
return color.R * 0.4 + color.G * 0.3 + color.B * 0.3;
});
without Accelerator's array proxies.
With improved Expressions in v4.0 link one can expect a very convenient API...
Any news?