Watching this video, I can't help thinking that there is nothing new here, and I'm quite surprised to see that Microsoft is so late to this.
I mean, Apple has been working for many years on SIMD programming APIs that provide data parallelism for image processing, scientific applications, signal processing, math computing, and so on. This API is called the Accelerate framework, and it does all the work for the developer.
The libraries that you mention are pre-compiled functions that use short-vector instruction sets (such as SSE3 or AltiVec). For example, they include a function that does convolution. In contrast, Accelerator provides you with primitive operations that are a level below a domain-specific library function. For example, you can do element-wise addition of two one- or two-dimensional data-parallel arrays. These operations can be used to construct domain-specific library functions, such as the convolution.
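To make that distinction concrete, here is a sketch of how a convolution can be built out of nothing but whole-array primitives (shift, scale, element-wise add). This uses NumPy as a stand-in, not Accelerator's actual API, and the function name is made up:

```python
import numpy as np

def convolve_shift_add(image, kernel):
    """Build a 2-D convolution from data-parallel primitives only:
    shift the whole image by each kernel offset, scale it by that
    kernel weight, and accumulate with element-wise adds."""
    kh, kw = kernel.shape
    result = np.zeros_like(image, dtype=float)
    for dy in range(kh):
        for dx in range(kw):
            # np.roll shifts the entire array at once; each iteration is
            # one whole-array (data-parallel) operation, not a per-pixel loop.
            shifted = np.roll(image, (dy - kh // 2, dx - kw // 2), axis=(0, 1))
            result += kernel[dy, dx] * shifted
    return result
```

Every step here is an operation on an entire array, which is exactly the shape of computation a GPU (or a library like Accelerator) can map onto data-parallel hardware.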
We have a paper available on our Wiki that describes in detail the kinds of primitives that Accelerator provides and the compilation approach that we use to generate reasonably efficient GPU code.
The point of Accelerator is to use data parallelism to provide an easier way of programming GPUs and multi-cores, not to provide a set of domain-specific libraries.
You are correct that single-precision arithmetic will limit the use of GPUs for scientific computation. However, there are still lots of interesting things that you can do. You can look at http://www.gpgpu.org for more information (under "categories", look at "scientific computation"). There has also been some recent work on emulating double-precision floating-point numbers using single-precision floating-point operations.
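For the curious, that emulation work builds on "error-free transformations" such as Knuth's two-sum, where a value is carried as an unevaluated pair of single-precision floats. Here is a minimal illustrative sketch (NumPy `float32` standing in for the GPU's single precision; this is the general trick, not any particular paper's code):

```python
import numpy as np

def two_sum(a, b):
    # Knuth's error-free transformation: s is the rounded float32 sum,
    # e is the exact rounding error, so s + e == a + b exactly.
    s = np.float32(a) + np.float32(b)
    bb = s - np.float32(a)
    e = (np.float32(a) - (s - bb)) + (np.float32(b) - bb)
    return s, e

def ds_add(x, y):
    # A "double-single" number is an unevaluated pair (hi, lo) of
    # single-precision floats; adding two of them keeps the extra bits
    # that a plain float32 add would round away.
    s, e = two_sum(x[0], y[0])
    e = e + (x[1] + y[1])
    return two_sum(s, e)  # renormalize into (hi, lo)
```

The point is that nothing here needs double-precision hardware: every operation is a single-precision add or subtract, which is why the scheme is attractive on GPUs.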
What happens for those lucky people with dual video cards? Can Accelerator use both in parallel?
(No I don't have dual cards, I just like the idea!)
No, Accelerator can't use both in parallel. We wish we could!
It's a neat idea, but it's a harder problem because it changes the hierarchy of the memory. You have to figure out how to partition the program across that hierarchy. With a single GPU, you are accessing the local memory on the graphics card, which is very
high-bandwidth (>50 GB/s). With multiple GPUs, unless you partition the problem just right, you may need to access memory on another card across the bus. PCI-Express is fast, but not nearly as fast as the memory on the graphics card.
It's an interesting project and an interesting video.
I like the idea of having a library that makes general purpose computation on GPU really easy.
However, the example used in the video of the 5x5 convolve is illusory, and it hints at the core problem with this approach to making parallelism easy to code in the wider world.
If you load a big (1600x1200) image into Photoshop and do a radius-5 Gaussian blur, you're looking at about 1 second of processing even on my relatively old PC (AMD Athlon XP 2000). And that's because Adobe have hand-optimized their filter routines using the most efficient approach for a CPU. That is, it performs the whole matrix operation on a small area of the image at a time, it 'touches' areas of memory before it needs them so they'll be in the cache when it does need them, and of course it uses MMX/SSE to exploit the small amount of SIMD power current CPUs have.
The routine shown to us in the video takes a different approach. It composites the whole image repeatedly, offset by a given number of pixels each time. That's the definitive way to perform the operation on a GPU, but it's devastatingly inefficient on a CPU compared to the conventional way of doing it (as shown by Adobe).
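For contrast, the conventional CPU formulation touches each output pixel once and walks only its small neighborhood, which is what keeps the working set in cache. A plain-Python sketch of that structure (illustrating the access pattern, not Adobe's actual code):

```python
import numpy as np

def convolve_direct(image, kernel):
    """Direct convolution: one pass over the output pixels, each
    gathering its kh x kw neighborhood. The per-pixel working set is
    tiny, so a blocked/tiled traversal keeps it in cache -- unlike
    compositing kh*kw full-size offset copies of the image."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(kh):
                for dx in range(kw):
                    acc += kernel[dy, dx] * image[(y + dy - kh // 2) % h,
                                                  (x + dx - kw // 2) % w]
            out[y, x] = acc
    return out
```

(Edges are handled by wraparound here just to keep the sketch short.) Both formulations compute the same answer; the difference is purely in how memory is traversed, which is the crux of the GPU-vs-CPU argument.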
Now it was kind-of glossed over in the video, but I believe the interviewees were saying that they are trying to come up with a way of making data-level parallelism easy to code for both GPUs and multi-core scenarios. They also touted their library approach
as being a lot simpler than the conventional high-performance computing approach of having a special compiler pick apart the loops in the problem and work out what to run where. They state that by encoding higher level operations in library calls, that the
intention of the program is encoded and the library then works out what to do.
The problem there is that the intention here - to perform a 5x5 convolve by repeatedly compositing an offset image - is right for the GPU, but wrong for the CPU.
Now I suppose you could be really clever with your deferred computation and 'unpick' the intention from the series of compositing calls that the nested for loops in the example produce, and then work out a more efficient way to execute them on a CPU. But that's likely to work only in limited situations, where the same operation is executed over and over. I think it would be better to admit that there's no way to succinctly encode the intention of a program to a computer (a problem that mathematicians have grappled with since before there were computers) and just concentrate on producing useful libraries for the two different scenarios. But hey, you're the researchers!
Actually, if we computed all the intermediate arrays implied by the high-level code, performance would be disastrous on the GPU too, because you'd use way too much memory bandwidth and destroy the spatial locality.
All of the C# for-loops end up unrolled and you end up with one large expression graph being passed to the library. The graph would imply lots of intermediate arrays being computed.
We actually convert the graph to something of the following form:
1. For each output pixel of the convolution, execute a sequential piece of code.
2. The sequential piece of code fetches the neighboring pixels and adds them together.
The sequential piece of code corresponds to the body of the pixel shader. Now, if you want good performance, you need to traverse the output pixels in the correct order to preserve spatial locality. Fortunately, the GPU traverses the output pixels in a
reasonable order (these are 2-D images, after all).
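A toy version of that pipeline (all names hypothetical, not Accelerator's real implementation): record shifts, scales, and adds as a deferred expression graph, flatten the graph into a list of (offset, weight) "taps", then evaluate all taps in one per-pixel pass -- the pixel-shader body -- with no intermediate arrays:

```python
import numpy as np

# Toy deferred graph nodes: building them computes nothing.
class Input:
    def taps(self, dy=0, dx=0, w=1.0):
        return [(dy, dx, w)]

class Shift:
    def __init__(self, e, dy, dx): self.e, self.dy, self.dx = e, dy, dx
    def taps(self, dy=0, dx=0, w=1.0):
        return self.e.taps(dy + self.dy, dx + self.dx, w)

class Scale:
    def __init__(self, e, c): self.e, self.c = e, c
    def taps(self, dy=0, dx=0, w=1.0):
        return self.e.taps(dy, dx, w * self.c)

class Add:
    def __init__(self, a, b): self.a, self.b = a, b
    def taps(self, dy=0, dx=0, w=1.0):
        return self.a.taps(dy, dx, w) + self.b.taps(dy, dx, w)

def evaluate(expr, arr):
    """Fuse the whole graph into one pass: flattening to taps is the
    'compilation'; the inner sum is the pixel-shader body, fetching
    neighbors and combining them without materializing any
    intermediate arrays."""
    taps = expr.taps()
    h, w = arr.shape
    out = np.empty((h, w))
    for y in range(h):
        for x in range(w):
            out[y, x] = sum(wt * arr[(y + dy) % h, (x + dx) % w]
                            for dy, dx, wt in taps)
    return out
```

However deep the graph gets, the generated "shader" still makes exactly one pass over the output, which is what protects the memory bandwidth.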
Details of how we do this are described in our technical report (accessible from the Accelerator Wiki). The TR will soon be superseded by a paper that will appear in ASPLOS '06 that we hope does a better job of describing the details.
You are correct that it is quite difficult to capture the "intention" of a programmer. Our point was simply that a good start would be to avoid over-specifying the behavior of the program, which is what happens if you write the code in C/C++ using
for loops that specify the exact order in which individual array elements are accessed. One must wonder why Adobe had to hand-code the blocking that you describe and why a compiler couldn't do that. The answer, as you allude to, is that in the conventional
high-performance computing approach, the compiler has to do some pretty heroic stuff.
To argue the other side, you could say that our approach results in a program that is too underspecified ... the area in between overspecified and underspecified is the interesting one to investigate.
Deferred evaluation is definitely a very interesting subject. The work being done with LINQ is along the same lines. The compiler generates an expression tree that can then be passed around as data and transformed before evaluation. I wonder if it's possible to take expression trees generated by LINQ and transform them into parallelisable computations. I suppose it really comes down to "map" and "reduce" functions in the end. Whilst you are kind of limited to pure arithmetic operations on the GPU, the future of multi-cores certainly could widen the scope.
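The map/reduce observation is worth spelling out: if the reduction operator is associative, the runtime is free to split the data, reduce each chunk independently, and combine the partial results. A small Python sketch of that freedom (function name made up; threads stand in for cores or a GPU):

```python
from functools import reduce
from concurrent.futures import ThreadPoolExecutor

def parallel_map_reduce(f, op, items, chunks=4):
    """Apply f to every item and fold with op. Because op is assumed
    associative, each chunk can be reduced independently and the
    partial results combined at the end -- the runtime, not the
    programmer, decides the partitioning."""
    n = max(1, len(items) // chunks)
    parts = [items[i:i + n] for i in range(0, len(items), n)]
    def reduce_chunk(part):
        return reduce(op, map(f, part))
    with ThreadPoolExecutor() as ex:
        partials = list(ex.map(reduce_chunk, parts))
    return reduce(op, partials)
```

Expressing the computation this way is precisely what avoids over-specifying the evaluation order, which is the property the Accelerator folks are after.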
Of course, I can't talk about abstract syntax trees without once again mentioning syntactic macros
It would be interesting to look at using syntactic macros to perform staged computation. I'm sure some of the parallelising of operations can be decided at compile time. That could make for even more performance
increases since you can take some weight off the JIT compiler.
Yes, staged computation is definitely an interesting way to go. As you point out, some of the work done by the library could be done at "compile time" (or at least earlier than it currently is).
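As a tiny illustration of the staging idea (plain Python closures standing in for real compile-time macros; the names are made up): when the filter weights are known early, the tap loop can be unrolled and zero weights dropped in stage one, leaving a leaner loop for stage two.

```python
def stage_convolve(kernel):
    """Stage 1 ('compile time'): the kernel weights are known, so
    unroll the tap loop into a flat list of (offset, weight) pairs
    and drop zero weights entirely."""
    taps = [(dy - len(kernel) // 2, dx - len(kernel[0]) // 2, w)
            for dy, row in enumerate(kernel)
            for dx, w in enumerate(row) if w != 0]

    def run(image):
        # Stage 2 (runtime): only the pixel loop remains; the kernel
        # structure has already been "compiled away". Edges wrap.
        h, wd = len(image), len(image[0])
        return [[sum(w * image[(y + dy) % h][(x + dx) % wd]
                     for dy, dx, w in taps)
                 for x in range(wd)] for y in range(h)]
    return run
```

With syntactic macros the first stage would happen at actual compile time, which is exactly the "take some weight off the JIT" point above.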
Having a separate GC per SIP lets you choose an appropriate garbage collector for each application. For example, some GCs are real-time (they have low pause times), whereas other GCs emphasize throughput at the expense of higher pause times. You might want to use a real-time collector for streaming video and a GC with higher throughput for a compute-intensive application.
In addition, it allows each process to be garbage collected separately, increasing the scalability and robustness of the system. For example, you could run multiple garbage collections at the same time. You could also avoid a denial-of-service attack where
someone is allocating lots of data, causing other processes to slow down because one garbage collector for the whole system can't keep up.
Finally, it simplifies tracking resource usage and reclaiming resources when a process ends.