Programming in the Age of Concurrency: The Accelerator Project
Aug 29, 2006 at 11:32 PM
Hakime -
The libraries that you mention are pre-compiled functions that use short-vector instruction sets (such as SSE3 or Altivec). For example, they include a function that does convolution. In contrast, Accelerator provides you with primitive operations that are a level below a domain-specific library function. For example, you can do element-wise addition of two data-parallel arrays of one or two dimensions. These operations can be used to construct domain-specific library functions, such as the convolution function.
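To make the "level below a library function" point concrete, here is a minimal sketch in plain Python, not Accelerator's actual API. The names `add`, `scale`, `shift`, and `convolve_1d` are hypothetical, and edge clamping is an assumption chosen for the illustration. It builds a 1-D convolution purely out of whole-array primitives:

```python
def add(a, b):
    """Element-wise addition of two equal-length arrays."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    return [x + y for x, y in zip(a, b)]


def scale(a, k):
    """Multiply every element of the array by the scalar k."""
    return [k * x for x in a]


def shift(a, offset):
    """Whole-array shift: result[i] = a[i + offset], clamped at the edges."""
    n = len(a)
    return [a[min(max(i + offset, 0), n - 1)] for i in range(n)]


def convolve_1d(a, kernel):
    """Library-style function built only from the primitives above.

    The kernel is applied without flipping, which gives the same result
    as a true convolution whenever the kernel is symmetric.
    """
    center = len(kernel) // 2
    result = [0.0] * len(a)
    for i, weight in enumerate(kernel):
        result = add(result, scale(shift(a, i - center), weight))
    return result
```

Calling `convolve_1d([1.0, 2.0, 3.0, 4.0], [0.25, 0.5, 0.25])` smooths the array using only shifts, scalings, and element-wise additions; nothing in `convolve_1d` indexes an individual element.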
We have a paper available on our Wiki that describes in detail the kinds of primitives that Accelerator provides and the compilation approach that we use to generate reasonably efficient GPU code.
The point of Accelerator is to use data-parallelism to provide an easier way of programming GPUs and multi-cores, not to provide a set of domain-specific libraries.
You are correct that single-precision arithmetic will limit the use of GPUs for scientific computation. However, there are still lots of interesting things that you can do. You can look at http://www.gpgpu.org for more information (under "categories", look at "scientific computation"). There has also been some recent work on emulating double-precision floating point numbers using single-precision floating point numbers.
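As a rough illustration of the emulation idea, here is a sketch of the well-known "double-single" trick: a value is stored as an unevaluated sum of two single-precision numbers (`hi`, `lo`). This is not necessarily the technique used in the work mentioned above. Python has no native 32-bit float, so the hypothetical helper `f32` rounds through the `struct` module to mimic single-precision hardware, and all other names are made up for this example.

```python
import struct


def f32(x):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def ds_from_float(x):
    """Split a value into a (hi, lo) pair of single-precision numbers."""
    hi = f32(x)
    lo = f32(x - hi)
    return hi, lo


def two_sum(a, b):
    """Knuth's two-sum: return the rounded sum and its rounding error."""
    s = f32(a + b)
    bb = f32(s - a)
    err = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, err


def ds_add(x, y):
    """Add two (hi, lo) pairs using only single-precision operations."""
    s, e = two_sum(x[0], y[0])
    e = f32(e + f32(x[1] + y[1]))
    hi = f32(s + e)
    lo = f32(e - f32(hi - s))
    return hi, lo


def ds_to_float(x):
    """Recombine a (hi, lo) pair for inspection."""
    return x[0] + x[1]
```

With `x = 1.0 + 2.0 ** -30`, a plain single-precision add `f32(f32(x) + f32(x))` loses the small term entirely, while `ds_add(ds_from_float(x), ds_from_float(x))` keeps it in the `lo` component. The cost is several single-precision operations per emulated add.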
Programming in the Age of Concurrency: The Accelerator Project
Aug 27, 2006 at 8:55 PM
No, Accelerator can't use both in parallel. We wish we could.
It's a neat idea, but it's a harder problem because it changes the hierarchy of the memory. You have to figure out how to partition the program across that hierarchy. With a single GPU, you are accessing the local memory on the graphics card, which is very high-bandwidth (>50 GB/s). With multiple GPUs, unless you partition the problem just right, you may need to access memory on another card across the bus. PCI-Express is fast, but not nearly as fast as the memory on the graphics card.
Programming in the Age of Concurrency: The Accelerator Project
Aug 27, 2006 at 8:38 PM
Actually, if we computed all the intermediate arrays implied by the high-level code, performance would be disastrous on the GPU too, because you'd use way too much memory bandwidth and destroy the spatial locality.
All of the C# for-loops end up unrolled and you end up with one large expression graph being passed to the library. The graph would imply lots of intermediate arrays being computed.
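A toy sketch of how an ordinary loop can end up as one expression graph, assuming the library overloads operators to record work instead of doing it. This is plain Python with hypothetical names (`Node`, `input_array`, `count_ops`, `build_sum`), not Accelerator's real classes:

```python
class Node:
    """One vertex of the expression graph: an input array or an operation."""

    def __init__(self, op, args=()):
        self.op = op
        self.args = args

    def __add__(self, other):
        # No arithmetic happens here; we only record that an add was requested.
        return Node("add", (self, other))


def input_array(name):
    """Leaf node standing in for a data-parallel input array."""
    return Node("input", (name,))


def count_ops(node):
    """Count operation nodes, i.e. the intermediate arrays a naive evaluator would build."""
    if node.op == "input":
        return 0
    return 1 + sum(count_ops(arg) for arg in node.args)


def build_sum(num_inputs):
    """An ordinary for-loop; by the time it finishes, it has been unrolled into a graph."""
    total = input_array("a0")
    for i in range(1, num_inputs):
        total = total + input_array("a" + str(i))
    return total
```

Here `count_ops(build_sum(5))` reports four add nodes. Evaluated one node at a time, each of those would be a full intermediate array.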
We actually convert the graph to something of the following form:
1. For each output pixel of the convolution, execute a sequential piece of code.
2. The sequential piece of code fetches the neighboring pixels and adds them together.
The sequential piece of code corresponds to the body of the pixel shader. Now, if you want good performance, you need to traverse the output pixels in the correct order to preserve spatial locality. Fortunately, the GPU traverses the output pixels in a reasonable order (these are 2-D images, after all).
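The two steps above can be sketched in plain Python as follows. This is an illustration of the shape of the generated code, not the code Accelerator emits: `shader_body` and `convolve_2d` are hypothetical names, the kernel is assumed to be square, and edge clamping is an assumption. No intermediate arrays are created; each output pixel is computed by one sequential piece of code.

```python
def shader_body(image, x, y, kernel):
    """Sequential code for a single output pixel (the 'pixel shader' body).

    Fetches the neighboring pixels, weights them, and adds them together.
    Coordinates outside the image are clamped to the nearest edge pixel.
    """
    height = len(image)
    width = len(image[0])
    radius = len(kernel) // 2
    acc = 0.0
    for ky, kernel_row in enumerate(kernel):
        for kx, weight in enumerate(kernel_row):
            sy = min(max(y + ky - radius, 0), height - 1)
            sx = min(max(x + kx - radius, 0), width - 1)
            acc += weight * image[sy][sx]
    return acc


def convolve_2d(image, kernel):
    """Run the shader body once per output pixel.

    On a GPU the hardware chooses the traversal order; this sketch
    simply walks the rows top to bottom.
    """
    height = len(image)
    width = len(image[0])
    return [[shader_body(image, x, y, kernel) for x in range(width)]
            for y in range(height)]
```

Because every output pixel depends only on the input image, the calls to `shader_body` are independent and can run in any order or in parallel.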
Details of how we do this are described in our technical report (accessible from the Accelerator Wiki). The TR will soon be superseded by a paper that will appear in ASPLOS '06 that we hope does a better job of describing the details.
You are correct that it is quite difficult to capture the "intention" of a programmer. Our point was simple, which is that a good start would be to avoid over-specifying the behavior of the program, which is what happens if you write the code in C/C++ using for loops that specify the exact order in which individual array elements are accessed. One must wonder why Adobe had to hand-code the blocking that you describe and why a compiler couldn't do that. The answer, as you allude to, is that in the conventional high-performance computing approach, the compiler has to do some pretty heroic stuff.
To argue the other side, you could say that our approach results in a program that is too underspecified ... the area in between overspecified and underspecified is the interesting area to investigate.
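To illustrate what "over-specifying" means here, the sketch below contrasts the traversal order that a plain pair of for-loops pins down with a blocked (tiled) order. It is a generic illustration of loop blocking, not Adobe's code and not what Accelerator generates; the function names and the `tile` parameter are hypothetical.

```python
def row_major_order(height, width):
    """The order that nested for-loops over rows and columns hard-code."""
    return [(y, x) for y in range(height) for x in range(width)]


def blocked_order(height, width, tile):
    """Visit the same elements tile by tile to keep nearby accesses together."""
    order = []
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            for y in range(ty, min(ty + tile, height)):
                for x in range(tx, min(tx + tile, width)):
                    order.append((y, x))
    return order
```

Both functions visit exactly the same set of elements. For an element-wise operation the result is identical either way, so the order is an implementation detail that a compiler or runtime is free to pick when the source code does not dictate it.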
Programming in the Age of Concurrency: The Accelerator Project
Aug 27, 2006 at 8:19 PM
Yes, staged computation is definitely an interesting way to go. As you point out, some of the work done by the library could be done at "compile-time" (or at least earlier than is currently the case).
In general, this would also fit with the LINQ work that is going on. There is some interesting work by Don Syme on connecting F# (another MSR project) to Accelerator. See Leveraging .NET Meta-programming Components from F#: Integrated Queries and Interoperable Heterogeneous Execution, to be published at the ML Workshop, 2006, Portland, Oregon, available from Don's Web page at http://research.microsoft.com/~dsyme/publications.aspx.
Singularity Revisited
Dec 02, 2005 at 11:12 PM
I'm the person on the far right, wearing the dark long-sleeve shirt. Manuel is sitting to the left, then Galen, then Jim.
Singularity Revisited
Dec 02, 2005 at 10:31 PM
In addition, it allows each process to be garbage collected separately, increasing the scalability and robustness of the system. For example, you could run multiple garbage collections at the same time. You could also avoid a denial-of-service attack where someone allocates lots of data, causing other processes to slow down because a single garbage collector for the whole system can't keep up.
Finally, it simplifies tracking resource usage and reclaiming resources when a process ends.