SIMD + C# = Parallelism on a Single Core
This isn't something I usually highlight, but when Martin Woodward pointed me at this post, I thought you all might dig it.
Eoin Mullan has written up a great post on using a new .NET 4.6 feature that you might have overlooked or thought wasn't for you...
When Microsoft shipped .NET 4.6 last summer they also released a new 64-bit JIT compiler named RyuJIT. The main goal was to improve the load times of 64-bit applications, but it also allows developers to get more performance from modern processors via SIMD intrinsics. This post looks at what SIMD intrinsics are, how RyuJIT enables .NET developers to take advantage of them, some useful patterns for using SIMD in C#, and what sort of gains you can expect to see. A follow-up post takes a more detailed, low-level look at how it works.
The Basics of SIMD (Single Instruction Multiple Data)
A CPU carries out its job by executing instructions, and the specific instructions that a CPU knows how to execute are defined by the instruction set (e.g. x86, x86_64) and instruction set extensions (e.g. SSE, AVX, AVX-512) that it implements. Many SIMD instructions are available via these instruction set extensions.
There are more details on typical expected gains below, and the follow-up post looks at performance in more detail.
Just In Time (JIT) Compilation
Developers who compile directly to native code (e.g. using C and C++) have had access to SIMD intrinsics for a while, but the problem is that you need to know at compile time which instructions are available on your target machine. You need to be certain that the instructions in your binary are available on the target processor and so you must use only the subset of instructions common to all targets, or perform run-time checks before executing certain sequences. This is where a managed language’s JIT compiler is well placed.
C# (like all .NET languages) is compiled to an intermediate language called Common Intermediate Language (CIL, a.k.a. Microsoft Intermediate Language, MSIL), which is deployed to the target machine. When the application runs, the local .NET JIT compiler compiles the CIL to native code. The upshot here is that the JIT compiler knows exactly what type of CPU the target machine has, and so it can make full use of all instruction sets available to it. The new JIT compiler is, therefore, the key to making SIMD available to .NET developers.
Another advantage of accessing SIMD via the JIT compiler is that your application’s performance will improve in the future without ever being rebuilt and re-deployed....
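To see what the JIT compiler has decided for the machine it is running on, something like the following sketch can help. The class and method names (SimdInfo, Describe) are made up for illustration; Vector.IsHardwareAccelerated and Vector<T>.Count are the System.Numerics members it relies on.

```csharp
using System;
using System.Numerics;

// Hypothetical helper, not from the original post.
public static class SimdInfo
{
    // Reports whether Vector<T> operations are backed by SIMD hardware
    // and how many elements fit in one vector on this machine.
    public static string Describe()
    {
        return "Hardware accelerated: " + Vector.IsHardwareAccelerated +
               ", floats per vector: " + Vector<float>.Count +
               ", ushorts per vector: " + Vector<ushort>.Count;
    }
}
```

The counts are fixed for the lifetime of the process but can differ from machine to machine, which is why the same CIL can use wider registers on newer hardware.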
Using SIMD in C# code
The simplest and recommended way to use SIMD is via the classes and static methods in the System.Numerics namespace. Note that you need a sufficiently recent version of the System.Numerics and System.Numerics.Vectors assemblies. If they didn’t ship with your version of .NET, grab the System.Numerics.Vectors package from NuGet. Also ensure that “Optimize code” is enabled for your project, as is the default for release builds.
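As a rough illustration of the fixed-size types in System.Numerics, the sketch below uses Vector4 and Vector3. The FixedVectorDemo class and its method names are hypothetical; the operators and Vector3.Dot are part of the namespace.

```csharp
using System.Numerics;

// Hypothetical helper, not from the original post.
public static class FixedVectorDemo
{
    // Computes a + b * scale on all four components at once.
    public static Vector4 AddScaled(Vector4 a, Vector4 b, float scale)
    {
        return a + b * scale;
    }

    // Dot product of two 3-component vectors.
    public static float Dot(Vector3 a, Vector3 b)
    {
        return Vector3.Dot(a, b);
    }
}
```

Whether these calls compile down to SIMD instructions depends on the JIT and the processor, so treat any speed-up as something to measure rather than assume.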
Writing simple custom SIMD algorithms
It’s fairly straightforward to write your own custom SIMD algorithms using Vector<T>, where the generic type parameter can be any primitive numeric type. This type holds as many values of type T as can fit into the target machine’s SIMD register. Operations on objects of this type use SIMD intrinsics whenever possible. In order to write custom algorithms you just need to know how many values of the given type you can fit into the current processor’s register, and this is available in the static Vector<T>.Count property.
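A minimal sketch of this pattern, assuming the task is element-wise addition of two int arrays (the SimdArrays class and Add method are illustrative names, not taken from the original post): process Vector<int>.Count elements per iteration, then finish the leftover elements with a plain loop.

```csharp
using System;
using System.Numerics;

// Hypothetical helper, not from the original post.
public static class SimdArrays
{
    public static int[] Add(int[] left, int[] right)
    {
        if (left.Length != right.Length)
            throw new ArgumentException("Arrays must have the same length.");

        var result = new int[left.Length];
        int width = Vector<int>.Count; // elements per SIMD register
        int i = 0;

        // Vectorised part: one addition handles 'width' elements.
        for (; i <= left.Length - width; i += width)
        {
            var sum = new Vector<int>(left, i) + new Vector<int>(right, i);
            sum.CopyTo(result, i);
        }

        // Scalar tail for lengths that are not a multiple of 'width'.
        for (; i < left.Length; i++)
            result[i] = left[i] + right[i];

        return result;
    }
}
```

The scalar tail matters because the array length is rarely an exact multiple of the register width, and the width itself is only known at run time.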
More advanced SIMD algorithms
Finally, to illustrate a more advanced custom algorithm, imagine you need to perform real-time image analysis as efficiently as possible. Let’s say you have a stream of 4K images (3840 x 2160) and you need to find the darkest and brightest pixel values, and the average pixel brightness, for each image. This information could inform an auto-exposure algorithm and form the basis of more detailed analysis.
For clarity I’ll stick with monochrome, 16 bits per pixel images (this could easily be applied to 3 colour channels). I’ll show the max/min algorithm first, and then combine the average into it. Since the pixel values are 16-bit we’ll use Vector<ushort>, which allows simultaneous processing of e.g. 16 values on a processor with AVX2.
Minimum and maximum of an array
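One way the max/min search could be written with Vector<ushort> is sketched below. This is an illustration under my own naming (SimdMinMax, MinMax), not necessarily the code from the original post: it keeps a running minimum and maximum per SIMD lane, then reduces the lanes and any leftover elements with scalar code.

```csharp
using System;
using System.Numerics;

// Hypothetical helper, not from the original post.
public static class SimdMinMax
{
    // For an empty input this returns min = ushort.MaxValue and max = 0.
    public static void MinMax(ushort[] input, out ushort min, out ushort max)
    {
        int width = Vector<ushort>.Count;
        var vmin = new Vector<ushort>(ushort.MaxValue);
        var vmax = new Vector<ushort>(ushort.MinValue);
        int i = 0;

        // Each lane tracks the min/max of every 'width'-th element.
        for (; i <= input.Length - width; i += width)
        {
            var v = new Vector<ushort>(input, i);
            vmin = Vector.Min(v, vmin);
            vmax = Vector.Max(v, vmax);
        }

        // Reduce the per-lane results to single values.
        min = ushort.MaxValue;
        max = ushort.MinValue;
        for (int lane = 0; lane < width; lane++)
        {
            min = Math.Min(min, vmin[lane]);
            max = Math.Max(max, vmax[lane]);
        }

        // Scalar tail for the elements that did not fill a whole vector.
        for (; i < input.Length; i++)
        {
            min = Math.Min(min, input[i]);
            max = Math.Max(max, input[i]);
        }
    }
}
```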
Minimum, maximum and average of an array
To find the average pixel value we first need to calculate the total of all pixels in the image and then divide by the number of pixels. As with finding the minimum and maximum, we could use Vector.Add<ushort>() to find the total of each sub-array, but we would first need to convert the values to ulongs to avoid arithmetic overflows. This overhead wipes out any advantage of using SIMD, and such an algorithm is, in fact, much slower than a simple non-SIMD addition loop. It turns out that SIMD can’t help find the average, so the solution is to fold a non-SIMD summation into the above max/min code.
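A sketch of how the combined routine might look, assuming a scalar ulong running total is accumulated alongside the SIMD min/max (SimdStats and MinMaxAverage are illustrative names, and this is not necessarily the original post's code):

```csharp
using System;
using System.Numerics;

// Hypothetical helper, not from the original post.
public static class SimdStats
{
    // For an empty input: min = ushort.MaxValue, max = 0, average = 0.
    public static void MinMaxAverage(ushort[] input,
        out ushort min, out ushort max, out double average)
    {
        int width = Vector<ushort>.Count;
        var vmin = new Vector<ushort>(ushort.MaxValue);
        var vmax = new Vector<ushort>(ushort.MinValue);
        ulong total = 0; // wide enough that 16-bit pixel sums cannot overflow
        int i = 0;

        for (; i <= input.Length - width; i += width)
        {
            // Non-SIMD summation of the same block of pixels.
            for (int j = i; j < i + width; j++)
                total += input[j];

            // SIMD min/max of the block.
            var v = new Vector<ushort>(input, i);
            vmin = Vector.Min(v, vmin);
            vmax = Vector.Max(v, vmax);
        }

        // Reduce the per-lane min/max results.
        min = ushort.MaxValue;
        max = ushort.MinValue;
        for (int lane = 0; lane < width; lane++)
        {
            min = Math.Min(min, vmin[lane]);
            max = Math.Max(max, vmax[lane]);
        }

        // Scalar tail: leftover elements contribute to all three results.
        for (; i < input.Length; i++)
        {
            total += input[i];
            min = Math.Min(min, input[i]);
            max = Math.Max(max, input[i]);
        }

        average = input.Length == 0 ? 0.0 : (double)total / input.Length;
    }
}
```

Doing the summation in the same pass keeps each block of pixels hot in the cache while both the scalar and SIMD work touch it, though the actual benefit is something to benchmark on the target hardware.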
Since the release of .NET 4.6, developers have been able to take advantage of SIMD intrinsics without needing to know anything about the processor on which the code will be executed or invoking any specific SIMD instructions. Both the amount of complexity this adds to your code and the performance gains that can be achieved vary depending on the specific scenario. Considerable gains can often be achieved without too much effort and so it’s worth knowing where and how to use SIMD. As hardware and the .NET JIT compiler improve in the future so too will the performance of SIMD algorithms written today without ever being rebuilt and re-deployed.