Back to Profile: jimhogg


  • Getting the most out of the C++ compiler


    No, I'm afraid the auto-vectorizer doesn't use AVX in this first release.  (But high up on our TODO list!)

    In VS2010, the /arch:SSE and /arch:SSE2 switches, on x86, tell the compiler to use XMM registers for floating-point calculations, rather than x87 registers.  Similarly, the /arch:AVX switch tells the compiler to emit AVX instructions, rather than SSE instructions.  But in all those cases, it emits just SCALAR instructions.  It is only with the advent of the auto-vectorizer that the compiler emits SIMD instructions that make full use of the wide XMM vector registers.

    *AX, *BX.  Yes, sorry.  No excuse!  (As I replied to Bruce, above, I was more careful with this example in the blog)

  • Getting the most out of the C++ compiler


    1)  Yes, the real story would compute these floats in the low 32 bits of the XMM registers.  See the How it Works blog post for an explanation that addresses this 'untruth'.

    2)  Yes, aliasing complicates the story.  However, pointers don't always prevent vectorization.  In general, the auto-vectorizer will include runtime checks against aliasing.  Where possible, via whole-program optimization, it can sometimes prove the lack of aliasing, and therefore elide the runtime checks.  (I mentioned aliasing a couple of times during the blog; next episode covers it in a little more depth)
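
    In case it helps to picture what that runtime check amounts to, here is a hand-written sketch - the names and structure are mine, for illustration, not the compiler's actual generated code:

    ```cpp
    #include <cassert>
    #include <cstddef>

    // Illustrative sketch of the auto-vectorizer's runtime aliasing guard:
    // if [dst, dst+n) and [src, src+n) overlap, the vector loop could be
    // unsafe, so execution falls back to a scalar version of the loop.
    bool ranges_overlap(const float* dst, const float* src, std::size_t n) {
        return dst < src + n && src < dst + n;
    }

    void add_arrays(float* dst, const float* src, std::size_t n) {
        if (ranges_overlap(dst, src, n)) {
            // scalar fallback: element by element, safe for any overlap
            for (std::size_t i = 0; i < n; ++i) dst[i] += src[i];
        } else {
            // vector path: 4 elements per step (real generated code would
            // use SSE instructions operating on XMM registers)
            std::size_t i = 0;
            for (; i + 4 <= n; i += 4) {
                dst[i]     += src[i];
                dst[i + 1] += src[i + 1];
                dst[i + 2] += src[i + 2];
                dst[i + 3] += src[i + 3];
            }
            for (; i < n; ++i) dst[i] += src[i];   // scalar remainder
        }
    }
    ```

    When whole-program optimization proves the arrays distinct, the guard (and the scalar fallback) can be dropped entirely.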

    3)  In VS11, floating-point calculations use SSE instructions by default.  Auto-vectorization will produce the same results as the scalar SSE instructions.  (Yes, results might differ between 32-bit SSE and 80-bit x87.  I should have been explicit - I was comparing scalar SSE versus vector SSE versions of the program.)  In other cases, such as reductions, auto-vectorization CAN produce results different from scalar SSE code (due to non-associativity).  But we only perform auto-vectorization for such cases under the /fp:fast flag.
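
    To make the reduction point concrete, here is a sketch - my own illustration, not compiler output - of how a vectorized sum regroups the additions into partial sums, the reassociation that /fp:fast permits:

    ```cpp
    #include <cassert>

    // Scalar reduction: associates as (((s + a0) + a1) + a2) ...
    float sum_scalar(const float* a, int n) {
        float s = 0.0f;
        for (int i = 0; i < n; ++i) s += a[i];
        return s;
    }

    // Hand-written analogue of a vectorized reduction: four partial sums
    // combined at the end.  Same additions, different association - so
    // float results can differ from sum_scalar in the low bits, which is
    // why this transformation requires /fp:fast.
    float sum_vectorized_shape(const float* a, int n) {
        float p0 = 0, p1 = 0, p2 = 0, p3 = 0;
        int i = 0;
        for (; i + 4 <= n; i += 4) {
            p0 += a[i];
            p1 += a[i + 1];
            p2 += a[i + 2];
            p3 += a[i + 3];
        }
        float s = (p0 + p1) + (p2 + p3);
        for (; i < n; ++i) s += a[i];   // scalar remainder
        return s;
    }
    ```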

    4)  I was speaking about this particular example - future compiler improvements should raise the speedup above 2.9X, heading towards 4X.  For other loops, there are, of course, many factors that limit the speedup.  (The topic of a future blog post, already drafted but not published).

    I'd encourage folks to read the auto-vectorization blog - it includes about 6 posts now, allowing more time to dig into details than the brief 15 minutes available in this video.

  • Getting the most out of the C++ compiler


    Here are brief answers:

    • Does the array size need to be known at compile time?  No, but indexing into that array is limited to forms like [i * K + j] where K must be a compile-time constant.  If you are flattening a 2-D array onto one dimension, then K appears as the row length.
    • Do you need to use the index syntax (array[i]) or can you use pointers? What about iterators?  Pointers work.
    • Can the compiler vectorize operations on a std::vector<> ?  Sometimes.
    • Are there any operations that will prevent the compiler from vectorizing? ex. branching, trigonometry functions etc.  Conditionals, yes.  Trig functions, no.
    • If the compiler detects a cross-iteration dependency on one of the many operations in a loop, will it split the work in one vectorized loop and one scalar loop?  For certain patterns, yes.  But I'd shy away from saying we've conquered this in first release.
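
    To illustrate the index form mentioned in the first answer, here is a sketch of a flattened 2-D array; the names and the value of K are illustrative:

    ```cpp
    #include <cassert>

    // K is the row length, and a compile-time constant - so the index
    // form a[i * K + j] is one the auto-vectorizer can analyze.
    const int K = 8;

    void scale_rows(float* a, int rows, float f) {
        for (int i = 0; i < rows; ++i)
            for (int j = 0; j < K; ++j)      // inner loop: unit stride
                a[i * K + j] *= f;
    }
    ```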

    As Charles mentioned, please check out the blog - it answers most of these topics in more detail.  (Just published episode 6 earlier today)

  • Getting the most out of the C++ compiler

    @Granville Barnett:

    Phoenix lives on quietly.  A slimmed-down version of the framework was adopted into a Microsoft-internal project.

  • GoingNative 7: VC11 ​Auto-​Vectorizer, C++ NOW, Lang.NEXT

    @Robin Davies:

    I've started a blog that discusses auto-vectorization in more depth.

  • GoingNative 7: VC11 ​Auto-​Vectorizer, C++ NOW, Lang.NEXT

    @Derp - sorry for the late reply.  Let me see if I am following you right:

    If you are asking whether MSVC gets the right answer for the following snippet, the answer is yes - for both a Debug (/Od) and Release (/O2) build.  I.e., it correctly handles both no-overlap and exact-overlap.

    I'm not sure whether you are concerned that MSVC produces a wrong answer in the presence of __restrict (we don't know of any).  Or whether we ignore opportunities for optimizations (as permitted by the standard) that __restrict makes possible?



    int a[] = {1, 2, 3};
    int b[] = {4, 5, 6};
    vadd1(a, b, 3);   // no overlap
    vadd1(a, a, 3);   // exact overlap
    vadd1(b, b, 3);   // exact overlap
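
    For readers following along: Derp's original vadd1 isn't reproduced in this thread.  A plain version - without __restrict - that matches the calls above would look like this sketch:

    ```cpp
    #include <cassert>

    // Plausible reconstruction of the vadd1 under discussion (the original
    // is not shown in this thread).  With no __restrict on the parameters,
    // calling it as vadd1(a, a, 3) - exact self-overlap - is well defined.
    void vadd1(int* dest, int* src, int n) {
        for (int i = 0; i < n; ++i)
            dest[i] += src[i];
    }
    ```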

  • GoingNative 7: VC11 ​Auto-​Vectorizer, C++ NOW, Lang.NEXT


    Already discussed above:  "Unaligned references?  We generate those versions of SSE instructions that support unaligned access"

    Plumbing alignment checks into the compiler is not straightforward.  For example, a caller may correctly align his array, arr say, on a 16-byte boundary, but then call a function and pass it an argument of &arr[1].  Suddenly the callee must handle a pointer that is no longer 16-byte aligned.  The callee can check alignment at runtime, ok; but, back at compile time, the compiler has to weigh whether to generate SSE instructions that assume unaligned access, or SSE instructions that assume aligned access (faster) - resulting in two versions of the code.
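
    A tiny sketch of that scenario, with illustrative names:

    ```cpp
    #include <cassert>
    #include <cstdint>

    // The caller aligns arr on a 16-byte boundary.  But pass &arr[1] to a
    // callee, and the callee's pointer is offset by sizeof(float) == 4
    // bytes - it can no longer assume 16-byte alignment, so the compiler
    // must either check at runtime or emit unaligned-access instructions.
    bool is_16_byte_aligned(const void* p) {
        return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
    }

    alignas(16) float arr[8];   // caller's correctly aligned array
    ```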

  • GoingNative 7: VC11 ​Auto-​Vectorizer, C++ NOW, Lang.NEXT

    @Derp.  This question ranges more broadly than auto-vectorization.  And I'm not sure I follow the question exactly.  But, some points/questions:

    • "restrict" is a C99 keyword.  However, several/most C++ compilers have provided an analogous construct for several years.  For MSVC, it's __restrict (or __declspec(restrict))
    • with that nit out of the way, your first example specifies dest and src with restrict.  The compiler is not obligated to do anything with this assertion, but it may.  And if you call vadd1 with arrays that overlap, partially or in total, you just broke the restrict contract.  So compiler behavior is undefined - you may get the answer you would like; you may get an answer you dislike!

    Maybe I've misunderstood?  Certainly, if the first example did not use restrict, then I could see our discussion would be very different.

  • GoingNative 7: VC11 ​Auto-​Vectorizer, C++ NOW, Lang.NEXT

    @Steve Wortham . . . "But is performing 4 addition operations (for example) in one instruction really 4 times faster than doing them one at a time?"

    No!  As you suspect, the speedup achieved depends on several factors, including:

    1. The overhead of "loop mechanics" - increment counter, compare with limit, branch.  (Regular loop-unroll optimization attempts to reduce this very factor)
    2. How much computation goes on in the loop body.  If large, then it dominates item 1, improving the effective speedup.  Eg: float computation is heavier than analogous int computation - both vectorize, but the effective speedups differ
    3. Cache misses: if the arrays are large, then L1 cache misses can totally negate the optimizations otherwise achieved by vectorization

    I'll add these as issues to take up later, in the blogs.


  • GoingNative 7: VC11 ​Auto-​Vectorizer, C++ NOW, Lang.NEXT

    @msdivy - auto-vectorization works on AMD hardware too.  I just did not say so, since AMD and Intel chip architectures are so very similar.  (There is a handful of instructions unique to either, which we avoid generating)

  • GoingNative 7: VC11 ​Auto-​Vectorizer, C++ NOW, Lang.NEXT

    @abcs.  Answers to some questions.  Some others I'll defer to future blogs.

    Does AutoVec perform gather/scatter?  Yes, under some circumstances.  For example, loops that reference a field in an array of structs.
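
    For example, a loop of this shape - an illustrative sketch, not code from the product - reads a single field out of each struct, a strided, gather-style access pattern:

    ```cpp
    #include <cassert>

    struct Particle { float x, y, z, mass; };

    // Reading just 'mass' from each element touches memory with a stride
    // of sizeof(Particle), rather than contiguously - a gather pattern.
    float total_mass(const Particle* p, int n) {
        float sum = 0.0f;
        for (int i = 0; i < n; ++i)
            sum += p[i].mass;
        return sum;
    }
    ```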

    Vectorized math library?  Yes, we cover all of the functions in "math.h"

    Unaligned references?  We generate those versions of SSE instructions that support unaligned access.  (As I'm sure you know, checking statically which refs are aligned, in order to elide runtime alignment checks, is challenging).  Yes, glad to note that Nehalem+ microarchitectures have reduced the hit for unaligned accesses.


  • GoingNative 7: VC11 ​Auto-​Vectorizer, C++ NOW, Lang.NEXT

    @Robert.  Nice question.  The explanation is a little involved.  Here goes:

    Compilers typically divide into two parts: a frontend that translates source text, such as C++, into some intermediate representation of the original program - VC++ uses tuples (think annotated, binary assembly code) as its intermediate rep - and a backend that consumes the tuples, optimizes them, and generates corresponding machine code.

    So the original code might use std::vector<T>.  But the backend sees just tuples - where the C++ abstraction has been "lowered" to its concrete representation: a C-array, with a few extra locations used to track current size and capacity, and a method that's called when required to grow the array.  AutoVec can work on this, just as well as if it had been given a raw C-array.
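
    So a loop over a std::vector, like this sketch, is - by the time the backend sees it - just indexing into a contiguous buffer:

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <vector>

    // To the backend, v[i] has been lowered to plain indexing into the
    // vector's contiguous C-array storage - a form AutoVec can handle
    // just as well as a raw array.
    void double_all(std::vector<int>& v) {
        for (std::size_t i = 0; i < v.size(); ++i)
            v[i] *= 2;
    }
    ```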

    The same holds true for more exotic cases, such as when a user overloads the index operator to do something fancy.  By the time the code reaches the backend, it's been reduced to equivalent tuples.  We attempt to vectorize those tuples.  That attempt either proves successful, and correct; or vectorization is not attempted, and the original scalar code remains, still correct.  All optimizations, including AutoVec, are always conservative, and thereby safe.