
GoingNative 7: VC11 Auto-Vectorizer, C++ NOW, Lang.NEXT


In this installment of GoingNative, it's all about the latest C++ compiler technology from Microsoft. Most of the time is spent discussing VC11's Auto-Vectorizer, with a few short forays into other VC compiler improvements (like the Auto-Parallelizer). You meet the lead developer for VC11's backend compiler and the architect of the Auto-Vectorizer, Jim Radigan (who spends most of the time at the whiteboard). You also meet backend compiler PM Jim Hogg, a C9 veteran and one of the original folks behind the Phoenix Compiler Project.

In order to keep the conversation palatable to a large number of folks, we don't get into the math behind auto-vectorization. However, if this is something that really interests you, then we can get Jim to do a lecture on the internals (will take more than one session, of course—a lot of stuff goes on behind the scenes when you take a loop of arbitrary complexity and determine if it's vectorizable and then vectorize it with maximum efficiency...). Now, on to AutoVec.

The VC11 compiler includes a feature called Auto-Vectorization, or AutoVec for short. AutoVec tries to make loops in your code run faster by using the SSE, or vector, registers present in all current processors. The feature is on by default, so, like other optimizations the compiler performs, you don't need to do anything more to benefit. However, this session provides more background on what is going on, and digs a little into the kinds of sophisticated analyses that AutoVec performs and the loop patterns that it successfully speeds up.

Here's a trivial example of a loop that gets automatically vectorized in VC11 with significant performance gains:

for (int i = 0; i < 100000; i++)
{
    a[i] = b[i] + c[i];
}

The auto-vectorizer transforms the above tight loop into machine instructions that run the loop 4x faster on SIMD-capable (SSE/SSE2) processors. As Jim and Jim discuss, this is because each loop iteration simultaneously performs 4 computations using the modern CPU's vector registers. This is a great automatic optimization feature in VC11. Tune in and meet a couple of the key folks behind VC's Auto-Vec!
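
To make the transformation concrete, here is roughly what the vectorized loop amounts to, written by hand with SSE2 intrinsics. This is a sketch only, not the compiler's actual output; it assumes a, b, and c are int arrays, and vadd is an invented name:

#include <emmintrin.h>   // SSE2 intrinsics

// Hand-written sketch of the idea, not the compiler's actual output.
// Assumes int arrays whose length is a multiple of 4.
void vadd(int* a, const int* b, const int* c)
{
    for (int i = 0; i < 100000; i += 4)
    {
        __m128i vb = _mm_loadu_si128((const __m128i*)&b[i]); // load 4 ints from b
        __m128i vc = _mm_loadu_si128((const __m128i*)&c[i]); // load 4 ints from c
        __m128i va = _mm_add_epi32(vb, vc);                  // 4 additions in 1 instruction
        _mm_storeu_si128((__m128i*)&a[i], va);               // store 4 results to a
    }
}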

Table of Contents:

[00:00] Diego and Charles construct the show (C++NOW, Some news, Auto-Vectorizer in VC11 compiler)
[04:45] Charles interviews VC backend compiler lead developer Jim Radigan and backend compiler PM Jim Hogg
[52:03] Diego and Charles destruct the show (longer than usual, but worth the delay - Lang.NEXT, C&B 2012, C++ + XAML + DX, WRL Documentation)


Follow the Discussion

  • This is very interesting. And while I would like to see how my existing code would benefit from this, the fact that VS11 will not target Win XP will prevent me from using this in production code. It's really a shame. Maybe in a few years I'll finally be able to start using a lot of the new features VS11 C++ compiler is offering. I agree that talking with the compiler guys is very informative!
  • HomerSimpian wrote:

    This is very interesting. And while I would like to see how my existing code would benefit from this, the fact that VS11 will not target Win XP will prevent me from using this in production code. It's really a shame. Maybe in a few years I'll finally be able to start using a lot of the new features VS11 C++ compiler is offering. I agree that talking with the compiler guys is very informative!

    Be fair, Windows XP is more than 11 years old now. Back then, if you wanted parallel computing in the box, you went for some horribly expensive single-core SMP system. We've come a long way since then. You have to vote folks off the island eventually if you want the platform to go anywhere, otherwise you end up with a flat growth curve... wait a minute... Perplexed

  • Charles (Welcome Change)

    Please don't turn this thread into more no-VC11-targeting-XP gripes. It has nothing to do with the topic at hand. You can engage the XP topic all you care to on reddit, etc... Not here. OK?

    Please focus - and ask questions - on AutoVec, AutoP, and other VC11 compiler optimizations.

    Thanks,
    C

  • Fair enough. My gripe was that as much as I want to use this stuff, I cannot in a work setting, as we have to target XP.  But don't let that stop the progress in the compiler.  One day I'll be using it... just not soon enough Smiley

    I really do enjoy these topics and am happy that MS is tackling them.  How does AutoP work with C++ AMP?  Does AutoV/P work with C++11 iterations?  Does the compiler recognize only the 'for' keyword, or does it recognize other loop constructs such as 'while' or std::for_each()?

     

  • Very interesting show.  It's always good to get more performance-for-free features tossed in.

    I, for one, would very much like to see some special sessions taking a deep dive into the internals of auto-vectorization and the tricks used to vectorize code that doesn't obviously lend itself to that process.  The nature of my work often presents me with loop dependencies that do not vectorize/parallelize well.  Anything you can provide, either through human communication or through tooling, to help programmers code in more readily-vectorizable styles would be of value too.

    Whatever happened to project Phoenix anyway?  We haven't heard of it in a long time.  I assume it has been assimilated into other projects.  We can see pieces of it in things like Roslyn, but it seems like there was a lot more to it that hasn't popped back up.

     

    @Homer: AutoVec benefits any C++ code compiled with the VS11 compiler.  So, those parts of an AMP application that run on the CPU (ie, non-kernel routines) get that benefit.  As always, your mileage will vary: simple compute-bound loops that match the patterns that AutoVec recognizes can gain a good boost in performance; others will see no change.  How much does the performance improve overall?  See "Amdahl's Law".
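
    (For reference: Amdahl's Law says that if a fraction p of total run time is spent in the loops that vectorize, and those loops speed up by a factor s, the overall speedup is 1 / ((1 - p) + p/s). So loops accounting for 10% of run time, even at a perfect 4x, give only about a 1.08x overall gain.)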

    AutoVec will also recognize those while loops that can be transformed into an equivalent for loop.  For example: 

    int n = 0; 
    while (n < 99) 
    { 
       a[n] += b[n]; 
       ++n; 
    } 
    

    will vectorize nicely.
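
    (That is, the backend first rewrites it into the equivalent counted loop, something like:)

    for (int n = 0; n < 99; ++n)
    {
        a[n] += b[n];
    }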

    @ryanb: we are preparing a series of blogs that dig into some of AutoVec's "internals".  In particular, the shape of loops that AutoVec works on, and why.

    Phoenix? - yes, it still exists, and is being worked on, but it has been absorbed into other projects.

    Jim

     

  • Great interviews!

    Definitely, more compiler guys in the future sounds fantastic! Smiley

    I'll use this opportunity to ask a few questions -- feel free to sneak them into future interviews, although answers in the comments aren't bad at all, of course! I'll try to tie them to the video for better context Smiley


    0.  Loop vectorization limitations and requirements (the example for-loop; also reduce/sum around 26:50)
    Are there any divisibility requirements (XMM registers are 128-bit; what if we have more or less data than an amount divisible by this)? Is there support for automatic duplication/truncation/padding, or is such an irregular loop (i.e., one with a data length not satisfying the divisibility property) removed from the optimizer's considerations for now, or are there techniques to handle this?

    1. Beyond {SSE(1), SSE2, AVX}; in particular, SSE4.1 (mentioned around 22:50)

    a. Are there any plans for the SSE4.1 support?
    I admit I'm asking since I have a vested interest here, numerical linear algebra is quite useful in what I do and there are some interesting instructions in this set, such as:
    DPPS, DPPD (dot product a.k.a. inner product) // they're useful for a lot of applications, in fact: http://www.virtualdub.org/blog/pivot/entry.php?id=150
    // correspond to _mm_dp_ps, _mm_dp_pd intrinsics --  http://msdn.microsoft.com/en-us/library/bb514034%28v=vs.110%29.aspx

    Intel did a mini-benchmark some time ago demonstrating a speed-up of the dot product computation in the examples; already the SSE3 version (using HADDPS) was 26% faster, while the SSE4 version (using DPPS) was 72% faster than the base case:
    http://www.intel.com/technology/itj/2008/v12i3/3-paper/6-examples.htm

    This makes SSE4.x very exciting, AutoVec support would be great here!

    b. Another, perhaps more far-reaching question -- if/when this gets supported, will there be an integration with the STL, such that, say, std::inner_product would automatically make use of the above instructions where applicable?


    2. Comparison of the current features and future evolution thereof with GCC (benchmarking w/ GCC mentioned around 41:50)

    Some topics of interest:
    a. GCC Graphite comparison -- how does the AutoVec fare, relatively?

    Example: http://openwall.info/wiki/internal/gcc-local-build#Parallel-processing-with-GCC
    Features/flags" -floop-parallelize-all -ftree-parallelize-loops=8

    There's a nice discussion of some topics for GCC that could be interesting to relate to:
    - limitations // http://gcc.gnu.org/wiki/Graphite/Parallelization
    - behind the scenes // http://gcc.gnu.org/wiki/Graphite?action=AttachFile&do=view&target=graphite_lambda_tutorial.pdf
    What were the analogous implementation choices and the resulting limitations in AutoVec?

    // out of curiosity -- is the polyhedral model /* http://en.wikipedia.org/wiki/Frameworks_supporting_the_polyhedral_model */ also employed by AutoVec or is it something different here?

    b. Profile Mode // "Goal: Give performance improvement advice based on recognition of suboptimal usage patterns of the standard library."

    This is actually pretty cool and integrates nicely with the C++ STL -- e.g., if you try a sub-optimal insertion pattern with std::vector you'll get nice, human-readable advice suggesting std::list:
    http://gcc.gnu.org/onlinedocs/libstdc++/manual/profile_mode.html#manual.ext.profile_mode.using

    Is there a similar feature in plans?


    3. Compilation back-end parallelization and inlining

    Does the parallel compilation work with inlining? For instance, in the discussed case of the {main-foo-a1, main-bar-a2} call tree (around 39:20), if "foo" gets inlined or "a2" gets inlined (note the depth change), does the compiler have to recompile in either of these cases?


    4. Devirtualization (around 40:50) -- limits/changes.

    Some devirtualization was already available a while ago: http://msdn.microsoft.com/en-us/magazine/cc301407.aspx
    What are the most interesting changes in the current release / what limits have been pushed / what limits remain?


    Once again, thanks for the great episode!

  • STL

    > Another, perhaps more far-reaching question -- if/when this gets supported, will there be an integration with STL, such as, say, std::inner_product would automatically make use of the above instructions where applicable?

    One of my todos is to go through the STL and look for opportunities to make our algorithms more friendly to the autovectorizer.  I believe that this will be mechanically very simple, if all that's necessary is detecting raw pointers at compile-time (either all raw pointers, or just raw pointers to scalars) and using index-based instead of pointer-based loops.  Such compile-time logic is trivial in the Standard Library (we have tons and tons of it already, some to call memmove()/etc.), it's just that the STL has a lot of algorithms and auditing them to figure out which ones would benefit will take time.  I won't be able to get to this for VC11 because it's "nice to have" and I've been dealing with must-fix issues, but it is definitely on my radar.
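
    (To illustrate the kind of compile-time dispatch involved, here's a purely hypothetical sketch with invented names -- not actual STL code: detect raw pointers and take an index-based path.)

    #include <cstddef>
    #include <type_traits>

    // Purely hypothetical sketch, not actual STL code; 'add_n' is an invented name.
    template <typename It1, typename It2>
    void add_n_impl(It1 dest, It2 src, std::size_t n, std::true_type)
    {
        for (std::size_t i = 0; i < n; ++i)   // index-based loop: autovectorizer-friendly
            dest[i] += src[i];
    }

    template <typename It1, typename It2>
    void add_n_impl(It1 dest, It2 src, std::size_t n, std::false_type)
    {
        for (std::size_t i = 0; i < n; ++i, ++dest, ++src)   // generic iterator path
            *dest += *src;
    }

    template <typename It1, typename It2>
    void add_n(It1 dest, It2 src, std::size_t n)
    {
        // Dispatch at compile time on "both arguments are raw pointers".
        add_n_impl(dest, src, n,
            std::integral_constant<bool,
                std::is_pointer<It1>::value && std::is_pointer<It2>::value>());
    }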

    > Is there a similar feature in plans [to libstdc++'s profile mode]?

    I've heard of that, but I haven't heard of user experiences with it, nor am I convinced that it is actually useful in practice.

  • Windows XP is 11 years old, and VC is finally dropping support for it.

    SSE2 is also 11 years old, but VC is only starting to support vectorization.

    This begs the question: What changes in the landscape prompted Microsoft to make autovectorization a priority for VC11 now in 2012? Compared to other optimizations, why was the addition of autovectorization into VC not justified in previous releases? Especially considering that even the now 10-year-old Intel C++ 7.0 supported this performance feature.

    (20'30") VC11 recognizes non-unit stride array references. Does this imply that VC11 implements gather/scatter-style vectorization (movsd/movlpd + movhpd)?

    (23'30") VC11 is capable of replacing a loop with library calls. Besides memset/memcpy, what other idioms are recognized?

    (30'00") VC11 has an equivalent of Intel's SVML for vectorizing transcendental functions. What functions are covered? What are their accuracy in terms of ulps? How do they compare against SVML in performance? SVML requires the default floating point enviroment (rounding mode, etc.). Does VC11 have the same limitation?

    (35'20") The issue of data alignment was brought up. Does VC11 generate multiple versions of a loop when data alignment is unknown or simply use load/store sequences (movsd + movhpd or movupd) for unaligned data? How does it cater to microachitecture characteristics of processors of different generations and/or from different vendors, especially when their instruction set support is identical? For example, assuming only SSE2 support, pre-Nehalem Intel processors have great latencies with unaligned accesses, but K10+ AMD processors have no problem with that.

    (37'00") Charles asked about targeting GPU using pure C++ without language extensions. What is Microsoft's vision on OpenACC? Furthermore, OpenMP is expected to absorb OpenACC when the latter matures. Is Microsoft considering going beyond OpenMP 2.0 and supporting more declarative parallel programming models?

    (45'00") SPEC2006 was mentioned. How does VC11's generated code performance fare against state-of-the-art vectorizing compilers such as Intel C++ in SPEC2006? Besides benchmarks, how much benefit does autovectorization bring about when compiling Microsoft products?

     

  • felix9 (the cat that walked by itself)

    Maybe you could talk about MSTest-Native in the future. Smiley

  • felix9 (the cat that walked by itself)

    abcs wrote:

    This begs the question: What changes in the landscape prompted Microsoft to make autovectorization a priority for VC11 now in 2012?

    Perhaps the same reason as the whole 'C++ Renaissance': we need more performance when targeting mobile devices like phones.

     P.S. I believe Phoenix belongs to David Tarditi now.

  • C Huson

    Hi,

    Are you using runtime disambiguation of pointers? I remember that was a primary roadblock for our vectorizer targeting C and C++ ages ago. In your example above, if a, b and c are parameters of a method, you have to detect overlaps. (The same is true of auto-parallelization with relaxed memory models.) I was always surprised at the tricks people play with code.

    Fun to see someone else doing this after so long.

    Regards,
    Chris

  • Excellent interview and discussion.

    I'm encouraged by Jim's near-slip about getting a v2 of AV/AP out as quickly as possible after Dev 11 (in good agreement with Herb Sutter's hints that C++ interim releases may be coming).

    We harped on perf being important and this proves you listened.

    The tooling and logs Mr. Hogg imagines near the end will be critical to helping us know whether AV/AP is in effect, or how far off my code is from playing (ex. imagine a warning like "Autovectorization skipped for loop (line 245) in foo() because variable ptrA is not aligned on a 4 byte boundary...")

  • Chris

    I would like to point out that the compiler team has nothing to do with the CRT and MFC not supporting XP. That's the libraries team. However, the library developers don't make such decisions on their own, so the managers are responsible.

    Anyway, nice episode.

  • Robert

    I apologize if this is a stupid question, but this seems to imply that some level of thread synchronization will be needed. For example, if I've overloaded the index operator and it modifies some shared state, that member variable would need to be made atomic, otherwise the vectorization would try to modify it from four simultaneous locations. Am I misunderstanding this? I've never used something like this, so I can only imagine it functioning like four threads, and thus needing some level of synchronization.

  • @Matt.  Good questions.  I'll answer the easy ones, and leave the deeper ones for the upcoming blogs.

    0.  "divisibility".  AutoVec takes care of that.  So if we have a loop over an int[999 ] array, we vectorize the first 996 iterations into 996/4, and tidy-off with a scalar loop over the 3 elements left.

    1 a.  SSE4.1.  Maybe a glitch in what we said.  AutoVec does already make use of some SSE4.1 instructions.

    1 b.  See Stephan's reply above

    2.  GCC.  I think this was a misunderstanding that we did not explain clearly.  We were not talking about comparisons with GCC's auto-vectorizer.  Just talking about one of the tests in the Spec2006 benchmark, which is a compilation of some (old? - don't know) version of GCC.

    3.  Inlining.  If the inliner phase has decided that a particular callsite should be inlined, then AutoVec sets to work upon the function it is given.  In this sense, inlining and AutoVec work together.

    Jim

  • C Huson: Yes AutoVec of course includes runtime checks for aliasing (eg, where arrays a, b or c partially overlap).  C++ declares that such overlaps are legal, and so we HAVE to do this, else vectorization would give wrong results.
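
    In outline, the generated code guards the vector loop with a test along these lines (a simplified sketch for two n-element arrays, not actual compiler output):

    if (a + n <= b || b + n <= a)   // ranges [a, a+n) and [b, b+n) don't overlap
    {
        // ... run the vectorized loop ...
    }
    else
    {
        // ... fall back to the original scalar loop ...
    }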

    Jim

  • @Robert.  Nice question.  The explanation is a little involved.  Here goes:

    Compilers typically divide into two parts: a frontend that translates source text, such as C++, into some intermediate representation of the original program (VC++ uses tuples - think annotated, binary assembly code - as the intermediate rep), and a backend that consumes the tuples, optimizes, and generates corresponding machine code.

    So the original code might use std::vector<T>.  But the backend sees just tuples - where the C++ abstraction has been "lowered" to its concrete representation: a C-array, with a few extra locations used to track current size and capacity, and a method that's called when required to grow the array.  AutoVec can work on this, just as well as if it had been given a raw C-array.

    The same holds true for more exotic cases, such as when a user overloads the index operator to do something fancy.  By the time the code reaches the backend, it's been reduced to equivalent tuples.  We attempt to vectorize those tuples; that attempt either proves successful, and correct, or vectorization is not attempted, which is also correct.  All optimizations, including AutoVec, are always conservative, and thereby safe.
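
    For instance (an invented example), an index operator that updates shared state creates a dependence between loop iterations, so a conservative vectorizer would be expected to leave a loop like this one scalar:

    struct Counted
    {
        int data[1000];
        int hits;                                            // shared state
        int& operator[](int i) { ++hits; return data[i]; }   // side effect on every access
    };

    void fill(Counted& v)
    {
        for (int i = 0; i < 1000; i++)
            v[i] = i;   // after inlining, every iteration also increments v.hits
    }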

    Jim

     

  • @abcs.  Answers to some questions.  Some others I'll defer to future blogs.

    Does AutoVec perform gather/scatter?  Yes, under some circumstances.  For example, loops that reference a field in an array of structs.
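
    For instance (a made-up example), a loop that reads one field from an array of structs strides through memory sizeof(Particle) bytes at a time, so the elements must be gathered into a vector register:

    struct Particle { float x, y, z, mass; };   // invented example type

    void scale_masses(const Particle* p, float* weight, int n)
    {
        for (int i = 0; i < n; i++)
            weight[i] = p[i].mass * 9.8f;       // non-unit-stride reads of the mass field
    }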

    Vectorized math library?  Yes, we cover all of the functions in "math.h"

    Unaligned references?  We generate those versions of SSE instructions that support unaligned access.  (As I'm sure you know, checking statically which refs are aligned, in order to elide runtime alignment checks, is challenging).  Yes, glad to note that Nehalem+ microarchitectures have reduced the hit for unaligned accesses.

    Jim

  • How does AutoP compose with ConcRT? Will there be oversubscription?

  • @jimhogg: does VC correctly recognize the "exact overlap with restrict" idiom (even with tools like PGO)?

    void vadd1(T * restrict dest, const T * restrict src, const size_t n)
    {
        const T * s = dest == src ? dest : src;
    
        for (size_t i = 0; i != n; ++i) {
            *dest++ += *s++;
        }
    }


    s is either based on dest or based on src, meaning vadd1 supports either non-overlapping or exactly overlapping ranges (no partial overlap). If you wrote this instead:

    void vadd2(T * restrict dest, const T * restrict src, const size_t n)
    {
        for (size_t i = 0; i != n; ++i) {
            *dest++ += *src++;
        }
    }


    This code would be undefined when you pass in two exact overlapping ranges (i.e. vadd2 only supports non-overlapping ranges). In other words Visual Studio is not allowed to rewrite vadd1 as:

    const T * s = dest == src ? src : src;  // or, equivalently:
    const T * s = src;


    Because these expressions no longer carry the "based on dest" as the original vadd1 does and would be undefined for exact overlap.

  • Jay

    Nice work! I was wondering how the auto-parallelizer works. Does it create multiple threads and partition the workload over them? If so, how do you reduce the overhead of creating threads at every loop that is a candidate for the auto-parallelizer?

  • Charles (Welcome Change)

    @Jay: We didn't spend much time on AutoP, but I'd imagine Jim and company will be blogging about it. Poor Jim has many questions to answer - and a lot of real work to do, too Smiley

    C

  • Good to know software is finally making use of hardware that was available a decade ago.

    In the video, Jim R seems to imply that this helps Intel and ARM processors. Why did he miss AMD?

  • @msdivy - auto-vectorization works on AMD hardware too.  We just did not say so, since AMD and Intel chip architectures are so very similar.  (There is a handful of instructions unique to each, which we avoid generating.)

  • I read that article that he mentioned at the beginning...great stuff.

  • WaldemarWaldemar

    That was extremely cool!

  • Very interesting talk.  I'd love to see this worked into .NET as well.  Performance for free is always welcome.

  • So there was one thing I was thinking about during this whole interview which I don't think was addressed.  Basically, from what I've observed in the past, not all CPU instructions are created equal.  Some have a higher cost than others.

    I don't doubt that vectorizing and fitting multiple operations into a single specialized instruction is faster.  But is performing 4 addition operations (for example) in one instruction really 4 times faster than doing them one at a time?

  • @Steve Wortham . . . "But is performing 4 addition operations (for example) in one instruction really 4 times faster than doing them one at a time?"

    No!  As you suspect, the speedup achieved depends on several factors, including:

    1. The overhead of "loop mechanics" - increment counter, compare with limit, branch.  (Regular loop-unroll optimization attempts to reduce this very factor)
    2. How much computation goes on in the loop body.  If large, then it dominates item 1, improving the effective speedup.  Eg: float computation is heavier than analogous int computation - both vectorize, but the effective speedups differ
    3. Cache misses: if the arrays are large, then L1 cache misses can totally negate the optimizations otherwise achieved by vectorization
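
    To put rough numbers on these factors (invented purely for illustration): suppose a scalar iteration costs 1.0 unit of arithmetic plus 0.5 of loop mechanics, i.e. 1.5 units per element, while a 4-wide vector iteration pays roughly the same 1.5 units for 4 elements, i.e. 0.375 per element - the ideal 4x. Add a fixed 2.0 units per element of memory-stall time on a cache-missing array, though, and the comparison becomes 3.5 versus 2.375 per element - a speedup of only about 1.5x.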

    I'll add these as issues to take up later, in the blogs

     

  • @Derp.  This question ranges more broadly than auto-vectorization.  And I'm not sure I follow the question exactly.  But, some points/questions:

    • "restrict" is a C99 keword.  However, several/most C++ compilers have provided an analogous construct for several years.  For MSVC, it's __restrict (or __declspec(restrict))
    • with that nit out of the way, your first example specifies dest and src with restrict.  The compiler is not obligated to do anything with this assertion, but it may.  And if you call vadd1 with arrays that overlap, partially or in total, you just broke the restrict contract.  So compiler behavior is undefined - you may get the answer you would like; you may get an answer you dislike!

    Maybe I've misunderstood?  Certainly, if the first example did not use restrict, then I could see our discussion would be very different.

  • "we don't get into the math behind auto-vectorization. However, if this is something that really interests you, then we can get Jim to do a lecture on the internals"


    Please do this! I would be very interested!

  • LostInSpacebar (AdityaG) OMG VISTA FTW LOLZ!!1one

    Keep the C++ vids coming! 

    P.S. I love how awesomely cheesy Charles is sometimes... VECTORIZE!

  • felix9felix9 the cat that walked by itself

    btw, I believe Ted Neward talked a lot about the 'programming languages renaissance' around 2008

  • @jimhogg - Thanks for the response.  That makes some sense.

    So I'm sure you've thought about this.  But seeing as how C++ is used heavily in so many tools, libraries, and applications, the performance improvements you're enabling with these techniques are trickling down to power efficiency.  So you and your team are doing more to save the planet than Al Gore.  That's gotta feel good.

  • Jim, Jim and Charles - thanks for a great talk! I'd love to hear more from the compiler makers.

    Can argument decoration aid the vectorizer? 

    Maybe decorating an argument as __declspec(align(16)) can make the vectorizer load registers with aligned instructions?

    Does decorating arguments with __restrict have *any* impact currently?   Can you give an example of the effect it has?

  • Dave Abrahams

    Thanks for the mention of this year's non-profit C++Now! conference. I just wanted to mention to everyone that this year, in addition to general talks on advanced C++ and the Boost-related material, we have three keynote speakers and one whole week of C++11 tutorials. It's going to be epic.

  • Charles (Welcome Change)

    Ofek_Shilon wrote:

    Jim, Jim and Charles - thanks for a great talk! I'd love to hear more from the compiler makers.

    Can argument decoration aid the vectorizer? 

    Maybe decorating an argument as __declspec(align(16)) can make the vectorizer load registers with aligned instructions?

    Does decorating arguments with __restrict have *any* impact currently?   Can you give an example of the effect it has?



    Jim Radigan has agreed to another (deep) interview on the VC backend compiler. Filming in May. Thanks Jim!

    C

  • fdsf

    Nice work, but: when did Intel release the MMX extensions to the x86 architecture? 1996! How come it has taken so long for this kind of optimization to be implemented? Are these optimizations going to be picked up any time soon by the CLR team?

  • How does this vectorization handle alignment problems? What happens if I have an array not aligned to 16 bytes - can it still generate code that uses SSE?

  • @Lrdx

    Already discussed above:  "Unaligned references?  We generate those versions of SSE instructions that support unaligned access"

    Plumbing alignment checks into the compiler is not straightforward.  For example, a caller may correctly align his array, arr say, on a 16-byte boundary, but then call a function and pass it an argument of &arr[1].  Suddenly the callee must handle a pointer that is no longer 16-byte aligned.  The callee can check alignment at runtime, ok; but, back at compile time, it has to weigh whether to generate SSE instructions that assume unaligned data, or SSE instructions that assume aligned data (faster) - resulting in two versions of the code.
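
    A tiny made-up example of the problem:

    __declspec(align(16)) float arr[100];   // caller aligns arr on 16 bytes

    void callee(float* p);                  // compiled separately; p's alignment unknown

    void caller()
    {
        callee(arr);      // p happens to be 16-byte aligned
        callee(&arr[1]);  // same callee, but p is now only 4-byte aligned
    }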

  • Diego Villagra

    Hi, where can I find a good C++ 2010 tutorial? I have some knowledge of programming in C++, but I would like to increase it. Is there any free C++ tutorial you provide online that I can check out to start learning more and, in the future, succeed at creating my applications?

    Thank you very much for all your time!!
    Diego Villagra

  • @jimhogg: (I'm assuming the compiler extension __restrict uses the same exact semantics as specified in C99.)

    T foo;
    vadd1(&foo, &foo, 1);

    is NOT undefined. The C99 standard doesn't care what's assigned to a restricted pointer, it only cares about tracking the flow of "expressions based on restricted pointers" when it comes time to dereference them. Pay attention to the variable s in vadd1. Using the semantics specified in C99, s is treated as if it were "an expression based on the restricted pointer dest" or "an expression based on the restricted pointer src".

    e.g. This code is fine:
    T * restrict a = &whatever;
    T * b = a;
    *a = *b;

    But this code is not:
    T * restrict a = &whatever;
    T * restrict b = a;
    *a = *b;

    Because in the first case, b is treated as "an expression based on the restricted pointer a". In the second case, b is a new restrict pointer (and this only matters at dereference time, not at assignment time, i.e. if the last statement *a = *b weren't there, both would be fine).

    In the vadd1 example, we assign s (which is NOT a restrict pointer but an expression based on a restrict pointer) using the ternary operator to make it "an expression based on the restricted pointer src" or "an expression based on the restricted pointer dest", which is a subtle trick that makes this code valid for exact overlap.

    So my 4 questions were:

    1. Are the members of the Visual C++ compiler team aware of the "exact overlap using restrict" idiom and understand the subtlety it's based on?
    2. Is the Visual C++ optimizer written to NOT transform:
      T * restrict a = whatever1;
      const T * restrict b = whatever2;
      const T * c = (a == b ? a : b);
      *a = *c;

      As if you had written:
      T * restrict a = whatever1;
      const T * restrict b = whatever2;
      const T * c = b;
      *a = *c;

      What I mean is, does the Visual C++ optimizer algebraically simplify the expression (a == b ? a : b) to just b? If so, this is an invalid transformation because even though the values compare equal, the "based on-ness" of c would change from "either based on a or based on b" to "based on b" which would be wrong for exact overlap.

      In other words, C99 added an implicit property to pointers called "based on-ness" which compilers can use to augment their alias analysis code. If a compiler aggressively optimizes using restrict, it must carefully keep track of an expression's "based on-ness" in its optimization passes and NOT assume that because 2 pointer expressions evaluate to the same value, their based on-ness is the same too.

    3. If the team is aware of this, do they recognize this code as the programmer intending to inform the compiler that the code they are working on only handles exact overlap or complete non-overlap?

      In other words, I expect the compiler to recognize this idiom, elide the expression (a == b ? a : b) and then generate fast code which handles both exact overlap and complete non-overlap (and does something undefined with partial overlap).
    4. Is PGO aware of this idiom? I know it's notorious for screwing up tricky but valid code.

    My questions basically boil down to me wondering a. does Visual C++ handle "based on-ness" correctly in its optimization passes (especially PGO) b. does Visual C++ correctly handle the exact overlap idiom c. does Visual C++ generate great code for the exact overlap idiom (and if it generates great code, is it because it correctly recognized the exact overlap idiom or is it because it botched an optimization pass and generated code assuming no overlap which just happened to also work with exact overlap)?

  • @Derp - sorry for the late reply.  Let me see if I am following you right:

    If you are asking whether MSVC gets the right answer for the following snippet, the answer is yes - for both a Debug (/Od) and Release (/O2) build.  ie, it correctly handles both no-overlap, and exact-overlap.

    I'm not sure whether you are concerned that MSVC produces wrong answer in the presence of __restrict (we don't know of any).  Or whether we ignore opportunities for optimizations (as permitted by the standard) that __restrict makes possible?

    Jim

     

    int a[] = {1,2,3}; int b[] = {4,5,6};
    vadd1(a, b, 3); vadd1(a, a, 3); vadd1(b, b, 3);

     
  • Robin Davies

    Great feature! I'd very much like to see some breakdown on what can and cannot be vectorized. Conditionals? Strictly linear arithmetic? How far can this be pushed?

  • @Robin Davies:

    I've started a blog that discusses auto-vectorization in more depth.

  • @jimhogg:

    > If you are asking whether MSVC gets the right answer for the following snippet, the answer is yes - for both a Debug (/Od) and Release (/O2) build.  ie, it correctly handles both no-overlap, and exact-overlap.

    Okay, but did VC generate code which handles exact overlap because the compiler detected the exact-overlap idiom, or did the compiler generate code based on the assumption that the 2 arrays don't overlap at all, which just happens to work for exact overlap by coincidence?

    You can't answer this question by examining the assembly generated. You can only answer it by examining the passes inside the compiler and/or by asking the team members who implemented these passes if they're aware of the exact overlap idiom and if the passes they wrote implemented it correctly.

    e.g. if you ask a random compiler team member, "what is vadd1 supposed to do or why it was written that way?" could they answer "it adds arrays which either overlap exactly or don't overlap at all" and fully understand why?

    > I'm not sure whether you are concerned that MSVC produces wrong answer in the presence of __restrict (we don't know of any).  Or whether we ignore opportunities for optimizations (as permitted by the standard) that __restrict makes possible?

    My concern is that the standard uses very tricky wording and the exact overlap idiom is not that obvious, so I'm wondering if VC generated the correct code by design or by coincidence. I'm also trying to make the VC team members aware of this subtlety in the standard so they can continue to generate fast and correct code in the future.

  • Are there any plans for auto-vectorization in .NET JIT?



Comments Closed

Comments have been closed since this content was published more than 30 days ago, but if you'd like to continue the conversation, please create a new thread in our Forums,
or Contact Us and let us know.