louisl

Niner since 2006


  • Louis Lafreniere: Next Generation Buffer Overrun Protection with /GS++

    Plain GS frames are pretty easy to find in disassembly; they look like:

    sub     esp, 16
    mov     eax, DWORD PTR ___security_cookie
    xor       eax, ebp|esp
    mov     DWORD PTR __$ArrayPad$[ebp|esp], eax

    The scheduler can sometimes interleave other instructions in there.  EH frames are quite a bit trickier to find when compiled for size however, because we use helper calls (like __EH_prolog3_GS) to set up/unlink the frames.  But you could look for the helper code in the image (there are multiple versions to look for), and search for calls to it.  Depending on coding style though, GS frames can be pretty rare...  Some code doesn't need stack buffers or local structs.  So not finding one doesn't mean the code isn't compiled with /GS.

          -- Louis Lafreniere

  • Louis Lafreniere: Next Generation Buffer Overrun Protection with /GS++

    Microsoft does have an internal tool which groups are required to run before shipping binaries.  This tool ensures several things, and one of these is that /GS was enabled on each module.  It also requires the binaries to be compiled by a certain minimum compiler version.  So once Dev10 ships and the tool sets the minimum bar to Dev10, it will guarantee all Microsoft products are compiled with /GS++.

    This tool isn't available externally AFAIK, but someone could easily write their own.  The tool looks at the .pdb file.  Using DIA, you could verify that each module has /GS enabled using IDiaSymbol::get_hasSecurityChecks().
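    A hypothetical, untested sketch of what such a checker could look like (Windows-only; "app.pdb" is a placeholder path, and error handling is minimal).  Note that the /GS flag is recorded on the compiland-details symbols in the PDB:

```cpp
#include <dia2.h>      // DIA SDK, ships with Visual Studio
#include <cstdio>

int main() {
    CoInitialize(nullptr);

    IDiaDataSource* source = nullptr;
    if (FAILED(CoCreateInstance(CLSID_DiaSource, nullptr, CLSCTX_INPROC_SERVER,
                                __uuidof(IDiaDataSource), (void**)&source)))
        return 1;
    if (FAILED(source->loadDataFromPdb(L"app.pdb")))   // placeholder
        return 1;

    IDiaSession* session = nullptr;
    IDiaSymbol* global = nullptr;
    source->openSession(&session);
    session->get_globalScope(&global);

    IDiaEnumSymbols* compilands = nullptr;
    global->findChildren(SymTagCompiland, nullptr, nsNone, &compilands);

    IDiaSymbol* compiland = nullptr;
    ULONG fetched = 0;
    while (SUCCEEDED(compilands->Next(1, &compiland, &fetched)) && fetched == 1) {
        BSTR name = nullptr;
        compiland->get_name(&name);

        // /GS is reported via the compiland-details children
        IDiaEnumSymbols* details = nullptr;
        compiland->findChildren(SymTagCompilandDetails, nullptr, nsNone, &details);
        IDiaSymbol* detail = nullptr;
        while (SUCCEEDED(details->Next(1, &detail, &fetched)) && fetched == 1) {
            BOOL hasGS = FALSE;
            detail->get_hasSecurityChecks(&hasGS);
            wprintf(L"%s: /GS %s\n", name, hasGS ? L"on" : L"off");
            detail->Release();
        }
        details->Release();
        SysFreeString(name);
        compiland->Release();
    }
    return 0;
}
```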

            -- Louis Lafreniere

  • Ale Contenti and Louis Lafreniere: Understanding Exceptions and When/How to Handle Them

    The problem with adding a static EH model on x86 is that the new code wouldn't be able to interact with code compiled by a previous VC compiler, or from another compiler vendor.

    For a static EH model, you need the ability to unwind the stack 100% reliably.  The loose calling conventions defined by x86 Win32 do not make this possible.  The debugger certainly tries, but it can't do it in 100% of the cases.  We'd need to add some new rules and info to the image to allow this, but any code not following these rules couldn't be unwound.  This in my mind diminishes the value of the feature.

    For a static model to work, it needs to be defined in the ABI from the start, like it was for Win64 on IA64 and x64.

    BTW, static EH models aren't quite zero-cost.  The possibility of exceptions in an application adds new possible control flow which affects optimizations.  You could call it low-cost!

    Louis Lafreniere
  • Ale Contenti and Louis Lafreniere: Understanding Exceptions and When/How to Handle Them

    I'll try to answer your questions as best as I can:

    Is some big buffer allocated into which they are placed, on a per-stack basis (in which case, who allocates it)?

    When an object is thrown by value, the compiler emits a call to its copy constructor to copy it to the argument stack of throw().  If the catch catches by value, the object argument of the throw will then be copy-constructed by the CRT onto the argument stack of the catch.

    What if I stupidly use CreateThread instead of _beginthreadex?

    Ale could probably better answer this one, but in general you should always use _beginthread or _beginthreadex when using the CRT to make sure the CRT internal data structures are properly initialized.  I don't recall that EH uses any global data structure, but I could certainly be wrong here.

    Is the exception constructed first and copied there (or else what happens when an exception is thrown during exception object construction - when is the exception considered thrown)?

    I think I've answered the first part of your question.  For the second part, the exception is considered "thrown" once the throw() has called RtlRaiseException().  So in the scenario above, if the first copy-ctor throws, the first exception had not been officially thrown yet, so the new exception takes over.  If the second copy-ctor throws (copying to the catch), the CRT will catch the exception and call terminate().

    How big can an exception object be?

    As long as it can fit on the stack I believe!  That's if you throw and/or catch by value of course. 

    If I do something silly to increase the alignment of a class, is that handled?

    Since we don't support __declspec(align) on parameters, the answer is no.

    How do exception frames work within exception handlers? And so on.

    Are you asking about how we handle the case where there is a try/catch inside a catch block?  The catch handler actually shares the same EH frame/tables as the parent function.  On Win64, the handler of course has separate unwind entries, but does share the IP (instruction pointer that is Smiley ) to EH state table.

    I hope this helps!

    EH is a very broad subject and there are many facets to it which could each be a talk on its own: perf, usability and best practices, security, under the hood implementation, etc.  I think we tried to cover the most important parts of each of these (it's true that we didn't touch security though), but we do need to keep these videos down to a reasonable length.

    -- Louis Lafreniere

  • Ale Contenti and Louis Lafreniere: Understanding Exceptions and When/How to Handle Them

    Yes, the EH state is stored as 32 bits on Win32.  You need a new state for each new C++ object constructed, and for each new try block you enter.  If you overflowed that state, things would go very wrong.

    However, there are many other limits you would hit before hitting this one.  The compiler has limits on how big a function can be, how many objects it can have, how many curlies can be opened, etc. 

    Even if you could compile your program, you would run out of stack space to store all these objects, and your program wouldn't fit in memory (a state update is more than a byte). Smiley

    -- Louis Lafreniere

  • Louis Lafreniere - VC++ backend compiler

    Yes, JIT throughput is very important; still, instruction selection is quick to do, and this would be quite appropriate for a JIT.  The win though wouldn't be very big, and I could be wrong but I don't believe our JITs do any optimization dependent on the host CPU.

    We are currently working on the high level optimizations right now on Phoenix, and will tune the low level machine dependent code generation later on.  This is certainly something we'll consider if we see opportunities.

       -- Louis Lafreniere
  • Louis Lafreniere - VC++ backend compiler

    Hi Bill,
    We are working very closely with Intel and AMD to stay on top of the latest architecture changes, and adjust/tune the compiler accordingly.

    We've stopped giving customers the ability to pick which particular chip flavor they want to directly target, since most people want their apps to run fast on the variety of chips on people's desks at that time.  So instead, we try to tune the compiler for the set of chips we think will be dominant not only after we ship, but after our customers ship their own apps.  So this usually means the current chip that Intel/AMD is working on, plus the current shipping generation, and maybe the one before that as well.  We do provide the /arch:SSE and /arch:SSE2 switches to enable the compiler to use these new instructions (as well as CMOV), but the generated program will not run on the older architectures which don't support these.

    Tuning the generated code (or your assembly code) is a lot harder than it used to be, mainly because of out-of-order execution.  Back in the 386/486 and even first-generation Pentium days, we used to be able to pick up the instruction manual and figure out exactly how many cycles a particular instruction sequence would take, but you can't do that anymore.  You need to know how the machine works and identify the patterns that might cause problems in the out-of-order execution.

    As far as runtime detection of the architecture we run on, the CRT does look at it and take advantage of the SSE/SSE2 instructions when available to speed up some computations, and to move larger chunks of memory at a time.  The generated code from the compiler doesn't do this however.  Doing so would cause a lot of code duplication, and our experience has shown that code size is very important for medium to large apps.

    -- Louis Lafreniere
  • Louis Lafreniere - VC++ backend compiler

    Hi Pierre,
    Glad you liked the interview.  

    Memory speed has not kept up with CPU speed increases in the past few decades, so memory latency has become a big bottleneck.  There are currently two different ways of approaching this problem.

    One is to rely on the hardware to dynamically figure out the dependencies between instructions, and allow them to execute out-of-order as soon as their inputs are ready.  This is the approach used by most chips today.

    The IA64 took a different approach, adding flexibility to the instruction set to give compilers the tools to schedule instructions so that loads can be executed far from their uses.  For example, for "if (x) { y = *p; }", the compiler would normally not be able to hoist the load of *p outside of the if(), in case the guard was protecting a load that would fault.  IA64 provides a way to hoist this load anyway and defer the exception until you get inside the if().  If you never enter the if(), no exception is generated.

    For "*q = x; y = *p;", the compiler would also not normally be able to hoist the *p load above the *q store, in case they point to the same address.  The IA64 however provides a way to do this load ahead of time, and then check at the y= whether the load was invalidated by the intervening store.

    Branch misprediction is also a problem for CPUs with deep pipelines.  But IA64 instructions can be conditionally executed based on true/false register predicates, which allows us to generate straight-line code for if/else constructs if we want, avoiding the chance of mispredicted branches.

    This approach does avoid a lot of the complexity of the out-of-order execution, but these tools themselves do add a lot of complexity as well.

    The belief back when the IA64 was designed was that x86 speed was approaching its peak, that out-of-order execution wouldn't be enough to avoid the memory bottleneck, and that they couldn't keep cranking up the clock speed on x86.  The thought was that they would be able to crank it up higher on IA64.

    But doing a good job at generating code for IA64 is a very hard problem.  Using these "tools" isn't usually free, and so they involve a lot of trade-offs.  Profile-guided optimization does provide a lot of info to the compiler to help make these decisions, but it is still very hard to take full advantage of the machine.

    -- Louis Lafreniere