Posted By: Charles | May 11th, 2006 @ 1:24 PM | 43,916 Views | 13 Comments
Louis Lafreniere has been a developer on the VC++ compiler team for a long time; 15 years, to be exact. Specifically, Louis works on the backend compiler. What's a backend compiler? How's it evolved over the years? Where's it going? Watch and listen. Good stuff.

Cool, rather interesting :)

I must say, I enjoy these compiler videos. Keep 'em coming :D

Great interview!

Concerning the IA64 architecture, there was a mention that the
compiler has to do more of the smart work to optimize code layout.
What is the reasoning behind this change? Is it about making the
architecture simpler (assuming it is more complex in other
respects)?

I also appreciate the improvements in back-end code generation for
VC++. It is nice to see a video like this, as there are good
surprises in code generation that we could otherwise only discover
by stepping through the disassembly window.

Additions to the language or new libraries change the way we
write code, but discovering new optimizations really gives a
different perspective. For example, eliminating the copy of an
object returned from a function lets us write code that makes
much more use of automatic variables (and therefore frees us
from a lot of pointer management).
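
As a minimal sketch of the return-by-value style described above (the function name `make_stages` is made up for illustration and does not come from the video), the local container below is an automatic variable, and the compiler is allowed to construct it directly in the caller's storage instead of copying it on return:

```cpp
#include <string>
#include <vector>

// Hypothetical helper: builds a container and returns it by value.
// No new/delete and no out-parameter pointer are needed.
std::vector<std::string> make_stages() {
    std::vector<std::string> stages;  // automatic variable
    stages.push_back("front end");
    stages.push_back("back end");
    return stages;  // the copy of the returned object may be elided
}
```

A caller can then write `std::vector<std::string> s = make_stages();` and let the compiler decide whether a copy is actually needed.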

I guess that someone who was writing C++ code 10 or 15 years
ago and is still doing so now would certainly have the feeling
of using a different language, even though it's still C++.

By the way, as more developers become familiar with the C# coding
style, more and more C++ classes may end up being written entirely
in a header rather than as the usual .h/.cpp pair. If a Visual
Studio guy reads this, it would be nice to factor this into the
smart indent.



Sven Groot
My name has 9 letters. Coincidence? I think not...
Interesting video. One thing though, Charles: in the beginning your wording kind of implies that this frontend/backend setup is something unique to C++, while in fact nearly every compiler works this way. Heck, I wrote a compiler for a subset of Pascal in third-year Computer Science, and even that had a separate frontend and backend. :) I'm sure you didn't mean it like that though, it just sounded that way.

At the end you talked about making the compiler multithreaded. I think it's worth mentioning that although the compiler in VS2005 isn't, msbuild is. If you have a solution with more than one project, msbuild/VS2005 will build more than one project at the same time (where the projects' dependencies allow it), based on the number of CPUs in your system.
Hi Pierre,
Glad you liked the interview.  

Memory speed has not kept up with CPU speed increases in the past few decades, so memory latency has become a big bottleneck.  There are currently two different ways of approaching this problem.

One is to rely on the hardware to dynamically figure out the dependencies between instructions, and allow them to execute out-of-order as soon as their inputs are ready.  This is the approach used by most chips today.

The IA64 took a different approach, adding flexibility to the instruction set to give compilers the tools to schedule instructions so that loads can be executed far ahead of their uses.  For example, for "if (x) { y = *p; }", the compiler would normally not be able to hoist the load of *p outside of the if(), in case the if() was protecting the load from causing an exception.  IA64 provides a way to hoist this load and defer the exception until you get inside the if().  If you never get there, no exception is generated.
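
The hoisting described above can be modeled in plain C++. This is only a rough software analogy, not IA64 code: the names `speculative_load`, `guarded`, and `hoisted` are hypothetical, a null pointer stands in for any faulting address, and an empty `std::optional` (C++17) stands in for the deferred exception.

```cpp
#include <optional>

// Models a speculative load: instead of faulting on a bad address
// (here: a null pointer), it returns "no value" and defers the fault.
std::optional<int> speculative_load(const int* p) {
    if (p == nullptr) return std::nullopt;  // fault recorded, not raised
    return *p;
}

// Source form: the load is protected by the condition.
int guarded(bool x, const int* p, int y) {
    if (x) { y = *p; }
    return y;
}

// Hoisted form: the load is started before the branch. The deferred
// fault only surfaces if control actually reaches the guarded use.
int hoisted(bool x, const int* p, int y) {
    std::optional<int> early = speculative_load(p);  // issued early
    if (x) {
        y = early.value();  // throws std::bad_optional_access if the load "faulted"
    }
    return y;
}
```

With `x` false and a null `p`, `hoisted` returns `y` unchanged and nothing is thrown, which mirrors the "if you never get there, no exception is generated" behavior.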

For "*q = x; y = *p;", the compiler would also not normally be able to hoist the *p load above the *q store, in case they point to the same address.  The IA64, however, provides a way to do this load ahead of the store, and then check at the y= whether the load was invalidated by the intervening store.
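
The same kind of software analogy works for the load-ahead case. In this sketch (function names are hypothetical) an explicit pointer comparison stands in for the hardware check; on the real machine the check is done by the processor, not by comparing pointers in user code.

```cpp
// Source form: store through q, then load through p.
int in_order(int* q, const int* p, int x) {
    *q = x;
    return *p;
}

// Hoisted form: the load is moved above the store, then validated.
int load_ahead(int* q, const int* p, int x) {
    int y = *p;     // load issued early, before the store
    *q = x;         // the store may or may not touch the same address
    if (q == p) {   // stand-in for the check: did the store invalidate the load?
        y = *p;     // recovery: redo the load to pick up the stored value
    }
    return y;
}
```

Both functions return the same value whether or not `p` and `q` alias; the hoisted form just pays the recovery cost only in the aliasing case.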

Branch misprediction is also a problem for CPUs with deep pipelines.  But IA64 instructions can be set to execute conditionally based on true/false predicate registers, which allows us to generate straight-line code for if/else constructs if we want, avoiding the chance of mispredicted branches.
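
To illustrate the idea of straight-line code for an if/else, here is a small C++ sketch. It is an analogy only: IA64 predicates apply per instruction in hardware, whereas this version uses a bit mask, and a compiler is free to turn either function into a branch or a conditional move. The function names are made up for the example.

```cpp
// Branchy form: control flow depends on c.
int select_branch(bool c, int a, int b) {
    if (c) {
        return a;
    } else {
        return b;
    }
}

// Straight-line form: both inputs are available and the result is
// chosen by a mask, so there is no branch to mispredict.
int select_straight(bool c, int a, int b) {
    unsigned mask = 0u - static_cast<unsigned>(c);  // all ones if c, else zero
    unsigned bits = (static_cast<unsigned>(a) & mask) |
                    (static_cast<unsigned>(b) & ~mask);
    return static_cast<int>(bits);  // assumes the usual two's-complement conversion
}
```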

This approach does avoid a lot of the complexity of the out-of-order execution, but these tools themselves do add a lot of complexity as well.

The belief back when the IA64 was designed was that x86 speed was approaching its peak, that out-of-order execution wouldn't be enough to avoid the memory bottleneck, and that they couldn't keep cranking up the clock speed on x86.  The thought was that they would be able to crank it up higher on IA64.

But doing a good job of generating code for IA64 is a very hard problem.  Using these "tools" usually isn't free, so they involve a lot of trade-offs.  Profile-guided optimization does provide a lot of information to help the compiler make these decisions, but it is still very hard to take full advantage of the machine.


-- Louis Lafreniere
billh
call -141

Again, great video. More! You should interview some assembly language people...I would like to hear about the differences and changes over the years in the Pentium architecture and how your teams have adapted to that on very low levels. You kind of hit on that a bit with the multicore discussion here. I've thought a lot about getting back into some assembly programming just for fun (I did a fair amount of it back in the days of the 6502 chips), but am wondering how easy that will be considering the optimization that occurs on the chip itself, the caches, etc.

Question: how do you target your compiler for different Pentium architectures? From what I remember, Intel seems to alter a few instructions with every generation (from the Pentium to the Pentium II, on up to the current ones). Does your compiler recognize the user's chip and pick the best optimization? How about for programs that are shipped? How do those recognize the user's chip? Or do you not take advantage of the latest additions made by Intel?

Unfortunately, I do not own a copy of Visual Studio, so maybe those are options in the IDE, I don't know.

Hi Bill,
We are working very closely with Intel and AMD to stay on top of the latest architecture changes, and adjust/tune the compiler accordingly.

We've stopped giving customers the ability to pick which particular chip flavor they want to directly target, since most people want their apps to run fast on the variety of chips on people's desks at that time.  So instead, we try to tune the compiler for the set of chips we think will be dominant not only after we ship, but after our customers ship their own apps.  This usually means the chip that Intel/AMD is currently working on, plus the current shipping generation, and maybe the one before that as well.  We do provide the /arch:SSE and /arch:SSE2 switches to enable the compiler to use these new instructions (as well as CMOV), but the generated program will not run on older architectures that don't support them.
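
For readers who want to see which of these switches a given build was compiled with, the sketch below uses MSVC's predefined `_M_IX86_FP` macro, which on 32-bit x86 builds is 0 when no SSE is assumed, 1 for /arch:SSE, and 2 for /arch:SSE2 or higher. The function name `assumed_arch` is made up for illustration, and on other compilers or targets the macro is simply not defined.

```cpp
// Reports which instruction set the compiler was allowed to assume,
// based on the /arch switch used for this translation unit.
const char* assumed_arch() {
#if defined(_M_IX86_FP) && _M_IX86_FP >= 2
    return "SSE2";                  // built with /arch:SSE2 or higher
#elif defined(_M_IX86_FP) && _M_IX86_FP == 1
    return "SSE";                   // built with /arch:SSE
#elif defined(_M_IX86_FP)
    return "no SSE assumed";        // value 0
#else
    return "unknown (not a 32-bit x86 MSVC build)";
#endif
}
```

This is a compile-time answer only; it says nothing about the CPU the program later runs on.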

Tuning the generated code (or your assembly code) is a lot harder than it used to be, mainly because of out-of-order execution.  Back in the days of the 386/486 and even the first-generation Pentiums, we used to be able to pick up the instruction manual and figure out exactly how many cycles a particular instruction sequence would take, but you can't do that anymore.  You need to know how the machine works and identify the patterns that might cause problems in the out-of-order execution.

As for runtime detection of the architecture we run on, the CRT does look at it and takes advantage of the SSE/SSE2 instructions when available to speed up some computations, and to move larger chunks of memory at a time.  The generated code from the compiler doesn't do this, however.  Doing so would cause a lot of code duplication, and our experience has shown that code size is very important for medium to large apps.
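
A minimal sketch of what runtime detection and dispatch can look like in user code follows. It is not the CRT's implementation: `cpu_has_sse2`, `copy_bytewise`, `copy_chunked`, and `pick_copy` are hypothetical names, and `copy_chunked` is only a placeholder for a routine that would really use SSE2 instructions. The detection uses `__cpuid` on MSVC (leaf 1, EDX bit 26 is the SSE2 flag) and `__builtin_cpu_supports` on GCC/Clang, and falls back to "not supported" elsewhere.

```cpp
#include <cstddef>
#include <cstring>
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
#endif

// Returns true when the CPU reports SSE2 support.
bool cpu_has_sse2() {
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    int regs[4] = {0, 0, 0, 0};
    __cpuid(regs, 1);                    // leaf 1: feature flags
    return (regs[3] & (1 << 26)) != 0;   // EDX bit 26 = SSE2
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("sse2") != 0;
#else
    return false;                        // unknown target: use the generic path
#endif
}

using CopyFn = void (*)(unsigned char*, const unsigned char*, std::size_t);

// Generic path: one byte at a time.
void copy_bytewise(unsigned char* dst, const unsigned char* src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) dst[i] = src[i];
}

// Placeholder for a tuned path: moves 16-byte chunks, then the tail.
void copy_chunked(unsigned char* dst, const unsigned char* src, std::size_t n) {
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) std::memcpy(dst + i, src + i, 16);
    for (; i < n; ++i) dst[i] = src[i];
}

// Picks one implementation at run time, based on the CPU.
CopyFn pick_copy() {
    return cpu_has_sse2() ? copy_chunked : copy_bytewise;
}
```

Note that both routines have to be present in the binary, which is the code duplication cost mentioned above.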

-- Louis Lafreniere
louisl wrote:


As for runtime detection of the architecture we run on, the CRT does look at it and takes advantage of the SSE/SSE2 instructions when available to speed up some computations, and to move larger chunks of memory at a time.  The generated code from the compiler doesn't do this, however.  Doing so would cause a lot of code duplication, and our experience has shown that code size is very important for medium to large apps.

-- Louis Lafreniere


How interesting. One would think the JIT should be able to take
advantage of runtime detection of the hardware to generate code
specific to the current processor. Still, as Brandon Bray pointed
out, the JIT has stricter time constraints than a regular
compiler, and therefore cannot spend too much time optimizing.
One could also wonder how much this would affect performance in
general, as most of the time the difference should be small. (?)

Are these considerations part of the Phoenix project?
