Great interviews!
Definitely, more compiler guys in the future sounds fantastic! 
I'll use this opportunity to ask a few questions -- feel free to sneak them into future interviews, although answers in the comments would be great too, of course! I'll try to tie each one to the video for better context.
0. Loop vectorization limitations and requirements (the example for-loop; also reduce/sum around 26:50)
Are there any divisibility requirements? XMM registers are 128 bits wide, so what happens when the data length isn't a multiple of the vector width? Is there support for automatic duplication/truncation/padding, are there other techniques to handle this, or is such an irregular loop (i.e., one whose data length doesn't satisfy the divisibility property) excluded from the optimizer's consideration for now?
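To make the question concrete, here is a hand-written sketch of the pattern I have in mind: a main loop that consumes four floats per step (one XMM register's worth) followed by a scalar tail for the leftovers. It is plain C++ so it compiles anywhere; the name sum_blocked and the four-accumulator layout are mine, purely for illustration, and not a claim about what AutoVec actually emits.

```cpp
#include <cstddef>

// Hypothetical illustration: sum n floats four "lanes" at a time,
// then handle the n % 4 leftover elements one by one.
float sum_blocked(const float* data, std::size_t n)
{
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f}; // stands in for one XMM register
    const std::size_t vec_end = n - (n % 4);  // largest multiple of 4 that is <= n

    // Main loop: the part a vectorizer could map onto 128-bit adds.
    for (std::size_t i = 0; i < vec_end; i += 4)
        for (std::size_t k = 0; k < 4; ++k)
            lane[k] += data[i + k];

    // Horizontal reduction of the four partial sums.
    float total = (lane[0] + lane[1]) + (lane[2] + lane[3]);

    // Scalar tail: elements that did not fill a whole block.
    for (std::size_t i = vec_end; i < n; ++i)
        total += data[i];

    return total;
}
```

Note that the blocked version adds the elements in a different order than a plain left-to-right loop, so for floats the result can differ in the last bits. I'd be curious whether that plays into the reduce/sum case too.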
1. Beyond {SSE(1), SSE2, AVX}; in particular, SSE4.1 (mentioned around 22:50)
a. Are there any plans for SSE4.1 support?
I admit I'm asking since I have a vested interest here: numerical linear algebra is quite useful in what I do, and there are some interesting instructions in this set, such as:
DPPS, DPPD (dot product a.k.a. inner product) // they're useful for a lot of applications, in fact: http://www.virtualdub.org/blog/pivot/entry.php?id=150
// correspond to _mm_dp_ps, _mm_dp_pd intrinsics -- http://msdn.microsoft.com/en-us/library/bb514034%28v=vs.110%29.aspx
Intel did a mini-benchmark some time ago demonstrating a speed-up of the dot product computation in the examples; the SSE3 version (using HADDPS) was already 26% faster, while the SSE4 version (using DPPS) was 72% faster than the base case:
http://www.intel.com/technology/itj/2008/v12i3/3-paper/6-examples.htm
This makes SSE4.x very exciting, AutoVec support would be great here!
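For reference, this is roughly what using DPPS by hand looks like today through the _mm_dp_ps intrinsic. It's a minimal sketch: dot4 is a name I made up, and the __SSE4_1__ check is the macro GCC and Clang define when SSE4.1 is enabled (as far as I know MSVC does not define it, so there you would need your own switch). Without it, the function falls back to scalar code.

```cpp
#if defined(__SSE4_1__)
#include <smmintrin.h> // SSE4.1 intrinsics, including _mm_dp_ps
#endif

// Hypothetical helper: dot product of two 4-element float vectors.
float dot4(const float* a, const float* b)
{
#if defined(__SSE4_1__)
    const __m128 va = _mm_loadu_ps(a); // unaligned loads, no alignment assumed
    const __m128 vb = _mm_loadu_ps(b);
    // Mask 0xF1: the high nibble (0xF) includes all four products in the sum,
    // the low nibble (0x1) writes the sum into the lowest lane only.
    return _mm_cvtss_f32(_mm_dp_ps(va, vb, 0xF1));
#else
    // Scalar fallback when SSE4.1 is not enabled at compile time.
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
#endif
}
```

The obvious wish is for the compiler to get from the scalar branch to the DPPS branch on its own.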
b. Another, perhaps more far-reaching question -- if/when this gets supported, will there be any integration with the STL, so that, say, std::inner_product would automatically make use of the above instructions where applicable?
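The kind of call site I'm thinking of is as plain as this (a sketch; the wrapper name dot is mine, and it assumes b has at least as many elements as a):

```cpp
#include <numeric>
#include <vector>

// Hypothetical wrapper: dot product via the standard algorithm.
// Assumes b.size() >= a.size().
float dot(const std::vector<float>& a, const std::vector<float>& b)
{
    // The initial value 0.0f also fixes the accumulator type to float;
    // passing a plain 0 would accumulate in int.
    return std::inner_product(a.begin(), a.end(), b.begin(), 0.0f);
}
```

Nothing in the source says "SIMD" here, which is exactly why it would be nice if the library/compiler combination could pick the right instructions.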
2. Comparison of the current features and future evolution thereof with GCC (benchmarking w/ GCC mentioned around 41:50)
Some topics of interest:
a. GCC Graphite comparison -- how does AutoVec fare in comparison?
Example: http://openwall.info/wiki/internal/gcc-local-build#Parallel-processing-with-GCC
Features/flags: -floop-parallelize-all -ftree-parallelize-loops=8
There's a nice discussion of some topics for GCC that could be interesting to relate to:
- limitations // http://gcc.gnu.org/wiki/Graphite/Parallelization
- behind the scenes // http://gcc.gnu.org/wiki/Graphite?action=AttachFile&do=view&target=graphite_lambda_tutorial.pdf
What were the analogous implementation choices and the resulting limitations in AutoVec?
// out of curiosity -- is the polyhedral model /* http://en.wikipedia.org/wiki/Frameworks_supporting_the_polyhedral_model */ also employed by AutoVec or is it something different here?
b. Profile Mode // "Goal: Give performance improvement advice based on recognition of suboptimal usage patterns of the standard library."
This is actually pretty cool and integrates nicely with the C++ STL -- e.g., if you try a sub-optimal insertion pattern with std::vector, you'll get nice, human-readable advice suggesting std::list:
http://gcc.gnu.org/onlinedocs/libstdc++/manual/profile_mode.html#manual.ext.profile_mode.using
Is a similar feature planned?
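As an illustration of the kind of pattern I mean (I haven't checked whether profile mode flags this exact snippet, and both function names are made up): repeatedly inserting at the front of a std::vector shifts every existing element, while std::list does the same job in constant time per insertion.

```cpp
#include <list>
#include <vector>

// Hypothetical example of a sub-optimal pattern: front insertion into a vector.
std::vector<int> fill_front_vector(int n)
{
    std::vector<int> v;
    for (int i = 0; i < n; ++i)
        v.insert(v.begin(), i); // shifts all existing elements: O(size) per insert
    return v;
}

// The same sequence built with a list.
std::list<int> fill_front_list(int n)
{
    std::list<int> l;
    for (int i = 0; i < n; ++i)
        l.push_front(i); // O(1) per insert
    return l;
}
```

Both produce the same sequence; only the cost profile differs, which is the sort of thing a tool can spot more reliably than a code review.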
3. Compilation back-end parallelization and inlining
Does the parallel compilation work with inlining? For instance, in the discussed case of the {main-foo-a1, main-bar-a2} call tree (around 39:20), if "foo" gets inlined or "a2" gets inlined (note the depth change), does the compiler have to recompile it in either of these cases?
4. Devirtualization (around 40:50) -- limits/changes.
Some devirtualization was already available a while ago: http://msdn.microsoft.com/en-us/magazine/cc301407.aspx
What are the most interesting changes in the current release / what limits have been pushed / what limits remain?
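To pin down what I mean by "limits", here is a small sketch with two call sites (all names are hypothetical, and it uses C++11 final/override): in the first, the dynamic type is visible right there; in the second, the compiler only sees a reference to the base class.

```cpp
struct Shape {
    virtual ~Shape() {}
    virtual int sides() const { return 0; }
};

struct Square final : Shape {
    int sides() const override { return 4; }
};

// The object is created locally, so the dynamic type is known at the call site
// and the virtual call is a candidate for a direct (or inlined) call.
int sides_known_type()
{
    Square sq;
    const Shape& s = sq;
    return s.sides();
}

// Here the dynamic type is not visible without more context
// (inlining into the caller, whole-program information, etc.).
int sides_unknown_type(const Shape& s)
{
    return s.sides();
}
```

I'd love to hear which of these (and what in between) the current release handles.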
Once again, thanks for the great episode!