1) Yes, the real story would compute these floats in the low 32 bits of the XMM registers. See the How it Works blog post for an explanation that addresses this 'untruth'.
2) Yes, aliasing complicates the story. However, pointers don't always prevent vectorization. In general, the auto-vectorizer will include runtime checks against aliasing. Where possible - for example, via whole-program optimization - it can prove the absence of aliasing and elide the runtime checks. (I mentioned aliasing a couple of times in the blog; the next episode covers it in a little more depth)
3) In VS11, default floating-point calculations will use SSE instructions. Auto-vectorization will produce the same results as the scalar SSE instructions. (Yes, results might differ between 32-bit SSE and 80-bit x87. I should have been explicit - I was comparing the scalar SSE versus vector SSE versions of the program). In other cases, such as reductions, auto-vectorization CAN produce results different from scalar SSE code (due to non-associativity). But we only perform auto-vectorization for such cases under the /fp:fast flag.
4) I was speaking about this particular example - future compiler improvements should raise the speedup above 2.9X, heading towards 4X. For other loops there are, of course, many factors that limit the speedup. (The topic of a future blog post, already drafted but not yet published).
I'd encourage folks to read the auto-vectorization blog - it includes about 6 posts now, allowing more time to dig into details than the brief 15 minutes available in this video.