Very interesting discussion about a lot of things, and I loved the Jackie Stewart reference.  Oh boy does the very last minute of the talk ring true.

However, it seems Martin has missed the last 10 years or so of what's happened in C++ and I had a few bones of contention about the first 20 minutes:

I think it's a mistake to worry about only the "hot" code in your programme.  The rest of the programme will still compromise overall performance, by holding up your hot code or blowing away its cache (hello garbage collector).  Also, everything adds up in a large application, especially these days where programmers tend to glue together a lot of existing functionality.

There are several points to make about Martin's claim that a JIT can optimise to the current architecture (like the count leading zeroes example, or vectorisation):
- First of all, I have my doubts that JITs can actually do this that well; it strikes me as a theoretical argument ("Well, in principle, a JIT could ...").  But if a JIT can do it efficiently to bytecode, then so can microcode to an instruction set.  We'll see; we are talking about code that we write today to run on machines tomorrow.
- You don't have to target a particular architecture (SSEn, AVXn), you can check which instructions are supported at run time and branch appropriately.  In fact the compiler might even do that for you (but check first).
- If you "compile a binary and sell it as a product", these days with automatic online updates a customer can have the latest and greatest most efficient code for their system, if it matters that much.
- The example of code that is vectorised most successfully, memcpy, is telling.  It doesn't even apply to nicely written C/C++; you only need memcpy if you have constraints like immutable strings.
- Yes, a JIT compiler for interpreted/managed code can make optimisation decisions at run time, but profile-guided optimisation lets a native compiler use the same kind of run-time information.
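
To make the run-time check point concrete, here is a rough sketch of what I mean.  It assumes GCC or Clang on x86 (`__builtin_cpu_supports` and the `target` attribute are compiler extensions, not standard C++), and the names `add_baseline`, `add_avx2`, `select_add` and `add_arrays` are made up for illustration.  The same loop is compiled twice, once for the default target and once with AVX2 allowed, and the choice is made once at run time:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Baseline version: compiled for whatever the default target is.
static void add_baseline(const std::int32_t* a, const std::int32_t* b,
                         std::int32_t* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_DISPATCH 1
// Same loop, but the compiler may use AVX2 instructions in this one function.
__attribute__((target("avx2")))
static void add_avx2(const std::int32_t* a, const std::int32_t* b,
                     std::int32_t* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}
#endif

using AddFn = void (*)(const std::int32_t*, const std::int32_t*,
                       std::int32_t*, std::size_t);

// Pick an implementation based on what the CPU reports at run time.
static AddFn select_add() {
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return add_avx2;
#endif
    return add_baseline;  // other compilers/architectures fall back here
}

// Public entry point: the selection happens once, on first call.
void add_arrays(const std::int32_t* a, const std::int32_t* b,
                std::int32_t* out, std::size_t n) {
    assert(n == 0 || (a != nullptr && b != nullptr && out != nullptr));
    static const AddFn impl = select_add();
    impl(a, b, out, n);
}
```

Callers just use `add_arrays(a, b, out, n)` and never see which version ran.  Whether the AVX2 copy is actually faster depends on the compiler flags and the data, so measure before relying on it.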

Regarding dropping into C in the sensitive parts of a C++ programme because virtual functions ("polymorphic dispatch") are marginally slower than static dispatch: a good C++ coder should be able to use templates to implement so-called static polymorphism.  These function calls even have a good chance of being inlined.  Good C++ written like this can be faster than the straight C equivalent, which has to go through function pointers.
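
For anyone who hasn't seen static polymorphism, a minimal sketch using the curiously recurring template pattern (CRTP) looks something like this.  The `ShapeBase`, `Square`, `Rect` and `total_area` names are my own invention for illustration, not anything from the talk:

```cpp
#include <cassert>
#include <cstddef>

// CRTP base: area() forwards to the derived class at compile time,
// so there is no vtable and no indirect call.
template <class Derived>
struct ShapeBase {
    double area() const {
        return static_cast<const Derived*>(this)->area_impl();
    }
};

struct Square : ShapeBase<Square> {
    explicit Square(double s) : side(s) {}
    double area_impl() const { return side * side; }
    double side;
};

struct Rect : ShapeBase<Rect> {
    Rect(double w_, double h_) : w(w_), h(h_) {}
    double area_impl() const { return w * h; }
    double w, h;
};

// Generic code written against the interface.  The concrete type is known
// when the template is instantiated, so the compiler is free to inline area().
template <class S>
double total_area(const S* shapes, std::size_t n) {
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i) total += shapes[i].area();
    return total;
}
```

With `Square squares[] = {Square(2.0), Square(3.0)};`, calling `total_area(squares, 2)` resolves `area()` statically.  The trade-off is that each container holds one concrete type; if you genuinely need mixed types chosen at run time, you are back to virtual functions or something like them.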

Just holding up the native code end!