You bring up a good point. Because we generate code with this sequential translation approach, there are times when one optimization transformation exposes new opportunities for another. In general we call this the "Phase Ordering Problem". The traditional approach (and what we do) is to re-run the transformations that are profitable when there's compiler throughput budget to do it. (Of course we can't iterate forever, and in addition some of the problems won't converge.)
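To make the idea concrete, here's a toy sketch of that kind of pass driver. The three-address IR, the pass names, and the budget number are all invented for illustration; the point is just that constant propagation exposes work for dead-code elimination and vice versa, so the driver loops until nothing changes or the budget runs out.

```python
# Toy phase-ordering sketch: re-run a pass pipeline to a fixed point,
# bounded by a throughput budget. IR is a list of (dest, op, *args).

def const_prop(ir):
    """Fold 'add'/'mul' whose operands are already known constants."""
    known = {dst: args[0] for dst, op, *args in ir if op == 'const'}
    out, changed = [], False
    for dst, op, *args in ir:
        if op in ('add', 'mul') and all(a in known for a in args):
            a, b = known[args[0]], known[args[1]]
            out.append((dst, 'const', a + b if op == 'add' else a * b))
            changed = True
        else:
            out.append((dst, op, *args))
    return out, changed

def dce(ir):
    """Drop definitions whose result is never used."""
    used = {a for _, op, *args in ir if op != 'const' for a in args}
    out = [ins for ins in ir if ins[1] == 'ret' or ins[0] in used]
    return out, len(out) != len(ir)

def optimize(ir, budget=10):
    """Re-run the passes until a fixed point or the budget is spent."""
    for _ in range(budget):
        changed = False
        for p in (const_prop, dce):
            ir, c = p(ir)
            changed |= c
        if not changed:
            break
    return ir

ir = [
    ('t1', 'const', 2),
    ('t2', 'const', 3),
    ('t3', 'add', 't1', 't2'),   # foldable on the first pass
    ('t4', 'mul', 't3', 't3'),   # foldable only after t3 folds
    ('r',  'ret', 't4'),
]
print(optimize(ir))   # → [('t4', 'const', 25), ('r', 'ret', 't4')]
```

Note that the second multiply only folds on the second trip through the pipeline, which is exactly the phase-ordering effect: a single fixed pass order would have missed it.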
For the predicted cycles (and code size) of a particular instruction, we typically start getting a rough idea at lowering time, when we select a machine opcode. This becomes more concrete through register allocation and becomes very concrete, in compiler terms, at encoding time. The selected instruction has a processor-defined cost in terms of machine resources (e.g. execution units or slots in the out-of-order buffer), which are fixed; but then data dependencies (is the input value available at the right time, i.e. the schedule) and micro-architectural issues intrude. Finally, as you say, the cache trumps the other issues.

I think it's important to note here that we're getting new machines all the time with different micro-architectures, in fact much more quickly than new compilers, and the compiler needs to try to make a single executable that gives good performance across a spectrum of machines. So some of the instruction performance characteristics are hard to know ahead of time. Of course we work closely with our partners to make sure that future machines provide good performance for the output of Microsoft apps and tools.

Finally, with respect to predictability: if you know the machine, its micro-architecture, the working set, and the execution environment (OS and workload), you can in theory predict performance exactly (it's a digital computer, after all). In practice we model a "typical" case, maintain good engineering, and then do lots of benchmarking on real-world scenarios to ensure the performance of our output code.
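Here's a minimal sketch of how a cost model combines those fixed per-opcode costs with data dependencies. The latency tables and the one-issue-per-cycle assumption are invented for illustration; the same instruction sequence gets a different predicted cycle count depending on which hypothetical micro-architecture's table you plug in.

```python
# Toy cycle predictor: an instruction issues no earlier than its issue
# slot and no earlier than all of its operands are ready, so fixed
# opcode latencies and data dependencies interact.

LATENCY_A = {'add': 1, 'mul': 3, 'load': 4}   # invented micro-arch A
LATENCY_B = {'add': 1, 'mul': 3, 'load': 6}   # invented micro-arch B

def predicted_cycles(program, latency):
    """Earliest finish time, assuming one instruction issues per cycle."""
    ready = {}        # value -> cycle it becomes available
    next_slot = 0     # next free issue slot
    for dst, op, *srcs in program:
        issue = max([next_slot] + [ready.get(s, 0) for s in srcs])
        ready[dst] = issue + latency[op]
        next_slot = issue + 1
    return max(ready.values())

prog = [
    ('a', 'load', 'p'),
    ('b', 'mul', 'a', 'a'),   # stalls until the load result is ready
    ('c', 'add', 'b', 'a'),
]
print(predicted_cycles(prog, LATENCY_A))   # → 8
print(predicted_cycles(prog, LATENCY_B))   # → 10
```

Even this toy model shows why a single executable can't be optimal everywhere: the schedule that hides the load latency on one machine leaves stalls on another, and real machines add cache behavior on top.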