Rossj wrote:
Okee dokee spoke briefly to Jeremy and his line is essentially
"Don't trust benchmarks that are sponsored by vendors whether that be Microsoft OR the Samba team".
So here's a third party benchmark instead.
Fair 'nuf. Now here's the question that follows from this out-of-the-box comparison. Suppose that, out-of-the-box, one server implements a perf feature that is unsafe (like ignoring requests to commit data to disk), but the other one doesn't. 95% of the time you won't notice; 5% of the time you'll lose data (I'm making these numbers up). In that case, your benchmark will show that the server that doesn't flush is faster (it's doing fewer disk writes), but that's only because the server isn't following the contract.
Is that a fair test?
At one point, someone far smarter than I said "If I don't have to follow the specification, I can make a system arbitrarily fast".
Lots of people cheat on benchmarks. We once benchmarked an email system that out-of-the-box didn't commit email messages to disk when receiving them. That meant that in the event of a power failure, it might lose user email. But out-of-the-box, it was faster than any other email system out there. You had to turn on the "reliable email delivery" option to make it commit the messages to disk - at which point its performance moved in line with everyone else's performance.
So was an out-of-the-box comparison fair in this case?
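To make the "didn't commit to disk" trick concrete, here is a minimal sketch in Python. Everything in it is made up for illustration (the `store_messages` and `compare` names, the file names, the message sizes); it is not how any particular mail or file server works. It writes the same messages twice, once leaving them in the OS cache and once calling `os.fsync` after every message. On most hardware the second run is noticeably slower, but the size of the gap depends entirely on the disk, the filesystem and the OS, so no numbers are claimed here.

```python
import os
import tempfile
import time


def store_messages(path, messages, durable):
    """Write each message to path; if durable, force it to disk before continuing.

    Returns the elapsed wall-clock time in seconds.
    """
    start = time.perf_counter()
    with open(path, "wb") as f:
        for msg in messages:
            f.write(msg)
            if durable:
                f.flush()             # push Python's buffer down to the OS
                os.fsync(f.fileno())  # ask the OS to commit it to stable storage
    return time.perf_counter() - start


def compare(n_messages=200, size=1024):
    """Time both modes on identical data; returns (fast_seconds, durable_seconds)."""
    messages = [b"x" * size for _ in range(n_messages)]
    with tempfile.TemporaryDirectory() as d:
        fast = store_messages(os.path.join(d, "fast.mbox"), messages, durable=False)
        safe = store_messages(os.path.join(d, "safe.mbox"), messages, durable=True)
    return fast, safe


if __name__ == "__main__":
    fast, safe = compare()
    print(f"no fsync: {fast:.4f}s   fsync per message: {safe:.4f}s")
```

Both runs produce byte-for-byte identical files, which is exactly the problem: a benchmark that only checks throughput can't tell the two apart, and only a power failure can.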
Unless you understand WHY NetBench is showing that Samba performs better than W2K3, you can't understand why the perf difference happened.
For example, some tests are disk bound, others are network card bound. Still others are cache bound, and others are CPU bound. All of this means that you might not be measuring the relative performance of the file & print servers at all, but instead the relative performance of the drivers for the hardware in the machine. The problem with this is that your benchmark isn't repeatable on different hardware - on THIS particular set of hardware, one performs better than the other, but on a different set of hardware, with different drivers, the opposite might be true.
For instance, the write-up doesn't say whether the tests were performed on the same piece of hardware.
If it wasn't, how closely did they verify that the systems were identical? Things like chipset revisions can make huge differences in performance.
If it was, how did they isolate startup time effects? Did they vary the order of the tests to see if there were any effects?
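As a sketch of what "vary the order of the tests" can look like in practice, here is a small Python harness. The names (`run_in_random_order`, `mean_by_position`) and the idea of passing the workloads in as plain callables are assumptions for illustration only; this has nothing to do with how NetBench itself is driven. It runs every workload once per trial in a freshly shuffled order, and records where in the trial each run happened so that position effects (warm caches, startup costs) can be spotted.

```python
import random
import time


def run_in_random_order(benchmarks, trials=5, seed=0):
    """Run every benchmark once per trial, shuffling the order each trial.

    benchmarks maps a name to a zero-argument callable.
    Returns {name: [(position_in_trial, elapsed_seconds), ...]}.
    """
    rng = random.Random(seed)  # fixed seed so the schedule itself is repeatable
    results = {name: [] for name in benchmarks}
    names = list(benchmarks)
    for _ in range(trials):
        rng.shuffle(names)
        for position, name in enumerate(names):
            start = time.perf_counter()
            benchmarks[name]()
            results[name].append((position, time.perf_counter() - start))
    return results


def mean_by_position(samples):
    """Average the elapsed times of one benchmark, grouped by run position."""
    by_pos = {}
    for position, elapsed in samples:
        by_pos.setdefault(position, []).append(elapsed)
    return {pos: sum(v) / len(v) for pos, v in sorted(by_pos.items())}
```

If a workload is consistently slower when it runs first than when it runs second, that points at a startup or cache-warming effect rather than at a real difference between the things being compared.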
One comment they made was "NVidia Geforce FX 5600 (as if this matters)". Actually, it DOES matter - it can make a HUGE amount of difference. I remember some GDI benchmarks we were doing years ago. Two seemingly identical machines were reporting times that differed by 10%. We eventually ripped them apart and started swapping hardware. We finally realized that the difference was that one had one particular brand of network card plugged in, and the other had a different brand - the slowness tracked with one of the network cards.
It appears that the NT4 workstations were just random workstations pulled from their lab - did they ensure that the workstations were identical? I don't recall for certain, but I believe that NetBench measures performance on the client machine, which means that you need to keep your clients just as identical as the servers. I'm also surprised that they're saying that NT4 clients were faster than W2K or XP clients - it's entirely possible, but it's surprising.
They are also making assumptions about the number of clients. They're assuming that the test isn't bottlenecked on the client, so they're assuming it's ok to let the benchmark simulate multiple clients from a single machine. They measured small numbers of clients, and extrapolated that the results they saw would be relevant with large numbers of clients - this may be true, but it may not.
Bottom line: Benchmarking is hard. Really, Really Hard. If you REALLY don't know what you're doing, it's unbelievably easy to generate results that appear to say one thing but are instead effectively meaningless.
That's why if you look at real-world benchmark results, they typically spend more time describing their configuration than they do describing their results - because professional benchmarkers realize that even tiny changes to the configuration can have HUGE effects on the results, so they make sure that every possible variable has been accounted for to ensure that they're really measuring what should be measured.
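One cheap habit that follows from this is to never store a result without the configuration it came from. Here's a rough Python sketch of the idea; the function names and the shape of the record are my own invention. Note what it can't see: the standard library won't tell you the chipset revision, the brand of network card, or the driver versions, so those have to be written down by hand and passed in through the (hypothetical) `extra` argument.

```python
import json
import os
import platform
import sys


def describe_environment(extra=None):
    """Collect the machine details the standard library can see.

    extra is for everything it can't (NIC model, driver versions, chipset
    revision, server settings), supplied by hand by whoever runs the test.
    """
    record = {
        "os": platform.platform(),
        "machine": platform.machine(),
        "processor": platform.processor(),
        "cpu_count": os.cpu_count(),
        "python": sys.version.split()[0],
    }
    record.update(extra or {})
    return record


def report(results, extra=None):
    """Bundle results with their environment into one JSON document."""
    return json.dumps(
        {"environment": describe_environment(extra), "results": results},
        indent=2,
        sort_keys=True,
    )
```

Calling something like `report({"throughput": 1.0}, {"nic": "brand A"})` (values invented) gives you a single blob where the number and the setup that produced it can't get separated.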