To get reliable results, you need to perform multiple runs, preferably on several types of hardware. Run-to-run variance usually matters more than the precision of the timer, so measuring small differences in performance comes down to probability and statistics. We've found Student's t-test useful for deciding whether two sets of runs genuinely differ.
It's a great question, but any meaningful answer must be qualified with "compared with what?"
Tasks definitely have overhead compared with plain OS threads, and when the amount of work you're doing in a task is small, that overhead becomes more pronounced. Because tasks are built on top of other PPL/ConcRT constructs, they also carry overhead relative to those constructs. So one way to gauge efficiency is to solve the same problem using different constructs and compare the results.
To take one specific example, I can calculate Fibonacci numbers (using the naïve recursive algorithm) in a parallel_for loop from 0 to 100, then do the same by spawning 100 tasks that perform the same work, and compare the elapsed time of both solutions. The data I get on my laptop shows that the tasks-based solution is about 5% slower than the parallel_for-based solution. I'm not too worried about this kind of overhead because a) it's small and, more importantly, b) you would never do this in a real-world application: parallel_for is a better fit for this problem.
Some of our other performance tests show that the overhead can be quite significant, so more optimization is in order.
Now to my main point. For problems where PPL tasks offer a more productive programming model, their performance should be "good enough" so that you don't have to fall back to a less productive programming model, such as OS threads. If we have accomplished that, we have succeeded. If you want to use PPL tasks but are forced to use some lower-level constructs to get the performance you need, we have failed.