OK, I understand now. I don't know of a way to remove this overhead, std::function is horrible complex and the compiler has little chance to optimize it.

It seems to me that if you don't provide the function type in the template argument list (like in the original code or in your code) then the only alternative is to use std::function or something that does something similar: wraps a callable object. And this wrapping is complex because there are different types of callable objects: pointer to function, pointer to member, functors.