Tech Off Thread

39 posts

Why does my C program run faster on Linux than on Windows?

  • Shining Arcanine

    staceyw wrote:
    When I try to find this kind of thing, I try to eliminate as much as possible, then add stuff back until I find the issue.  Try running test on native OS in both cases w/o VM.  Also, verify your timer code is same in both cases. 


    I compiled a version of my program that had all of the multithreading code stripped by hand. It made no difference.

    I am not quite sure what you mean by "Try running test on native OS in both cases w/o VM."

    Also, the timer code is identical on both Ubuntu and Windows.

  • TimP

    Shining Arcanine wrote:
    
    I am not quite sure what you mean by "Try running test on native OS in both cases w/o VM."


    Run it without a virtual machine. The Ubuntu 6.10 disc can be used as a LiveCD. In this case though, the performance issues seem to be with the host operating system, so if anything running Linux on bare metal would only skew the numbers more.

    If you were having problems with the Linux side I would offer my help, but unfortunately I don't have much experience debugging or tuning unmanaged applications on Windows.

  • Shining Arcanine

    TimP wrote:
    

    Shining Arcanine wrote:
    
    I am not quite sure what you mean by "Try running test on native OS in both cases w/o VM."


    Run it without a virtual machine. The Ubuntu 6.10 disc can be used as a LiveCD. In this case though, the performance issues seem to be with the host operating system, so if anything running Linux on bare metal would only skew the numbers more.

    If you were having problems with the Linux side I would offer my help, but unfortunately I don't have much experience debugging or tuning unmanaged applications on Windows.



    The LiveCD is an ISO that I mount in a virtual optical drive; to run it natively, I would have to burn it. As you said, though, running Linux on bare metal would only skew the numbers more. It would, however, let me run my program in multithreaded mode, which would be really cool, especially on 64-bit Linux, where I would expect computation speeds to increase by a factor of 4 simply from the move to 64-bit processing.

  • RichardRudek

    I've read about similar sorts of perf issues with various STL containers (C++) and all of the bounds-checking that occurs by default using the new compilers.

    I'd expect there's a similar thing at work here (with your C code), so research that.

  • Shining Arcanine

    RichardRudek wrote:
    I've read about similar sorts of perf issues with various STL containers (C++) and all of the bounds-checking that occurs by default using the new compilers.

    I'd expect there's a similar thing at work here (with your C code), so research that.


    Bounds checking? C is not supposed to have bounds checking. That is a Java feature.

    Bounds checking could be related to the Buffer Security Check, which I have disabled. I am not sure if there are any other features that Microsoft put into its compiler that would do this sort of checking. Is anyone from the Visual Studio team reading this thread, and if so, do you have anything to say about any of the things mentioned in it?

  • Antitorgo

    I really don't think it is an OS issue (I could almost guarantee it). My guess is that it is a difference in the compilers. I'm not sure how much, if any, code was changed to port GMP to Win32, but there could definitely be something going on there.

    Any chance you could post the code somewhere for people to take a look at and see if they are seeing the same thing you are?

    As far as anyone from the Visual Studio team responding... I'm not sure if anyone from those teams visits Channel 9. So I wouldn't count on anyone from there replying to this thread. (It is possible, just not likely IMO)

  • RichardRudek

    Shining Arcanine wrote:
    
     I also applied all of the optimization flags that I applied to my project to the lib_gmp_gc project.


    I've never used this library, so I went and downloaded the bits, and then realised what you said above, which I have emboldened.

    The whole point of this library is that key parts of it are hand-optimised assembler. Yet you've elected to compile the Windows library using the generic C version. Why?

  • Ion Todirel

    Maybe the problem is in the Windows port of GMP?

  • Shining Arcanine

    RichardRudek wrote:
    
    Shining Arcanine wrote:
    
     I also applied all of the optimization flags that I applied to my project to the lib_gmp_gc project.


    I've never used this library, so I went and downloaded the bits, and then realised what you said above, which I have emboldened.

    The whole point of this library is that key parts of it are hand-optimised assembler. Yet you've elected to compile the Windows library using the generic C version. Why?


    Good point. GMP on Ubuntu was compiled with optimizations for the K8 architecture, so it probably has the advantage of the hand-optimized assembly.

    I recompiled my program with the P4-optimized assembly and there was quite a speed increase.

    Here are the new numbers. Note that I updated the Ubuntu numbers with the ones from the tests I did for TimP, as they were slightly better than the previous ones; since I was using the best numbers I could get for Windows, it was only fair:

    It takes 1 second to test M0 through M2281 on Ubuntu and 2 seconds on Windows, making Ubuntu 100% faster.
    It takes 4 seconds to test M2281 through M3217 on Ubuntu and 3 seconds on Windows, making Ubuntu 33% slower.
    It takes 5 seconds to test M0 through M3217 on Ubuntu and 5 seconds on Windows, making it a tie.
    It takes 6 seconds to test M21701 on Ubuntu and 6 seconds on Windows, making it a tie.

    This solves the mystery. It also demonstrates that hand-written assembly can be far better optimized than compiler-generated code. Thank you, everyone, for your help.

    By the way, I made a mistake when posting the original Ubuntu numbers: the 5-second figure was for M2281 through M3217 and the 6-second figure was for M0 through M3217. It is not a major issue, but otherwise the math would not add up, given that the other tests simply split M0 through M3217 into two separate runs.

  • Yggdrasil

    Shining Arcanine wrote:
    
    It takes 1 second to test M0 through M2281 on Ubuntu and 2 seconds on Windows, making Ubuntu 100% faster.


    From what I remember of your original numbers, you're getting huge rounding errors here. The 1 second reported by time on Linux was actually 1.6 seconds. Try calculating this with more decimal places and you'll get wildly different ratios.

  • Shining Arcanine

    Yggdrasil wrote:
    
    Shining Arcanine wrote:
    
    It takes 1 second to test M0 through M2281 on Ubuntu and 2 seconds on Windows, making Ubuntu 100% faster.


    From what I remember of your original numbers, you're getting huge rounding errors here. The 1 second reported by time on Linux was actually 1.6 seconds. Try calculating this with more decimal places and you'll get wildly different ratios.


    I cannot calculate this with more decimal places; the timer I am using from time.h only reports whole seconds.
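
    For what it's worth, time.h also declares clock(), which counts in ticks of CLOCKS_PER_SEC and gives far finer than one-second resolution on both platforms (it measures CPU time on Linux and elapsed time in the Microsoft CRT). A minimal sketch, assuming the prime search is wrapped in a hypothetical run_tests() function:

    #include <stdio.h>
    #include <time.h>

    /* Hypothetical stand-in for the Mersenne-number search being timed. */
    static void run_tests(void)
    {
      /* ... the M0..Mn loop would go here ... */
    }

    int main(void)
    {
      clock_t start, end;

      start = clock();
      run_tests();
      end = clock();

      printf("It took %.3f seconds\n", (double)(end - start) / CLOCKS_PER_SEC);
      return 0;
    }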

    The case M0 through M3217 covers both M0 through M2281 and M2281 through M3217, and it takes the same time on Ubuntu and Windows, so as far as I am concerned, performance on Windows and performance on Ubuntu are identical.

  • Yggdrasil

    Shining Arcanine wrote:
    I cannot calculate this with more decimal places; the timer I am using from time.h only reports whole seconds.


    Shining Arcanine wrote:

    richard@richard-desktop:~$ time ./a.out 0 2281
    .
    .
    .
    17 prime numbers were found. It took 1.00 seconds
    real    0m1.676s
    user    0m1.652s
    sys     0m0.008s


    Isn't this good enough?

  • Shining Arcanine

    Not when you consider the fact that there is no analogous measurement method available on Windows.

    I would have to write a time program myself and I have other things to do.
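
    Such a timer would not have to be long, though. A rough sketch of what one might look like on Windows, using the Win32 high-resolution counter (the name mytime and the single-argument handling are purely illustrative):

    #include <stdio.h>
    #include <stdlib.h>
    #include <windows.h>

    /* Illustrative "mytime" wrapper: runs a command and reports elapsed wall time. */
    int main(int argc, char *argv[])
    {
      LARGE_INTEGER freq, start, end;

      if (argc < 2) {
        fprintf(stderr, "usage: mytime \"command\"\n");
        return 1;
      }

      QueryPerformanceFrequency(&freq);
      QueryPerformanceCounter(&start);
      system(argv[1]);                  /* run the command being timed */
      QueryPerformanceCounter(&end);

      printf("%.3f seconds\n",
             (double)(end.QuadPart - start.QuadPart) / (double)freq.QuadPart);
      return 0;
    }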

  • Yggdrasil

    Shining Arcanine wrote:
    Not when you consider the fact that there is no analogous measurement method available on Windows.


    You didn't look very hard, did you?


    ptime 1.0 for Win32, Freeware - http://www.pc-tools.net/
    Copyright(C) 2002, Jem Berkes <jberkes@pc-tools.net>

    Syntax: ptime command [arguments ...]

    ptime will run the specified command and measure the execution time (run time) in seconds, accurate to 5 millisecond or better. It is an automatic process timer, or program timer.
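
    For example, assuming the Windows build of the benchmark were named prime.exe (the name is purely illustrative) and took the same start/end arguments as the Linux run shown earlier, timing it would look something like:

    C:\>ptime prime.exe 0 2281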

  • evildictaitor

    W3bbo wrote:
    
    TimP wrote:
    Perhaps gcc is performing optimizations that VC++ is not. Try compiling them both with no optimizations and test the run times. If they're equivalent, then gcc is optimizing more aggressively, otherwise the culprit is elsewhere.


    Optimizations are good... but I'd never expect a 50% gain in performance with them.


    Depends on the program. Some programs simply can't be optimised, others can get a 90% improvement using cheap optimisations. In general, debug optimisations give about a 7% increase in speed but take up more memory, semi-release optimisations give about a 20% improvement, and full optimisation for a final release gives about a 25-40% speed-up.



    Shining Arcanine wrote:
    
    Bounds checking? C is not supposed to have bounds checking. That is a Java feature.

    Bounds checking could be related to the Buffer Security Check, which I have disabled. I am not sure if there are any other features that Microsoft put into its compiler that would do this sort of checking. Is anyone from the Visual Studio team reading this thread, and if so, do you have anything to say about any of the things mentioned in it?


    Bounds checking in C is hard, but possible, and actually happens in lots of modern C compilers (including GCC and VC++). With static arrays it's easy:

    int c[100];
    for (int i = 0; i <= 100; i++) {
      if (i >= 100) abort();   /* bounds check inserted by the compiler */
      c[i] = i;
    }

    The compiler can flag this as soon as it builds the flow graph. With dynamically allocated memory it's more expensive:

    int *myArray = calloc(100, sizeof(int));
    for (int i = 0; i <= 100; i++) {
      myArray[i] = i;
    }

    effectively becomes

    int *myArray = calloc(100, sizeof(int));
    for (int i = 0; i <= 100; i++) {
      int *__cptr = myArray + i;
      if (malloc_index(__cptr) == -1 || malloc_index(myArray) != malloc_index(__cptr))
        abort();   /* bounds check inserted by the compiler */
      *__cptr = i;
    }

    where malloc_index looks through the malloc table (O(log n)) and returns the index in the table that corresponds to the pointer.



    Shining Arcanine wrote:
    Not when you consider the fact that there is no analogous measurement method available on Windows.

    I would have to write a time program myself and I have other things to do.


    You should do the test many times:

    float f;
    int i, time;

    setup_low_resolution_timer();
    for(i=0;i<1000;i++){
      do_function();
    }
    time = stop_low_resolution_timer();

    f = ((float)time / 1000);

  • Shining Arcanine

    evildictaitor wrote:
    
    Shining Arcanine wrote:
    
    Bounds checking? C is not supposed to have bounds checking. That is a Java feature.

    Bounds checking could be related to the Buffer Security Check, which I have disabled. I am not sure if there are any other features that Microsoft put into its compiler that would do this sort of checking. Is anyone from the Visual Studio team reading this thread, and if so, do you have anything to say about any of the things mentioned in it?


    Bounds checking in C is hard, but possible, and actually happens in lots of modern C compilers (including GCC and VC++). With static arrays it's easy:

    int c[100];
    for (int i = 0; i <= 100; i++) {
      if (i >= 100) abort();   /* bounds check inserted by the compiler */
      c[i] = i;
    }

    The compiler can flag this as soon as it builds the flow graph. With dynamically allocated memory it's more expensive:

    int *myArray = calloc(100, sizeof(int));
    for (int i = 0; i <= 100; i++) {
      myArray[i] = i;
    }

    effectively becomes

    int *myArray = calloc(100, sizeof(int));
    for (int i = 0; i <= 100; i++) {
      int *__cptr = myArray + i;
      if (malloc_index(__cptr) == -1 || malloc_index(myArray) != malloc_index(__cptr))
        abort();   /* bounds check inserted by the compiler */
      *__cptr = i;
    }

    where malloc_index looks through the malloc table (O(log n)) and returns the index in the table that corresponds to the pointer.


    I do not want the compiler to try to second-guess whether or not I have adequately ensured that my loop bounds are always in range. Is there any way to disable that feature in GCC (3.4.6 and up) and Visual Studio 2008 Professional?

    evildictaitor wrote:
    
    Shining Arcanine wrote:
    Not when you consider the fact that there is no analogous measurement method available on Windows.

    I would have to write a time program myself and I have other things to do.


    You should do the test many times:

    float f;
    int i, time;

    setup_low_resolution_timer();
    for(i=0;i<1000;i++){
      do_function();
    }
    time = stop_low_resolution_timer();

    f = ((float)time / 1000);


    That is an option that I will probably use in the future when tweaking my program to get more performance, although for now, I am considering the case to be closed. I believe the minor deviations in the figures are the result of different optimized assembly routines being used on Windows and Ubuntu, as GMP has optimized assembly for many platforms (although not as many as I would like, as there is no optimized assembly for the Core 2 Duo) and Brian Gladman's Visual Studio patch for GMP only provides project files for a few of them.

    Since GMP's build system was compiling it optimized for the K8 microarchitecture, I assume it used the K7-optimized assembly, which is separate from the P4-optimized assembly that I am using in Visual Studio. The Visual Studio patch provides no project files for the K7 assembly, and the cost of creating them just to verify my conjecture far outweighs the potential gains, so, with higher-priority things on my to-do list (I have to write a scanner by Tuesday for my C++ class and be prepared for two midterms by Thursday), further analysis is pointless.

  • evildictaitor

    Shining Arcanine wrote:
    
    I do not want the compiler to try to second-guess whether or not I have adequately ensured that my loop bounds are always in range. Is there any way to disable that feature in GCC (3.4.6 and up) and Visual Studio 2008 Professional?

    In VS08 it's called bounds checking and it can be disabled as one of the options for the compiler. In GCC it's off (for C) by default, and better still it can be abstracted away to a compile-time check for most pointers. (link)
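
    For reference, a sketch of the kind of release command line this amounts to, with illustrative file names: /O2 turns on full optimization, /GS- disables the Buffer Security Check, and the /RTC runtime checks are simply left off, since they only apply to unoptimized debug builds anyway.

    cl /O2 /GS- main.c gmp.lib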


    Shining Arcanine wrote:
    That is an option that I will probably use in the future when tweaking my program to get more performance, although for now, I am considering the case to be closed.


    OK, but remember that when working with such small time intervals, even things like the scheduler can completely throw your numbers off. Doing a time-complexity analysis, or simply running the tests many times, is a better way to see which program is faster; then again, if you're doing this just for fun, a full analysis is somewhat overkill.

  • longnight

    You might try running the Windows app in a virtual Windows OS. It could be something like a virus scanner interfering with the numbers.
