Tech Off Thread

39 posts

Why does my C program run faster on Linux than on Windows?

  • Shining Arcanine

    Last November, I wrote a C program that can find large prime numbers (using the Lucas-Lehmer test for Mersenne primes). Today, I installed Ubuntu 6.10 in Microsoft Virtual PC 2007 and proceeded to install gcc, m4, autoconf and gmp. I also installed all of the patches available for Ubuntu and upgraded the kernel to the i686 version.
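
    For anyone not familiar with the test, here is a rough sketch of what a Lucas-Lehmer check with GMP looks like (this is a simplified illustration with names of my own choosing, not my actual code):

    #include <gmp.h>
    #include <stdio.h>

    /* Returns 1 if M_p = 2^p - 1 passes the Lucas-Lehmer test.
       Sketch is valid for odd prime p; p = 2 would need special-casing. */
    int lucas_lehmer(unsigned long p)
    {
        mpz_t m, s;
        unsigned long i;
        int is_prime;

        mpz_init(m);
        mpz_init_set_ui(s, 4);

        mpz_ui_pow_ui(m, 2, p);     /* m = 2^p */
        mpz_sub_ui(m, m, 1);        /* m = 2^p - 1 */

        /* s_0 = 4; s_i = (s_{i-1}^2 - 2) mod m, iterated p - 2 times */
        for (i = 0; i < p - 2; i++) {
            mpz_mul(s, s, s);
            mpz_sub_ui(s, s, 2);
            mpz_mod(s, s, m);
        }

        is_prime = (mpz_cmp_ui(s, 0) == 0);

        mpz_clear(m);
        mpz_clear(s);
        return is_prime;
    }

    int main(void)
    {
        if (lucas_lehmer(2281))
            printf("M2281 is prime\n");
        return 0;
    }

    Something like gcc lucas.c -lgmp should build it.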

    I downloaded the source code for my program from my university's Unix server using scp and compiled it with GCC. Imagine my surprise when I discovered that my program executes far faster in Virtual PC than it does natively on Windows.

    Here are some numbers:

    Testing all of the Mersenne numbers ((2^x) - 1) between 0 and 2281 takes 1 second on Ubuntu and 3 seconds on Windows. Testing all of the Mersenne numbers between 0 and 3217 takes 5 seconds on Ubuntu and 10 seconds on Windows. Testing M21701 ((2^21701) - 1) takes 6 seconds on Ubuntu and 20 seconds on Windows.

    You could say that my program runs twice as fast on Ubuntu, but that would be wrong, because Microsoft Virtual PC does not have SMP support, so these tests are being run in a single thread on Ubuntu and two threads on Windows (three if you include the main thread that waits for the other threads to terminate), with the exception of the test of M21701, which uses a single thread on both.

    Since I am using pthreads-win32 for multithreading, I decided to do another run that would eliminate shared resources between the threads as a source of overhead, by ensuring that each thread had a large amount of work to do before needing to lock the mutex (see the sketch below). So I tested all of the Mersenne numbers between 2281 and 3217. It took 4 seconds on Ubuntu and 7 seconds on Windows.
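
    To illustrate the kind of partitioning I mean, here is a simplified sketch (again, not my actual code): each worker takes one exponent at a time from a shared counter, so the mutex is held only for a moment while the whole Lucas-Lehmer iteration runs without touching shared state.

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static unsigned long next_exponent = 2281;   /* start of the range being tested */
    static unsigned long last_exponent = 3217;   /* end of the range being tested */

    int lucas_lehmer(unsigned long p);           /* the test sketched earlier */

    static void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            unsigned long p;

            /* The mutex only guards handing out the next exponent;
               a real run would also skip exponents that are not prime. */
            pthread_mutex_lock(&lock);
            p = next_exponent++;
            pthread_mutex_unlock(&lock);

            if (p > last_exponent)
                break;

            /* All of the heavy work happens outside the lock. */
            if (lucas_lehmer(p))
                printf("M%lu is prime\n", p);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[2];
        int i;

        for (i = 0; i < 2; i++)
            pthread_create(&threads[i], NULL, worker, NULL);
        for (i = 0; i < 2; i++)
            pthread_join(threads[i], NULL);
        return 0;
    }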

    The variables here are the operating systems and the compilers. I am running Windows XP Media Center Edition 2005 with Visual Studio 2008 Professional and, under Microsoft Virtual PC 2007, Ubuntu 6.10 with GCC 4.0.3. I am compiling my program in Visual Studio 2008 Professional in release mode with all of the compilation flags that I could find set. That includes /Ox, /Ob2, /Ot, /Oy, /GL, /arch:SSE2, /fp:fast and /GS-. I am compiling my program on Ubuntu with the following command:

    gcc -m32 -O2 -fomit-frame-pointer -mtune=k8 -march=k8 mersenne.c /usr/local/lib/gmp.so

    There are no background processes running that are sucking up CPU resources, and I ran each test (particularly on Windows) several times to make sure the results were not flukes.

    The code is the same and the hardware is the same. The only other variable is Virtual PC, which should be hurting performance on Ubuntu, not Windows. So can anyone tell me exactly what is making my program run so much slower on Windows than on Ubuntu?

  • RichA

    I don't know if this is the answer, but in both cases, the hardware is definitely NOT the same.

    Your Windows install may have some driver that is slowing it down that Ubuntu doesn't have because it's running on virtualized hardware.

  • Shining Arcanine

    RichA wrote:
    I don't know if this is the answer, but in both cases, the hardware is definitely NOT the same.

    Your Windows install may have some driver that is slowing it down that Ubuntu doesn't have because it's running on virtualized hardware.

    The processors are identical, and that is what matters for my program. The only difference is that my program was running on Windows in multithreaded mode versus single-threaded mode on Ubuntu. The figures for single-threaded mode on Windows are literally twice the ones I provided for the multithreaded version, with the exception of the one for the test of M21701, which runs in a single thread either way.

    As for Ubuntu having a potential driver advantage: because it is running in Virtual PC, it has to interface with emulated hardware, while Windows interacts directly with the hardware and also has the latest drivers from Intel, Nvidia, Creative, etc. If the drivers on Windows were suboptimal, I would expect that to also affect any OS running under Virtual PC.

  • W3bbo

    Have you tried running a binary for Windows compiled with GCC?

    Or the VC built version under WINE under Ubuntu?

  • Shining Arcanine

    W3bbo wrote:
    Have you tried running a binary for Windows compiled with GCC?

    Or the VC built version under WINE under Ubuntu?


    No. How would I go about doing that?

    By the way, I reran the tests on Windows in single-threaded mode, with the processor affinity set so that the process would not jump from core to core, and started my program from Visual Studio without debugging. The numbers after doing that were 5 seconds for 0 - 2281, 17 seconds for 0 - 3217, 12 seconds for 2281 - 3217 and 20 seconds for M21701. These numbers look even worse for Windows (the previous ones were slanted in Windows' favor, as they were taken with the multithreaded version), but it is an apples-to-apples comparison.
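
    For reference, pinning to one core can also be done from inside the program; here is a minimal Win32 sketch of that idea (just an illustration, not necessarily how I set the affinity):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        /* Restrict the process to CPU 0 so the scheduler cannot migrate it
           between cores during the benchmark. */
        if (!SetProcessAffinityMask(GetCurrentProcess(), 1)) {
            fprintf(stderr, "SetProcessAffinityMask failed: %lu\n", GetLastError());
            return 1;
        }

        /* ... run the Lucas-Lehmer tests here ... */
        return 0;
    }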

  • Antitorgo

    I see that you are linking to GMP in Linux, what math library are you using in Windows? Are you using a port of GMP for Windows? If so, which port? Where did you get it from and how was it compiled?

    Since computing Mersenne primes with the Lucas-Lehmer test is pretty FFT/iFFT intensive, the quality of whatever math library you are using is key.

    Seeing as how Prime95 for Windows and GLucas for Linux/Mac both seem to run at about the same speed (granted, they both use highly optimized assembly for the FFTs), I highly doubt that Windows is making things slower. (See http://www.mersenne.org/)

    You claim to be comparing apples to apples, but I highly doubt it.

  • Shining Arcanine

    Antitorgo wrote:
    I see that you are linking to GMP in Linux, what math library are you using in Windows? Are you using a port of GMP for Windows? If so, which port? Where did you get it from and how was it compiled?

    Since computing Mersenne primes with the Lucas-Lehmer test is pretty FFT/iFFT intensive, the quality of whatever math library you are using is key.

    Seeing as how Prime95 for Windows and GLucas for Linux/Mac both seem to run at about the same speed (granted, they both use highly optimized assembly for the FFTs), I highly doubt that Windows is making things slower. (See http://www.mersenne.org/)

    You claim to be comparing apples to apples, but I highly doubt it.

    I am using GMP on Windows as well:

    http://fp.gladman.plus.com/computing/gmp4win.htm

    I simply had to apply the patch, follow the instructions in the README, compile lib_gmp_gc in Visual Studio, add it to my solution and put the path to the build.vc8 directory as an additional include directory in my project's property pages. I also applied all of the optimization flags that I applied to my project to the lib_gmp_gc project.

    By the way, I have been to mersenne.org in the past and I have a thread discussing this on mersenneforum.org:

    http://www.mersenneforum.org/showthread.php?p=127550

  • TimP

    Perhaps gcc is performing optimizations that VC++ is not. Try compiling them both with no optimizations and test the run times. If they're equivalent, then gcc is optimizing more aggressively, otherwise the culprit is elsewhere.

  • W3bbo

    TimP wrote:
    Perhaps gcc is performing optimizations that VC++ is not. Try compiling them both with no optimizations and test the run times. If they're equivalent, then gcc is optimizing more aggressively, otherwise the culprit is elsewhere.


    Optimizations are good... but I'd never expect a 50% gain in performance with them.

  • Shining Arcanine

    TimP wrote:
    Perhaps gcc is performing optimizations that VC++ is not. Try compiling them both with no optimizations and test the run times. If they're equivalent, then gcc is optimizing more aggressively, otherwise the culprit is elsewhere.


    It would be difficult to compile my program with no optimizations in GCC, as GMP's make script automatically compiles GMP with the most optimal compiler flags available, while Visual Studio will recompile GMP as a debug build if I were to try to compile my program in debug mode. That prevents a fair comparison with optimizations disabled.

  • TimP

    W3bbo wrote:
    TimP wrote:
    Perhaps gcc is performing optimizations that VC++ is not. Try compiling them both with no optimizations and test the run times. If they're equivalent, then gcc is optimizing more aggressively, otherwise the culprit is elsewhere.

    Optimizations are good... but I'd never expect a 50% gain in performance with them.

    I would assume so too, but it's worth ruling out.

    I'm slightly confused by the invocation of gcc.

    gcc -m32 -O2 -fomit-frame-pointer -mtune=k8 -march=k8 mersenne.c /usr/local/lib/gmp.so

    Adding the path to gmp.so, in particular. If my understanding of the gcc documentation is accurate, it will regard gmp.so as an input file, which would seem to imply (if gcc is not complaining) that gmp.so is being statically linked with your executable. You can test this by running ldd on your binary. If you don't see gmp.so in the output, you're statically linking. I'm not well versed in VC++, so I don't know if your VC++ build is performing static linking or how you would check on Windows. Can anyone chime in?

    Back to the point at hand, though: if that were the case, you would only see a fixed delay, since dynamic linking resolution is done at invocation, and the runtime differences would differ by a roughly fixed amount.

  • Shining Arcanine

    W3bbo wrote:
    TimP wrote:
    Perhaps gcc is performing optimizations that VC++ is not. Try compiling them both with no optimizations and test the run times. If they're equivalent, then gcc is optimizing more aggressively, otherwise the culprit is elsewhere.

    Optimizations are good... but I'd never expect a 50% gain in performance with them.

    The gain in performance is far greater than 50% depending on the test case, as when I ran the original tests, I had the multithreaded code path enabled in Visual Studio. I reran them with the single-threaded code path and posted the new numbers not that long ago.

    I will restate them with the gains in performance computed:

    It takes 1 second to test M0 through M2281 on Ubuntu and 5 seconds on Windows, making Ubuntu 400% faster.
    It takes 5 seconds to test M0 through M3217 on Ubuntu and 12 seconds on Windows, making Ubuntu 140% faster.
    It takes 6 seconds to test M2281 through M3217 on Ubuntu and 17 seconds on Windows, making Ubuntu 183% faster.
    It takes 7 seconds to test M21701 on Ubuntu and 20 seconds on Windows, making Ubuntu 186% faster.

    TimP wrote:
    W3bbo wrote:
    TimP wrote:
    Perhaps gcc is performing optimizations that VC++ is not. Try compiling them both with no optimizations and test the run times. If they're equivalent, then gcc is optimizing more aggressively, otherwise the culprit is elsewhere.

    Optimizations are good... but I'd never expect a 50% gain in performance with them.

    I would assume so too, but it's worth ruling out.

    I'm slightly confused by the invocation of gcc.

    gcc -m32 -O2 -fomit-frame-pointer -mtune=k8 -march=k8 mersenne.c /usr/local/lib/gmp.so

    Adding the path to gmp.so, in particular. If my understanding of the gcc documentation is accurate, it will regard gmp.so as an input file, which would seem to imply (if gcc is not complaining) that gmp.so is being statically linked with your executable. You can test this by running ldd on your binary. If you don't see gmp.so in the output, you're statically linking. I'm not well versed in VC++, so I don't know if your VC++ build is performing static linking or how you would check on Windows. Can anyone chime in?

    Back to the point at hand, though: if that were the case, you would only see a fixed delay, since dynamic linking resolution is done at invocation, and the runtime differences would differ by a roughly fixed amount.

    Here is the output:

    richard@richard-desktop:~$ ldd ./a.out
            linux-gate.so.1 =>  (0xffffe000)
            libgmp.so.3 => /usr/local/lib/libgmp.so.3 (0xb7f7a000)
            libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0xb7e40000)
            /lib/ld-linux.so.2 (0xb7fb4000)

  • Shining Arcanine

    TimP wrote:
    So it is indeed doing dynamic linking. Maybe passing shared objects to gcc implicitly tells it to link them dynamically, but I've never seen it done before (I usually see -l<lib>).

    Since the running time seems to grow with the size of the input, my only guess is that threading is being (ab)used in a way that pthreads handle more gracefully than Windows threading (assuming pthreads-win32 is a wrapper for Windows threads). I would attempt to run the core calculations of the algorithm without any threading involved and compare the numbers.

    Those numbers are without any threading involved. The original ones for Windows were with threading involved.

  • TimP

    So it is indeed doing dynamic linking. Maybe passing shared objects to gcc implicitly tells it to link them dynamically, but I've never seen it done before (I usually see -l<lib>).

    Since the running time seems to grow with the size of the input, my only guess is that threading is being (ab)used in a way that pthreads handle more gracefully than Windows threading (assuming pthreads-win32 is a wrapper for Windows threads). I would attempt to run the core calculations of the algorithm without any threading involved and compare the numbers. I'm not sure what you're using to do your timing, but there is a time program standard on Linux that will give you a more detailed breakdown:

    time ./a.out

  • Shining Arcanine

    TimP wrote:
    So it is indeed doing dynamic linking. Maybe passing shared objects to gcc implicitly tells it to link them dynamically, but I've never seen it done before (I usually see -l<lib>).

    Since the running time seems to grow with the size of the input, my only guess is that threading is being (ab)used in a way that pthreads handle more gracefully than Windows threading (assuming pthreads-win32 is a wrapper for Windows threads). I would attempt to run the core calculations of the algorithm without any threading involved and compare the numbers. I'm not sure what you're using to do your timing, but there is a time program standard on Linux that you can use.

    time ./a.out

    I see you edited your post. Anyway, I am using time.h to do timing. I reran my program using Linux's time program and here is the output:

    richard@richard-desktop:~$ time ./a.out 0 2281
    M2 is prime
    M3 is prime
    M5 is prime
    M7 is prime
    M13 is prime
    M17 is prime
    M19 is prime
    M31 is prime
    M61 is prime
    M89 is prime
    M107 is prime
    M127 is prime
    M521 is prime
    M607 is prime
    M1279 is prime
    M2203 is prime
    M2281 is prime

    17 prime numbers were found. It took 1.00 seconds

    real    0m1.676s
    user    0m1.652s
    sys     0m0.008s
    richard@richard-desktop:~$ time ./a.out 0 3217
    M2 is prime
    M3 is prime
    M5 is prime
    M7 is prime
    M13 is prime
    M17 is prime
    M19 is prime
    M31 is prime
    M61 is prime
    M89 is prime
    M107 is prime
    M127 is prime
    M521 is prime
    M607 is prime
    M1279 is prime
    M2203 is prime
    M2281 is prime
    M3217 is prime

    18 prime numbers were found. It took 5.00 seconds

    real    0m5.284s
    user    0m5.228s
    sys     0m0.008s
    richard@richard-desktop:~$ time ./a.out 2281 3217
    M2281 is prime
    M3217 is prime

    2 prime numbers were found. It took 4.00 seconds

    real    0m3.684s
    user    0m3.620s
    sys     0m0.016s
    richard@richard-desktop:~$ time ./a.out 21701
    M21701 is prime

    1 prime numbers were found. It took 6.00 seconds

    real    0m6.612s
    user    0m6.204s
    sys     0m0.020s
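
    One thing worth noting about the timing: my program prints whole seconds while time reports fractions, which is what you would get from time()-style wall-clock timing with one-second resolution. Here is a minimal sketch of the two time.h approaches (illustrative only; this is not my actual timing code):

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        time_t  wall_start = time(NULL);    /* wall clock, one-second resolution */
        clock_t cpu_start  = clock();       /* CPU time, sub-second resolution */

        /* ... run the Lucas-Lehmer tests here ... */

        printf("wall time: %.0f seconds\n", difftime(time(NULL), wall_start));
        printf("CPU time:  %.2f seconds\n",
               (double)(clock() - cpu_start) / CLOCKS_PER_SEC);
        return 0;
    }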

  • TimP

    Shining Arcanine wrote:
    TimP wrote:
    So it is indeed doing dynamic linking. Maybe passing shared objects to gcc implicitly tells it to link them dynamically, but I've never seen it done before (I usually see -l<lib>).

    Since the running time seems to grow with the size of the input, my only guess is that threading is being (ab)used in a way that pthreads handle more gracefully than Windows threading (assuming pthreads-win32 is a wrapper for Windows threads). I would attempt to run the core calculations of the algorithm without any threading involved and compare the numbers.

    Those numbers are without any threading involved. The original ones for Windows were with threading involved.

    Is the processor multicore? (I didn't see it mentioned in the OP) If it isn't, threading code is just overhead and performance will be degraded on your Windows test.

  • Shining Arcanine

    TimP wrote:
    Shining Arcanine wrote:
    TimP wrote:
    So it is indeed doing dynamic linking. Maybe passing shared objects to gcc implicitly tells it to link them dynamically, but I've never seen it done before (I usually see -l<lib>).

    Since the running time seems to grow with the size of the input, my only guess is that threading is being (ab)used in a way that pthreads handle more gracefully than Windows threading (assuming pthreads-win32 is a wrapper for Windows threads). I would attempt to run the core calculations of the algorithm without any threading involved and compare the numbers.

    Those numbers are without any threading involved. The original ones for Windows were with threading involved.

    Is the processor multicore? (I didn't see it mentioned in the OP) If it isn't, threading code is just overhead and performance will be degraded on your Windows test.

    It is an Intel Core 2 Duo E6300 with 4GB of RAM (4x1GB) on an Asus P5K Deluxe motherboard.

  • staceyw

    When I try to track down this kind of thing, I try to eliminate as much as possible, then add stuff back until I find the issue. Try running the test on the native OS in both cases, without a VM. Also, verify your timer code is the same in both cases.
