neerajsi

Niner since 2008


  • Arun Kishan: Inside Windows 7 - Farewell to the Windows Kernel Dispatcher Lock

    Hi MaidenDotNet,

    The OS will already page out your stacks/process after some time of inactivity.  I believe a thread's stack becomes a candidate after about 4 seconds of inactivity. Once the pages become candidates for theft, it's only a matter of time and memory pressure before the Memory Manager rips them away and uses them for something else. 


    Basically, you don't need to give a 'hint' to the OS, since it will do what you want on its own.  On the other hand, if you want the thread to run quickly when it gets signaled, this behavior can hurt you.  If the pages make it to disk, it could take about 10-15 milliseconds between when your thread is signaled and when it can actually run (that's the typical seek time of a laptop disk), and even longer if the disk is already busy.

  • Stephan T. Lavavej: Digging into C++ Technical Report 1 (TR1)

    I think you guys are giving GC a bad rap (and I say this as a person who works in kernel land exclusively in C, not even C++).  
    If you wish to write a multi-threaded program, maintaining reference counts becomes quite costly because of the large number of interlocked operations you need to perform.  I think shared_ptr uses atomics or locks by default (according to the boost synopsis).  Allocating memory is also a pain because you have to lock down pieces of your heap, and you get some internal and external fragmentation from your allocator.  This is okay for certain types of programs because they only deal with objects of certain sizes and so can pull tricks like per-thread memory caching, but you probably have to implement this yourself by overriding operator new.

    GC doesn't suffer from this problem.  Allocation is handled by incrementing a pointer (this one does have to be done interlocked), but further accesses to the object and copying of references require no further synchronization.  As an additional optimization, Sun's GC creates per-thread first-gen memory pools called TLABs that let you make small allocations with no synchronization at all until you exhaust the TLAB block.  I'm not sure whether or not .NET does this yet, but it's a relatively straightforward optimization.

    Stephan mentions the memory efficiency of shared_ptr-based C++ solutions, but I don't think that's the case.  Looking at the constructor implementation of shared_ptr from the Dr. Dobb's article[1], shared_ptr uses a pointer-sized element to refer to the object (unavoidable) plus a pointer to a reference count block, and the reference count block itself is new()ed up from the heap at 16-byte allocation granularity (three quarters of it is wasted space).  So for an object with only one reference you take up 32 bytes of memory plus the object itself, and you likely pull in three CPU cache lines (one on the stack, one for the refcount, and one for the object itself that you're accessing).  And your allocated objects are sparser in memory because of the normal fragmentation effects, which can't be avoided except by compacting GCs.  I wouldn't necessarily call this memory efficiency.

    GC certainly appears to allocate bigger segments (such as 16 MB default generation segments... or even larger ones), but the OS memory manager is smart enough, thanks to Landy Wang, that it won't actually do anything to give you that memory until you touch it.  So if the GC can keep things tightly compacted (i.e., you have well-defined cycles of allocation and freeing), you get better cache locality (references are only pointer-sized, and objects can be packed densely with a 12-byte overhead for a sync structure and a method table pointer, something you'd need for any polymorphic C++ type anyway), and you shouldn't really pay a much higher working-set cost. 

    I buy the Patrick Dussud Kool-Aid and think that for memory, GC is clearly superior.  The arguments against GC for other resources are totally valid, and GC languages would do well to incorporate RAII-style syntax for Dispose that's even more automatic than "using".  C++/CLI has this already (just declare your managed variable with stack semantics and you get Dispose called automatically). 

    If you make a GCed runtime that doesn't attempt to add extra safety guarantees like verifiable code or type-checking on casts, I think simple code would run faster than a transliteration of the same code into simple TR1-style C++ without custom allocation schemes or complicated wizardry.