C++ and Beyond 2012: Herb Sutter - atomic<> Weapons, 2 of 2



Herb Sutter presents atomic<> Weapons, 2 of 2. This was filmed at C++ and Beyond 2012. As the title suggests, this is a two-part series (given the depth of treatment and complexity of the subject matter).

STOP! => Watch part 1 first!

Download the slides.


This session in one word: Deep.

It's a session that includes topics I've publicly said for years are Stuff You Shouldn't Need To Know and I Just Won't Teach, but it's becoming achingly clear that people do need to know about it. Achingly, heartbreakingly clear, because some hardware incents you to pull out the big guns to achieve top performance, and C++ programmers are just so addicted to full performance that they'll reach for the big red levers with the flashing warning lights. Since we can't keep people from pulling the big red levers, we'd better document the A to Z of what the levers actually do, so that people don't SCRAM unless they really, really, really meant to.

Topics Covered:

  • The facts: The C++11 memory model and what it requires you to do to make sure your code is correct and stays correct. We'll include clear answers to several FAQs: "how do the compiler and hardware cooperate to remember how to respect these rules?", "what is a race condition?", and the ageless one-hand-clapping question "how is a race condition like a debugger?"
  • The tools: The deep interrelationships and fundamental tradeoffs among mutexes, atomics, and fences/barriers. I'll try to convince you why standalone memory barriers are bad, and why barriers should always be associated with a specific load or store.
  • The unspeakables: I'll grudgingly and reluctantly talk about the Thing I Said I'd Never Teach That Programmers Should Never Need To Know: relaxed atomics. Don't use them! If you can avoid it. But here's what you need to know, even though it would be nice if you didn't need to know it.
  • The rapidly-changing hardware reality: How locks and atomics map to hardware instructions on ARM and x86/x64, and throw in POWER and Itanium for good measure – and I'll cover how and why the answers are actually different last year and this year, and how they will likely be different again a few years from now. We'll cover how the latest CPU and GPU hardware memory models are rapidly evolving, and how this directly affects C++ programmers.
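
As a rough illustration of the "tools" bullet above, here is a minimal sketch of ordering that is attached to one specific store and one specific load. The names `payload`, `ready`, and `publish_and_consume` are made up for this example and do not come from the talk.

```cpp
#include <atomic>
#include <thread>

// Hypothetical names for illustration only.
int payload = 0;                   // ordinary (non-atomic) data
std::atomic<bool> ready{false};    // flag that publishes the data

// Returns the value the consumer thread observed in payload.
int publish_and_consume() {
    payload = 0;
    ready.store(false);
    int seen = -1;

    std::thread producer([] {
        payload = 42;                                  // write the data first
        ready.store(true, std::memory_order_release);  // ordering tied to this store
    });
    std::thread consumer([&seen] {
        // Ordering tied to this load: once it reads true, the write to
        // payload made before the release store is visible here.
        while (!ready.load(std::memory_order_acquire)) {}
        seen = payload;
    });

    producer.join();
    consumer.join();
    return seen;   // 42
}
```

The standalone-barrier alternative would pair a relaxed store with a separate `std::atomic_thread_fence(std::memory_order_release)`; the sketch above keeps the ordering on the operations themselves instead.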




The Discussion

  • User profile image

    The link to part one under the "More episodes in this show" section of this page doesn't work. There's probably some HTML escaping bug involved as the angle brackets are not displayed correctly in the link text.

  • User profile image

    @ajasmin: Where is this link?


  • User profile image

    @Charles http://imgur.com/aUfHGnH Actually, I think this is a link to the same page we're on, so it makes sense that it isn't clickable, but there's still a minor problem with the escaping.

    I feel bad bringing this up in the comments as it doesn't relate to the content. If possible, feel free to delete these comments after the fact.

    The talk itself was the most thorough presentation I've seen on the topic. Quite enlightening.

  • User profile image

    @ajasmin: Thanks for the bug report!

  • User profile image

    @Charles The page title has the same problem: http://imgur.com/KwHatso

  • User profile image


    All better now. Actually wasn't an encoding issue. Rather the way we were truncating the title. Anyway, this issue is fixed and deployed.

    Thanks for pointing this out ajasmin.

  • User profile image


    Well, got the first issue fixed.

    Thanks for noticing the title. We will clean that one up too.

  • User profile image

    AWESOME, learned a lot.

    The bad thing is, now I fear writing any multithreaded code since I'm not using a C++11 compiler! Coding without a memory model is so scary! How many bugs are there in VC10? VC9?

  • User profile image
    Matthew Fioravante

    Herb, on your object layout considerations slide (2).

    What if you have this:

    char s[2]; // or char* s = new char[2];

    Thread 1
    { std::lock_guard<std::mutex> lock(s0mutex); // named guard, so the lock is held for the whole block
      s[0] = 1; }

    Thread 2
    { std::lock_guard<std::mutex> lock(s1mutex);
      s[1] = 1; }

    And assume the processor can only do 32 bit reads/writes.

    Does this mean that the compiler must place every character on a 32 bit boundary, essentially increasing the size of every character array/string by a factor of 4?

    Was this true before the C++11 memory model was introduced?

  • User profile image
    Matthew Fioravante

    And following up on my previous post. If all of a sudden characters have an alignment requirement of 4 bytes, then one can no longer do IO of binary data into character arrays.
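
A compilable sketch of the scenario in this question may help make it concrete. It assumes a platform with single-byte stores, and the function name `write_adjacent_bytes` is invented for the example. Under the C++11 model, `s[0]` and `s[1]` are separate memory locations, so the two writes do not race even though they are guarded by different mutexes.

```cpp
#include <mutex>
#include <thread>

// Names s, s0mutex, s1mutex mirror the question; the function is hypothetical.
char s[2] = {0, 0};
std::mutex s0mutex;   // protects s[0] only
std::mutex s1mutex;   // protects s[1] only

// The array stays packed: no per-element padding is introduced.
static_assert(sizeof(s) == 2, "char array is not padded");
static_assert(alignof(char) == 1, "char has no extra alignment requirement");

// Runs both writers concurrently and returns s[0] + s[1].
int write_adjacent_bytes() {
    s[0] = 0;
    s[1] = 0;
    std::thread t1([] {
        std::lock_guard<std::mutex> lock(s0mutex);
        s[0] = 1;   // must not disturb s[1]
    });
    std::thread t2([] {
        std::lock_guard<std::mutex> lock(s1mutex);
        s[1] = 1;   // must not disturb s[0]
    });
    t1.join();
    t2.join();
    return s[0] + s[1];   // both writes survive
}
```

Because neither write may clobber its neighbor, `write_adjacent_bytes()` yields 2 on a conforming implementation.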

  • User profile image

    while (!stop.load(memory_order_relaxed)) {

    What keeps the load from being hoisted out of the while loop altogether, so that the thread never stops?

  • User profile image

    The consensus seems to be that the compiler can't completely optimize away the load operation from the loop, because the standard states in 1.10/25 that "An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time."
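
The loop under discussion can be sketched as follows. This is only an illustration: the helper names `spin_until_stopped` and `worker_terminates` are invented here, and the store uses the default sequentially consistent ordering.

```cpp
#include <atomic>
#include <thread>

std::atomic<bool> stop{false};   // flag name taken from the question

// Spins until stop is set; returns the number of iterations completed.
unsigned long spin_until_stopped() {
    unsigned long iterations = 0;
    // The relaxed load is still an atomic load and is re-evaluated on each pass.
    while (!stop.load(std::memory_order_relaxed)) {
        ++iterations;
    }
    return iterations;
}

// Starts one worker, requests a stop, and reports whether the worker exited.
bool worker_terminates() {
    stop.store(false);
    bool exited = false;
    std::thread worker([&exited] {
        spin_until_stopped();
        exited = true;
    });
    stop.store(true);   // default memory_order_seq_cst
    worker.join();      // returns only after the worker has left its loop
    return exited;
}
```

If the load were hoisted out of the loop, `worker.join()` would never return; the quoted wording is what an implementation is expected to honor so that it does.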

    See also the comments by Bartosz Milewski and Anthony Williams here:

    And the "memory clobber" note here:

  • User profile image

    I do also have a question regarding the slide at 1:19:00 (page 48 in the linked slides):
    Why can't the "stop = true" assignment be relaxed when it doesn't publish data?

    And I have another question regarding a slide that Herb skips over at 1:25:48 (page 54 in the linked slides) and that includes this code for lazily creating a widget instance:

    atomic<widget*> widget::instance{nullptr};
    atomic<bool> widget::create{false};

    widget* widget::get_instance() {
        if (instance.load(memory_order_acquire) == nullptr) {
            // std::atomic's member is exchange(); this thread refers to it as exchange_explicit
            if (!create.exchange(true, memory_order_relaxed))
                instance.store(new widget(), memory_order_release);
            else
                while (instance.load(memory_order_acquire) == nullptr) {}
        }
        return instance.load(memory_order_acquire);
    }

    The slide states that using relaxed in the exchange operation is bad because the code "could do some widget creation if CAS fails - and worse". However, I'm having difficulties seeing how the relaxed exchange in the if condition could result in observably bad behaviour. Can anyone help me out?

    BTW, somebody should point out that there currently is a severe problem with the code generation for atomic operations by VC++2012: http://connect.microsoft.com/VisualStudio/feedback/details/770885/std-atomic-load-implementation-is-absurdly-slow

  • User profile image

    @Matthew: True and new in C++11 (and C11 and Java), but nothing to worry about because you'll likely never encounter a processor that can't do single-byte loads and stores.

    @Matthias: The memory model says that compilers/processors/caches can't transform potentially infinite loops into non-infinite loops.

    @Stephan, quick responses:

    • The stop=true; can't be relaxed because otherwise it could float up across the launch (annoying but mostly benign, just causing workers to always immediately stop) or down across the join (oops, program will never terminate).
    • The exchange_explicit can't be relaxed because for example part of "new widget" could float up speculatively out of the if.
    • Yes, VS 2012 atomic code gen is very pessimized for x86/x64; we're fixing it. ARMv7 code gen is already pretty good, modulo ARMv7's own limitations described in the talk.
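
For reference, here is a compilable sketch of the `get_instance` pattern discussed in this thread, written with the default sequentially consistent ordering on the exchange rather than the relaxed one under debate. The `value` member and the `all_threads_agree` helper are assumptions added only so the example can be exercised; they are not from the slides.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

struct widget {
    int value = 7;   // hypothetical payload
    static widget* get_instance();
private:
    static std::atomic<widget*> instance;
    static std::atomic<bool> create;
};

std::atomic<widget*> widget::instance{nullptr};
std::atomic<bool> widget::create{false};

widget* widget::get_instance() {
    if (instance.load(std::memory_order_acquire) == nullptr) {
        if (!create.exchange(true)) {   // seq_cst; only the first caller sees false
            instance.store(new widget(), std::memory_order_release);
        } else {
            // Another thread is constructing the widget; wait until it is published.
            while (instance.load(std::memory_order_acquire) == nullptr) {}
        }
    }
    return instance.load(std::memory_order_acquire);
}

// Calls get_instance() from n threads and checks that all saw the same object.
bool all_threads_agree(std::size_t n) {
    std::vector<widget*> seen(n, nullptr);
    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < n; ++i)
        threads.emplace_back([&seen, i] { seen[i] = widget::get_instance(); });
    for (auto& t : threads) t.join();
    for (widget* w : seen)
        if (w == nullptr || w != seen[0]) return false;
    return true;
}
```

The single instance is intentionally never deleted in this sketch. Where the hand-rolled protocol is not required, a function-local static or `std::call_once` is a simpler way to get one-time initialization.
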
  • User profile image


    Herb, thank you for your reply!

    Regarding a relaxed stop=true on page 49: launch_workers() and join_workers() should normally both synchronize with the launched and joined workers, so shouldn't that prevent the worker threads from seeing the assignment too early or too late?

    Regarding the relaxed exchange_explicit on page 54: If I'm not mistaken, the compiler or CPU can't move anything from the if-body to before the if-statement if that could lead to observable side effects when the if-condition evaluates to false. When the if-condition evaluates to true and any reader load-acquires a non-null instance pointer, the reader is guaranteed to see the fully constructed widget due to the store-release being sequenced after the new widget(). When the if-condition evaluates to false (after a thread saw a null instance), the thread load-acquires the instance pointer in a spin loop until it sees a non-null instance (which then again must be fully constructed). So, I still don't see why a relaxed exchange wouldn't be enough, what am I missing?

  • User profile image

    Thank you. I am enlightened now!

  • User profile image
    Kal Sze

    Could somebody clarify whether the ARMv8's new atomic store-release and load-acquire instructions are available only as 64-bit instructions or also as 32-bit instructions?
