C++ and Beyond 2012: Herb Sutter - atomic<> Weapons, 1 of 2

Herb Sutter presents atomic<> Weapons, 2 of 2. This was filmed at C++ and Beyond 2012. As the title suggests, this is a two-part series (given the depth of treatment and complexity of the subject matter).
STOP! => Watch part 1 first!
Abstract:
This session in one word: Deep.
It's a session that includes topics I've publicly said for years is Stuff You Shouldn't Need To Know and I Just Won't Teach, but it's becoming achingly clear that people do need to know about it. Achingly, heartbreakingly clear, because some hardware incents you to pull out the big guns to achieve top performance, and C++ programmers just are so addicted to full performance that they'll reach for the big red levers with the flashing warning lights. Since we can't keep people from pulling the big red levers, we'd better document the A to Z of what the levers actually do, so that people don't SCRAM unless they really, really, really meant to.
Topics Covered:
The link to part one under the "More episodes in this show" section of this page doesn't work. There's probably some HTML escaping bug involved as the angle brackets are not displayed correctly in the link text.
@ajasmin: Where is this link?
@Charles http://imgur.com/aUfHGnH Actually, I think this is a link to the same page we're on, so it makes sense that it isn't clickable, but there's still a minor problem with the escaping.
I feel bad bringing this up in the comments as it doesn't relate to the content. If possible, feel free to delete these comments after the fact.
The talk itself was the most thorough presentation I've seen on the topic. Quite enlightening.
@ajasmin: Thanks for the bug report!
@Charles The page title has the same problem: http://imgur.com/KwHatso
@Charles:
All better now. Actually wasn't an encoding issue. Rather the way we were truncating the title. Anyway, this issue is fixed and deployed.
Thanks for pointing this out ajasmin.
@ajasmin:
Well, got the first issue fixed.
Thanks for noticing the title. We will clean that one up too.
AWESOME, learned a lot.
The bad thing is, now I fear writing any multithreaded code since I'm not using a C++11 compiler! Coding without a memory model is so scary! How many bugs are there in VC10? VC9?
Herb, on your object layout considerations slide (2).
What if you have this:
char s[2]; // or char* s = new char[2];
Thread 1
{ std::lock_guard<mutex> lock(s0mutex); // named guard, so the lock is held until the closing brace
s[0] = 1;
}
Thread 2
{ std::lock_guard<mutex> lock(s1mutex);
s[1] = 1;
}
And assume the processor can only do 32 bit reads/writes.
Does this mean that the compiler must align every character on a 32-bit boundary, essentially increasing the size of every character array/string by a factor of 4?
Was this true before the C++11 memory model was introduced?
And following up on my previous post: if all of a sudden characters have an alignment requirement of 4 bytes, then one can no longer do IO of binary data into character arrays.
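For readers following the question, here is a minimal sketch of the scenario with the locks removed. It relies on the C++11 rule that every scalar object, including each element of a char array, is its own memory location, so writes to s[0] and s[1] from different threads do not conflict. The function name write_adjacent_bytes is made up for this illustration and is not from the talk.

```cpp
#include <cassert>
#include <thread>

// Adjacent chars stay packed: the array is still 2 bytes, not 8.
static_assert(sizeof(char[2]) == 2, "char arrays are not padded per element");
static_assert(alignof(char) == 1, "char keeps byte alignment");

// Hypothetical helper: two threads write neighbouring bytes without any lock.
// s[0] and s[1] are distinct memory locations in C++11, so this is not a
// data race and neither write is allowed to clobber the other byte.
int write_adjacent_bytes() {
    char s[2] = {0, 0};
    std::thread t1([&s] { s[0] = 1; });
    std::thread t2([&s] { s[1] = 1; });
    t1.join();
    t2.join();
    // join() synchronizes with the end of each thread, so both writes are visible here.
    return s[0] + s[1];
}
```

Calling write_adjacent_bytes() should yield 2 on a conforming implementation. How the compiler achieves that on hardware without byte stores is left to the implementation.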
@1:19:00
while (!stop.load(memory_order_relaxed)) { <= What keeps the load from being hoisted out of the while loop altogether, so that the thread never stops?
@Matthias
The consensus seems to be that the compiler can't completely optimize away the load operation from the loop, because the standard states in 1.10/25 that "An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time."
See also the comments by Bartosz Milewski and Anthony Williams here:
http://stackoverflow.com/questions/8819095#comment-11038619
And the "memory clobber" note here:
http://david.jobet.free.fr/wiclear-blog/index.php?title=2010-10-17-c%2B%2B-atomic-lib-impl-rules&mode=print&lang=fr
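A minimal sketch of the stop-flag loop under discussion, assuming a single worker and a plain "stop = true" in the launching thread. The name stop_flag_demo and the worker_exited variable are invented for the example. Note that the quoted wording says "should", so termination here rests on the implementation following that recommendation.

```cpp
#include <atomic>
#include <thread>

// Hypothetical demo: a worker spins on a relaxed load of `stop` until the
// launching thread sets the flag. Returns true once the worker left its loop.
bool stop_flag_demo() {
    std::atomic<bool> stop{false};
    bool worker_exited = false;  // plain bool: written by the worker, read only after join()

    std::thread worker([&stop, &worker_exited] {
        // Each iteration performs a fresh atomic load; the compiler may not
        // replace it with a single read taken before the loop.
        while (!stop.load(std::memory_order_relaxed)) {
            std::this_thread::yield();
        }
        worker_exited = true;
    });

    stop = true;    // sequentially consistent store
    worker.join();  // join() makes the worker's write to worker_exited visible
    return worker_exited;
}
```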
I do also have a question regarding the slide at 1:19:00 (page 48 in the linked slides):
Why can't the "stop = true" assignment be relaxed when it doesn't publish data?
And I have another question regarding a slide that Herb skips over at 1:25:48 (page 54 in the linked slides) and that includes this code for lazily creating a widget instance:
atomic<widget*> widget::instance = nullptr;
atomic<bool> widget::create = false;
widget* widget::get_instance() {
if (instance.load(memory_order_acquire) == nullptr)
{
if (!create.exchange(true, memory_order_relaxed))
{
instance.store(new widget(), memory_order_release);
}
else while(instance.load(memory_order_acquire) == nullptr) {}
}
return instance.load(memory_order_acquire);
}
The slide states that using relaxed in the exchange operation is bad because the code "could do some widget creation if CAS fails - and worse". However, I'm having difficulties seeing how the relaxed exchange in the if condition could result in observably bad behaviour. Can anyone help me out?
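For anyone who wants to experiment, here is a compilable sketch of the same pattern. It deliberately uses the default (sequentially consistent) ordering on the exchange, so it does not settle the question of whether relaxed would be enough. The constructed counter and the all_threads_see_one_widget driver are additions for this illustration and are not on the slide.

```cpp
#include <atomic>
#include <thread>
#include <vector>

struct widget {
    static std::atomic<int> constructed;  // demo-only counter, not on the slide
    widget() { constructed.fetch_add(1); }
    static widget* get_instance();
private:
    static std::atomic<widget*> instance;
    static std::atomic<bool> create;
};

std::atomic<int> widget::constructed{0};
std::atomic<widget*> widget::instance{nullptr};
std::atomic<bool> widget::create{false};

widget* widget::get_instance() {
    widget* p = instance.load(std::memory_order_acquire);
    if (p == nullptr) {
        if (!create.exchange(true)) {  // default memory_order_seq_cst
            // Exactly one thread sees `false` here and builds the widget.
            p = new widget();
            instance.store(p, std::memory_order_release);
        } else {
            // Every other thread waits until the pointer is published.
            while ((p = instance.load(std::memory_order_acquire)) == nullptr) {
                std::this_thread::yield();
            }
        }
    }
    return p;
}

// Hypothetical driver: n threads race to get the instance. Returns true if
// they all received the same non-null pointer and one widget was constructed.
bool all_threads_see_one_widget(int n) {
    std::vector<widget*> seen(n, nullptr);
    std::vector<std::thread> threads;
    for (int i = 0; i < n; ++i) {
        threads.emplace_back([&seen, i] { seen[i] = widget::get_instance(); });
    }
    for (auto& t : threads) t.join();
    for (widget* p : seen) {
        if (p == nullptr || p != seen[0]) return false;
    }
    return widget::constructed.load() == 1;
}
```

The widget is intentionally never deleted, as is usual for this kind of lazily created singleton.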
BTW, somebody should point out that there currently is a severe problem with the code generation for atomic operations by VC++2012: http://connect.microsoft.com/VisualStudio/feedback/details/770885/std-atomic-load-implementation-is-absurdly-slow
@Matthew: True and new in C++11 (and C11 and Java), but nothing to worry about because you'll likely never encounter a processor that can't do single-byte loads and stores.
@Matthias: The memory model says that compilers/processors/caches can't transform potentially infinite loops into non-infinite loops.
@Stephan, quick responses:
Herb, thank you for your reply!
Regarding a relaxed stop=true on page 49: launch_workers() and join_workers() should normally both synchronize with the launched and joined workers, so shouldn't that prevent the worker threads from seeing the assignment too early or too late?
Regarding the relaxed exchange_explicit on page 54: If I'm not mistaken, the compiler or CPU can't move anything from the if-body to before the if-statement if that could lead to observable side effects when the if-condition evaluates to false. When the if-condition evaluates to true and any reader load-acquires a non-null instance pointer, the reader is guaranteed to see the fully constructed widget due to the store-release being sequenced after the new widget(). When the if-condition evaluates to false (after a thread saw a null instance), the thread load-acquires the instance pointer in a spin loop until it sees a non-null instance (which then again must be fully constructed). So, I still don't see why a relaxed exchange wouldn't be enough, what am I missing?
Thank you. I am enlightened now!
Could somebody clarify whether the ARMv8's new atomic store-release and load-acquire instructions are available only as 64-bit instructions or also as 32-bit instructions?
@herbsutter, I know this is so old, but on slide 47, why does dirty need to be atomic at all? I'm assuming there are no reads of dirty in the threads, only writes of true (never writes of false), and that the join ensures that dirty has been written to memory. Also, it is bool and can't tear.
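One rule that bears on this question, independent of the slide: in C++11, two unsynchronized writes to the same non-atomic object from different threads are a data race (undefined behavior), even if both write the same value. The sketch below assumes several workers may set the flag concurrently, which the comment does not state. The name any_worker_dirty and the "even-numbered workers set the flag" condition are invented for the illustration.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical demo: several workers may set `dirty`; the launching thread
// reads it only after joining them all.
bool any_worker_dirty(int workers) {
    std::atomic<bool> dirty{false};
    std::vector<std::thread> pool;
    for (int i = 0; i < workers; ++i) {
        pool.emplace_back([&dirty, i] {
            // Stand-in condition: even-numbered workers find something dirty.
            // Concurrent stores to an atomic are fine; the same stores to a
            // plain bool would be a data race.
            if (i % 2 == 0) dirty.store(true, std::memory_order_relaxed);
        });
    }
    for (auto& t : pool) t.join();
    // join() orders the workers' stores before this load, so relaxed is enough here.
    return dirty.load(std::memory_order_relaxed);
}
```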