Programming in the Age of Concurrency: The Accelerator Project

Niners, please play with this (the C#/VB.net library may be easier to grasp, given that Haskell is purely functional...). Tim and Simon need real-world feedback. How would you use atomic blocks in your applications? What scenarios make the use of atomic hard for you? See the links in the video post description. Also, please read the Composability paper (you don't have to be a scientist to understand it...)
Play with STM!
C
It seems to me that STM creates new problems with composability. You create two classes of code: atomic methods and non-atomic methods.
Non-atomic methods can easily call atomic ones; the compiler could even inject the atomic block automatically if the programmer forgot.
Atomic methods and blocks cannot be allowed to call non-atomic code. The non-atomic code could do I/O or other irrevocable things that would be duplicated when the block had to retry.
This creates the exact problem that the “const” modifier has in C++. Your library writer has to anticipate everything you may want to do inside a transaction and provide you with an atomic version. By liberally making methods atomic, the author seriously limits future versions of the library because making a previously atomic method non-atomic is a breaking change.
In C++ they cop out and let you cast away the "constness" of a variable, because there were too many places where composition (exactly the problem you are trying to fix) broke down once you had one class of code that could not call another.
It's worse for STM, because you can't even cop out and say "use this sparingly." Letting a transaction call non-atomic code creates the very race conditions you are trying to avoid, and makes them worse by repeatedly running code that you have told the programmer to think of as running only once.
I am certain this is not the first time this problem has come up, especially since you mention statically enforcing it in the compiler. I am just curious to hear what people think about what it will do to composability.
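A minimal sketch of the hazard, assuming a C#-style atomic block like the one in the video (the account and the logging call are illustrative):

atomic
{
    account.Balance -= amount;                   // transactional write: rolled back on conflict
    Console.WriteLine("Withdrew {0}", amount);   // irrevocable I/O: replayed on every retry!
}

If the transaction has to retry, the withdrawal still happens exactly once, but the message prints once per attempt.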
staceyw wrote:1) I missed a fundamental idea, I think. It sounds as if the last writer wins. So if I have an atomic block, I can make changes to state. But another thread can write to the state and commit, and the first atomic block will then roll back. What did I miss?
staceyw wrote:2) Charles, you had a good question that I don't think was fully answered in the video: doing async in the block.
atomic()
{
    // Do some stuff.
    BeginXXX(
        delegate(int i)
        {
            // Do some stuff in the callback.
        });
}
We do some stuff and then kick off an async operation. But then we exit the atomic block right away, because the async operation does not block. So the callback is not executed in the atomic block anymore. And worse yet, the code reads as if it were still executed in the block. What am I missing? Or is the answer just never to do this? (One possible workaround is sketched after question 3.)
staceyw wrote:3) How do long-running I/O and DB transactions play with this? In a normal app, I would need both to be atomic: the in-memory objects and the DB, in a single transaction. How does this work?
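On question 2, one possible workaround (a hedged sketch, reusing the hypothetical BeginXXX from above) is to keep the atomic block purely transactional and start the async operation only after the block has committed, giving the callback its own transaction:

atomic()
{
    // Transactional work only; nothing irrevocable in here.
}
BeginXXX(
    delegate(int i)
    {
        atomic()
        {
            // The callback runs as its own, separate transaction.
        }
    });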
Excellent video!
I downloaded the C# code and took a quick look.
Good work with the runtime proxy generation. Geez, it must have been painful writing all that IL-emit code!
I'll take a look at writing some Nemerle macros to make the syntax nicer. I'm thinking macros along the lines of:
foo = atomic(Foo(arg1, arg2))
<==>
foo = XAction.MakeFactory(typeof(Foo)).Create(arg1, arg2) : Foo;
-------------------------
atomic(foo.Bar(x,y,z))
<==>
XAction.Run(XStart(foo.Bar), x, y, z);
--------------------------
retry
<==>
XAction.Retry();
--------------------------
y = attempt { Get1, Get2 } (arg1, arg2)
<==>
y = XAction.Run(XAction.OrElse(new XStart(Get1), new XStart(Get2)), arg1, arg2) : int;
y = attempt { Get1, Get2, Get3 } (arg1, arg2)
<==>
y = XAction.Run(XAction.OrElse(new XStart(Get1), XAction.OrElse(new XStart(Get2), new XStart(Get3))), arg1, arg2) : int;
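For what it's worth, here is a hedged sketch of how the composed attempt would behave, assuming the orElse semantics from the Composability paper (the transactional queues and method bodies are illustrative, not part of the downloaded library):

static int Get1(int arg1, int arg2)
{
    if (queue1.Count == 0)
        XAction.Retry();       // nothing here: control falls through to Get2
    return queue1.Dequeue();
}

static int Get2(int arg1, int arg2)
{
    if (queue2.Count == 0)
        XAction.Retry();       // both empty: the whole attempt blocks and retries later
    return queue2.Dequeue();
}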
Suppose I have a cache of some calculation that I compute under a lock with double-checking to be thread-safe, like this:
private static readonly Dictionary<Type, int[]> _typeInts = new Dictionary<Type, int[]>();

public int[] TypeInts
{
    get
    {
        Type thisType = this.GetType();
        if (!_typeInts.ContainsKey(thisType))
        {
            lock (_typeInts)
            {
                if (!_typeInts.ContainsKey(thisType))
                {
                    _typeInts[thisType] = <<calculate int[]>>;
                }
            }
        }
        return _typeInts[thisType];
    }
}
Would I be able to replace this with:
private static readonly Dictionary<Type, int[]> _typeInts = new Dictionary<Type, int[]>();

public int[] TypeInts
{
    get
    {
        Type thisType = this.GetType();
        atomic
        {
            if (!_typeInts.ContainsKey(thisType))
            {
                _typeInts[thisType] = <<calculate int[]>>;
            }
        }
        return _typeInts[thisType];
    }
}
My intuition says it should work if I do all the looking up and updating of _typeInts inside an atomic block.
But suppose ContainsKey has run and another thread interrupts before the int[] is calculated. The other thread runs the same atomic operation to completion. Would the first atomic operation retry, because some state it read was updated, or would it throw an exception when trying to insert a duplicate key?
ebdrup wrote:My intuition says it should work if I do all the looking up and updating of _typeInts inside an atomic block.
But suppose ContainsKey has run and another thread interrupts before the int[] is calculated. The other thread runs the same atomic operation to completion. Would the first atomic operation retry, because some state it read was updated, or would it throw an exception when trying to insert a duplicate key?
I can see how STM could have some performance implications: if the atomic blocks get large, there is a lot of bookkeeping about what memory has been read and written. I can also see there is a lot of room for smart solutions that reduce the amount of bookkeeping necessary. But I love the power of the concept of transactional memory.
I'm wondering: couldn't STM have been implemented by inserting the necessary locks, or, probably better, a combination of locks (chosen so they don't deadlock) and re-evaluation where that is deemed smart? It's easy to think up scenarios where re-evaluation has performance problems, and you need a lot of good heuristics to ensure good performance in most cases. If you let the compiler insert the necessary locks, it can analyze for deadlocks, at least in some cases, and you eliminate the problem of having to keep a global lock order, which you talk about in the video.
Why did you opt for re-evaluation instead of inserting locks? (I guess the answer is composability.)
jan.devaan wrote:Can we have it tomorrow?
may I suggest that you all visit comp.programming.threads once in a while… We have the goods wrt scalability and throughput. We will give you an honest appraisal, and not mislead you with TM!
Thank you,
Chris Thomasson
http://appcore.home.comcast.net/
(portable lock-free data-structures)
cristom wrote:http://groups.google.com/group/comp.arch/msg/e0958ecf43f95f51
Any thoughts?
P.S.--may I suggest that you all visit comp.programming.threads once in a while… We have the goods wrt scalability and throughput. We will give you an honest appraisal, and not mislead you with TM!
Thank you,
Chris Thomasson
http://appcore.home.comcast.net/
(portable lock-free data-structures)
“Lock-less programming is all well and good for a few very specific cases, but you still don't get composability (which is the whole point of using transactions).”
I completely disagree. There are many different types of lock-free algorithms you can use to solve many of the fundamental problems that are attributed to the reader/writer problem:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/576ad37c34dd673a
This particular race condition is common to basically any multi-threaded application, so lock-free is more useful than you think… As for composability, well, I have invented a lock-free eventcount algorithm which can be used as a completely composable signal, with atomic-operation-free fast-pathing on every call. This algorithm can work with basically any lock-based or lock-free algorithm; the heavy fast-pathing makes it work with lock-free. So your assertion that lock-free algorithms are not composable is fundamentally flawed.
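To make the eventcount idea concrete, here is a toy lock-based sketch of the interface (the names are illustrative, and the real algorithm is lock-free with atomic-free fast paths, which this toy does not attempt):

using System.Threading;

class EventCount
{
    private readonly object _mon = new object();
    private int _count;

    // Take a "ticket" before checking your predicate.
    public int Get()
    {
        lock (_mon) return _count;
    }

    // Sleep until somebody signals after the ticket was taken.
    public void Wait(int ticket)
    {
        lock (_mon)
            while (_count == ticket)
                Monitor.Wait(_mon);
    }

    // Writers call this after changing the guarded state.
    public void Signal()
    {
        lock (_mon)
        {
            _count++;
            Monitor.PulseAll(_mon);
        }
    }
}

The protocol is: grab a ticket, re-check your condition, and only then wait on the ticket; that ordering is what prevents lost wakeups.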
“Also, the performance argument is inherently flawed. First of all, "sufficient performance" is a moving target. If I can make use of four times as many threads in my program, but it runs twice as slow, then it's still a win (if not now, then in a few years). Also, don't forget that STM basically gives you exception safety for free.”
Nonsense. I advise you to take the challenge on my AppCore site:
http://appcore.home.comcast.net/
Have fun trying to get the STM version of a FIFO queue to compete with an algorithm that has no atomic operations (e.g., interlocked) and no #StoreLoad or #LoadStore memory barriers. This is just one example. If you use STM, then I can always design something that will have better scalability, throughput, and overall performance characteristics. There is another class of lock-free algorithms that allows raw pointer access into mutable shared data-structures without using any atomic operations or #StoreLoad (or even #LoadStore) style memory barriers. There are many different ways to implement this. I suggest that you search Google Groups, in comp.programming.threads, for 'Partial Copy-On-Write Deferred Reclamation (PDR)'…
http://groups.google.com/group/comp.arch/search?hl=en&group=comp.arch&q=vzoom&qt_g=1
… Lock-free atomic reference-counted pointers:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/f2c94118046142e8
http://appcore.home.comcast.net/vzoom/refcount/
Please note that Boost shared_ptr<> doesn't work for this kind of stuff, because its thread-safety level is only "basic".
… Fast-pathed read-write locks:
… Mutexes, or 100% lock-based STM:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/e0c011baf08844c4
(a simple multi-mutex algorithm)
“I also think it's a mistake to consider transactions for everything. I understand why this is such a common knee-jerk reaction, since most programmers are more familiar with imperative languages, but in my opinion using a language whose fundamental method of computation is "modify state" in a multithreaded environment, where the state is shared and all the threads depend on the state remaining consistent from their point of view, is pretty much a bad idea. You're just asking for problems.”
Speak for yourself. There is no reason to fear mutable shared data-structures. Heck, you want me to give you one of the main fast-paths for my database server? Oh boy… It's made up of PDR and huge lock-free shared linked data-structures… STM will livelock for the kind of stuff I am programming, unless it folds under its 'contention-manager security blanket' under load. This is unacceptable; anytime your application "depends" on an explicit contention manager, IMHO, you know you have problems…
“STM really begins to make perfect sense when you consider it in a purely functional context. You may sprinkle in some imperative nuggets here and there when needed for algorithmic reasons (see the ST monad in Haskell - imperative code in a pure setting with static guarantees that no side effects leak out). These small sections of imperative code wouldn't be parallelized, of course. Then the vast majority of your code uses the pure functional approach with parallelism (e.g. using the "par" keyword in Haskell, or some of the new data parallel stuff), and then at the very bottom you have some thread-based shared-memory concurrency where it makes sense, which is where STM comes in.”
Humm… STM is using shared memory, so you can't get rid of it at all. Plus, if you put it in the hardware, the cache-coherency mechanism would simply have to be really strict. I am in favor of weak caches and a well-defined, well-documented memory model, not monolithic cache-coherence designs! Nasty!
“I don't think shared state concurrency with threads is the best way to do concurrency, but I don't think we can do completely without it either. And for that code, which in typical programs would be a very small fraction, you need something clever to get composability - STM fits the bill perfectly.”
You have to learn the fundamentals of multi-threaded programming. I recommend "Programming with POSIX Threads" by David Butenhof; it's excellent. Don't sacrifice yourself to a composability-or-nothing programming paradigm! You will end up binding yourself to tons of overhead. STM will lead you down the wrong road, IMO. You can always design lock-based algorithms with POSIX that consistently outperform their well-crafted STM equivalents. STM has to orchestrate too many things. It's kind of like a giant LL/SC atomic operation. Of course, LL/SC is subject to livelock, even in hardware; special care has to be taken. You have to consider the size of the reservation granularity: the larger the granule, the more prone to livelock. STM is a giant LL/SC with the reservation granule being every single operation you make. If there is any interference, your transaction, or LL reservation, gets shot in the head!!
;^)
STM, well, that’s not an ideal synchronization scheme at all. IMHO, of course.
-- P.S.
Please visit comp.arch, or comp.programming.threads some time…
Cristom, I'm not a huge fan of your style of argument. You take on a very superior tone (such as assuming I don't know anything about multithreading) and make your point by posting tons of links, which makes following your argument very time-consuming if not impossible. No doubt that leads you to "win" a lot of arguments by walkover, but you're not actually convincing anyone. Summarize your point in a way which is relevant to this discussion; don't assume I'm willing to get involved in 20 previous discussions to try to deduce your point from there.
However, you're missing the target completely. I already conceded that lock-free programming is all well and good in certain cases, so there is no point linking a whole bunch of lock-free abstractions. What I claimed was that while it is possible to write a lock-free abstraction that will run faster than an STM-based one (as you point out), you can trivially write an STM transaction that simply is not expressible with lock-free programming. STM gives you general composability (i.e. always); lock-free programming does not (which does not mean that it's impossible to write composable abstractions in specific cases, i.e. sometimes).
You also seem to have misunderstood my point about using STM in purely functional programming, because I can't make sense of your response to it. You're not making the erroneous assumption that "purely functional programming == no mutable data", are you? Pure FP simply means that there are strong static guarantees ensuring that a pure function won't use mutable state in any way which interferes with other pure functions (read about the ST monad). You also need some "impure functions" (really they're just pure values which represent impure functions through monads, so it's still pure) to do IO and some of the low-level stuff, and that's the only place you'd use STM. So even if you think performance is horrible for STM, a typical functional program only has a handful of places where it would be useful anyway, so it's not a big problem: 90% of the program is purely functional and thus has no problem with parallelism (which is different from concurrency). Most of my FP applications have no mutable variables at all, and the ones that do typically have one or two or so (usually the ones with GUIs or lots of IO stuff). This is an ideal setting in which to introduce STM.
“Cristom, I'm not a huge fan of your style of argument. You take on a very superior tone (such as assuming I don't know anything about multithreading)”
IMHO, you were coming across as if mutable shared data is too advanced or something… The ability to access mutable data without using atomic operations and memory barriers is key to scalability. This is impossible with STM. Period. End of story. BTW, if you know of a TM implementation that doesn't make frequent use of those expensive operations, please post a link. I would be interested. Indeed.
“and make your point by posting tons of links, which makes following your argument very time consuming if not impossible.”
If you follow those links, you should have a better understanding of STM…
“No doubt that leads you to "win" a lot of arguments by walkover, but you're not actually convincing anyone."
I usually convince those who follow the links and do some research. There is more than one issue involved.
“Summarize your point in a way which is relevant to this discussion, don't assume I'm willing to get involved in 20 previous discussions to try to deduce your point from there.”
The summary would be a couple of pages long… I would rather post links to prior art.
“However you're missing the target completely. I already conceded that lock free programming is all well and good in certain cases, so there is no point linking a whole bunch of lock free abstractions.”
Just showing you what's out there. The video did not point out any alternatives. I felt I just had to.
“What I claimed was that while you can write a lock free abstraction that will run faster than an STM based one (as you point out), you can trivially write an STM transaction that simply is not expressible with lock free programming.”
Well, an STM can itself be lock-free, although most are only obstruction-free, but that's another matter. I would post a link, but you don't like that, so research it on your own, I guess. So you are incorrect to say that you can do stuff with STM that you can't do with lock-free programming; an STM can be based on a lock-free algorithm. See my point here?
“STM gives you general composability (i.e. always), lock free programming does not - which does not mean that it's impossible to write composable abstractions in specific cases(i.e sometimes).”
It's not impossible. The price you pay for general composability is far too great. BTW, check this out:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/a0966697738790c4/
you can get composability with POSIX mutexes and condition variables. The video did not mention any major alternatives. Unimpressed.
“You also seem to have misunderstood my point about using STM in purely functional programming because I can't make sense of your response to that. You're not making the erroneous assumption that "purely functional programming == no mutable data", are you?”
You abstract the shared memory away so much that all of its advantages are rendered useless by the massive amounts of coherency traffic that the TM generates.
The video mentions the overheads, and seems to fall back on the good ol' saying:
“Oh well, we will wait for hardware support.”
Man… TM in hardware is a no-go.
The cache-coherency mechanism that could properly support a robust transactional-memory scheme would have to be really strict, IMHO... Think of scenarios in which there are lots of reader threads iterating over large shared linked data-structures in parallel... The transactional coherency system could wind up having to track the many thousands of read transactions generated by the frequently reading threads... It would die from livelock; ahh, the explicit contention manager to the rescue! So, when times get tough for TM, it is basically forced to fold under its security blanket and wait for things to cool way down for a while... This is because TM is kind of like an obstruction-free algorithm... One small, trivial interference from another thread, and the operation dies and asks the contention manager if it can retry... I don't think TM can scale very well at all wrt the scenarios I briefly described...
That is a lot of work for a synchronization instruction... Whatever happened to plain old CAS!
:O
IMHO, they have to implement transactional memory in the hardware if they want it to be even remotely usable. I can make use of a transactional instruction in vZOOM; however, I would try to avoid its use at all costs...
Transactional instructions have simply got to be more expensive than a "normal" CAS or LL/SC... I would not allow my reader threads to use transactional instructions when they are traversing large data-structures. This is not because I don't like TM; I don't let my reader threads use CAS, or any atomic operations for that matter... Reader threads don't seem to like these types of instructions. The number of searches they can perform per second changes drastically when CAS and/or memory barriers are used. I can only imagine the overhead involved with many reader threads frequently searching/traversing a data-structure that has hundreds of thousands or even millions of nodes during periods of load. Designing a TM can be tedious.
My main problem with TM is that reader threads have to call into the transaction API before they traverse a data-structure. In most cases every read has to be tracked; validating at the page level is too coarse. If they end up traversing tens of thousands of nodes, that's a lot of checking/validating. If the application is under load, and writer threads modify the page of memory that a reader thread happens to be reading from, that raises a red flag in the presented algorithm. Then the validation process has to drill down on the page to refine transaction granularity. If all it takes is a simple store into the page to abort a transaction, then livelock can and will occur... So they have to get more fine-grained than a page...
The cache-coherency protocol deals with transactions that fit into a cache line... Humm, this may make the protocol too complex/strong. I am NOT a big fan of extremely strong cache coherency. It's a shame that TM might be responsible for this nonsense with the strong cache. I want an architecture with very weak cache coherency and crystal-clear memory-model documentation. I would be fine with the Alpha. But STM on an Alpha would be a joke. Heck, you can't even use it for any sort of distributed algorithms either.
Any thoughts?
I really think you're not trying to see the point here. It also seems that you may have a financial interest in not seeing the point, which makes arguing with you pretty much a waste of time.
Look at the video again. They specifically say that skilled programmers can use existing techniques to write correct code; they're not claiming that it's impossible to work around the difficulties of multi-threaded programming with existing techniques. The point here, though, is that it is difficult. Your own posts are a case in point. Why are you so proud of your inventions of various lock-free data structures if it's so easy? Because it isn't very easy, which is the whole reason you want STM instead.
Getting everything right with locks or lock-free programming techniques is difficult. If you're selling software based on this, I can understand why you would argue otherwise, but I'm not going to waste my time arguing against a paycheck (everyone whose salary isn't based on pretending it's easy will agree that it's difficult). Not impossible, but difficult. Your argument that it's possible to use other strategies is completely pointless. It's possible to write a full commercial business application using nothing but assembly as well, but is it practical? No, not really, and as concurrency becomes more common, using sophisticated and clever lock-free programming techniques will stop being practical as well.
Also, I don't think you're using the word "composability" in the same way I am (nor indeed the way they use it in the video). E.g. my point is that if you have language support for STM and you take a hash table, then you can do the operation "remove A, insert B" without leaving the intermediate state visible (and you can do that in combination with other techniques, still without leaving any intermediate states visible). If the hash table is written using lock-free and/or lock-based abstractions to be thread-safe for insertions and deletions, that still doesn't mean you can do the operation you want in an atomic way.
What if I want to read 4 elements, or none at all, from your nice lock-free FIFO queue? What if I want to do that operation on two different FIFOs (from different vendors), or none at all? What if I want to combine that with some operations of my own, or do none at all?
STM composes because you can create new concurrency abstractions from existing ones without having to have access to the source code of the original abstractions.
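A minimal sketch of that point, assuming C#-style atomic blocks (the table and its contents are illustrative):

atomic
{
    // Neither intermediate state (A gone, B not yet there) is ever visible:
    table.Remove("A");
    table.Add("B", valueB);
}

Remove and Add can come from a library you have no source for; the atomic block composes them anyway.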
cristom wrote:
IMHO, they have to implement transactional memory in the hardware if they want it to be even remotely usable.
For those who don't follow links, here are some important snippets from some of them:
> I am wondering where RW locks are appropriate ??
Basically, they work fairly well with "read-mostly" data-structures. If you don't really care about performance and scalability, then most rw-mutex implementations should work perfectly fine; however, in some cases they are definitely not an "optimal solution". For instance, IMHO a rw-mutex would probably not be an ideal candidate for protecting a "critical shared collection" that can experience episodic periods of sustained "moderate-to-heavy write activity". Reader threads that use rw-mutexes to protect such a container would probably not perform and/or scale very well at all with regard to adding CPUs.

One of the reasons why is that most rw-mutex implementations force reader threads to use some rather expensive atomic/membar/blocking methods in order to enforce synchronization and proper data coherency among the participating processors; this is not very cache friendly, not at all. Reader threads would not really be able to keep the processors' caches primed and probably would not experience the full benefits of a modern multi-processor system. For instance, a reader thread that "frequently" executes a store/load-style membar would constantly be invalidating and thrashing the cache; not good at all! The thread could spend most of its time "waiting" for the modifications to the rw-mutex's internal state to become "fully visible" to all of the processors involved. This process is usually made up of fairly expensive operation(s) that may involve the overhead of cache snooping and/or locking the system bus. The overhead involved could actually prevent reader threads from truly executing in parallel; not good.

Therefore, IMHO, a rw-mutex is generally a bad solution for a critical shared data-structure if the overhead of acquiring and releasing it is greater than the overhead of reading the data-structure in a single-threaded application...
> is it appropriate when only you don't want 'dirty' reads??
Not really, but yes, they do eliminate the possibility of processing "stale data"; however, as stated above, the methods involved are usually expensive. Fortunately, there is a more advanced class of algorithms out there that does not tend to add a boatload of extra overhead and can allow reader and writer threads to execute in parallel. Please note that this behavior alone can be beneficial to many application designs. For instance, since readers no longer block writers, an existing synchronization design can usually be based on a "much simpler locking scheme"; however, stale reads can and will occur in this type of setup. Luckily, one can address stale/dirty reads in a number of interesting ways:

http://groups.google.com/group/comp.programming.threads/msg/523aa1e77...

Another solution for stale reads involves writer threads making copies of existing shared data-structures and then atomically swapping them with the existing ones. A garbage-collection scheme can be used to free the "old" data-structures. This solution does not add overhead to reader threads, unless the method of garbage collection used is sub-optimal...
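A minimal C# sketch of that copy-and-swap idea, leaning on the .NET garbage collector to reclaim old snapshots (the type and its names are illustrative):

using System;
using System.Threading;

class CowIntSet
{
    private int[] _items = new int[0];  // each array is an immutable snapshot

    public bool Contains(int x)
    {
        int[] snap = _items;            // a single reference read; no locks for readers
        return Array.IndexOf(snap, x) >= 0;
    }

    public void Add(int x)
    {
        while (true)
        {
            int[] old = _items;
            int[] copy = new int[old.Length + 1];
            Array.Copy(old, copy, old.Length);
            copy[old.Length] = x;
            // Publish the new snapshot only if no other writer beat us to it.
            if (Interlocked.CompareExchange(ref _items, copy, old) == old)
                return;                 // the GC eventually frees the old snapshot
        }
    }
}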
"John Hickin" <hic...@nortelnetworks.com> wrote in message
news:ec4u46$hvv$1@zcars129.ca.nortel.com...
> "David Schwartz" <dav...@webmaster.com> wrote in message
> news:1155889979.905637.238660@i42g2000cwa.googlegroups.com...
[...]
http://www.computer.org/portal/site/computer/menuitem.5d61c1d591162e4b0ef1bd108bcd45f3/index.jsp?&pName=computer_level1_article&TheCat=1005&path=computer/homepage/0506&file=cover.xml&xsl=article.xsl&
("The Problem with Threads" - IEEE's Computer magazine article)
> This example is quite good and it explains what I don't like with threads.
> In the real world, 8 of the 10 people in your example will go on to do
> something else. In a multi-threaded world, 8 threads are blocked.
http://groups.google.com/group/comp.programming.threads/msg/a3e38e27d...
( refer to the pseudo-code at the bottom of the msg )
Any thoughts?
> Unless, of
> course, you use some sort of lock-free algorithm. Unless I've missed the
> mark, lock-free is quite avant-garde, and not taught at the undergraduate
> level. It may be time for a change there.
Totally agreed:
http://groups.google.com/group/comp.lang.c++.moderated/msg/d07b79e963...
FWIW:
I don't agree with the common assertion that lock-free programming is far
too complicated for a "normal" programmer to learn.
I also don't agree with the assertion that threads are vastly too difficult
and fragile for any programmer to use... The paper the OP linked to seems to
attempt to make this type of assertion.
> Anyway, the article was interesting to read. The author is an IEEE Fellow,
> and you don't get that designation without earning it.
Humm... Did you read the part where it basically says that programmers who
use threads are insane? He points to a "folk saying",
'repeatedly performing the same action and expecting different results
defines a form of insanity'
(paraphrase)
Then he says this, again paraphrasing:
'if programmers that use threads were not insane, they simply would not be
able to understand any of their programs'
I wonder if he was being serious... Anyway....
Yikes! I guess that I am INSANE!!!!!!!!!
YeeeHaw.... Weeeee.....
I find it amusing that the IEEE Fellow seems unable to create a
thread-safe observer pattern... I have several lock-free observer patterns
that can rely on my vZOOM or Joe Seigh's RCU+SMR algorithms.
For some reason, I don't think that the author could create a highly
scalable thread-safe observer pattern implementation, lock-based or
lock-free, if his life depended on it...
:O
(simply EXCELLENT Wheeler posts!!!)
"Chris Thomasson" <_no_spam_cristom@no_spam_comcast._net> writes:
> IMHO, I believe it's too early in the game for actual hardware
> support, even though they really do need it in order for TM to
> become really useful. All of the software emulations I have seen are
> loaded with expensive atomic operations; mutexes seem more
> efficient...
original 801 had a form ... it was used by journal filesystem
for aix in rs/6000.
filesystem metadata was organized ... and the memory system caught all
changes ... so basically you avoided having to do all the explicit
logging/locking statements in the filesystem code ... you did have to
add commit calls.
there was some disagreement whether the overhead of explicit
logs/locks was more expensive than the implicit overhead involved in
transactional memory. turns out there was the overhead of scanning the
whole memory area for changes at commit.
a portable version of the journal filesystem code with explicit
lock/logging changes in the code turned up that the explicit lock/logging
calls had lower overhead than transactional memory (at least in the
journal filesystem case).
and unrelated old reference from about same period as aixv3 ... the
Proceedings of the 20th International Symposium on Fault-Tolerant
Computing Systems, pp 89--96, Newcastle, June 1990.
A Fault Tolerant Tightly Coupled Multiprocessor Architecture based on
Stable Transactional Memory
Authors: Michel Banatre and Philippe Joubert
Abstract:
Traditionally, tightly coupled multiprocessors allow data sharing
between multiple caches by keeping cached copies of memory blocks
coherent with respect to shared memory. This is difficult to achieve
in a fault tolerant environment due to the need to save global
checkpoints in shared memory from where consistent cache states can be
recovered after a failure.
The architecture presented in this report solves this problem by
encapsulating the memory modifications done by a process into an
atomic transaction. Caches record dependencies between the
transactions associated with processes modifying the same memory
blocks. Dependent transactions may then be atomically committed. Such
an operation requires a cache coherence protocol responsible for
recording process dependencies as well as keeping coherent cached
copies of blocks and a shared Stable Transactional Memory owing to
which all memory updates are atomic to allow recovery after a
processor failure.
--
Anne & Lynn Wheeler | http://www.garlic.com/~lynn/
"Chris Thomasson" <cris...@comcast.net> writes:
> Good point. Do you happen to know if they implemented it so they could
> successfully pop from a lock-free stack? I wonder....
charlie had invented compare&swap as part of his work on fine-grain
locking (leading to some number of lock-free operations) for cp67
(360/67) multiprocessor support at the science center
http://www.garlic.com/~lynn/subtopic.html#545tech
the trick then was to come up with a mnemonic that matched Charlie's
initials, CAS.
the attempt was then made to get the instruction into the up and
coming 370 architecture. working with ron smith in the pok 370
architecture group (they owned the 370 architecture "red book"), the
response was that 370 didn't need another multiprocessor specific
instruction, that the test and set from 360 was sufficient.
to get compare and swap into the 370 architecture we had to come up
with usage for compare&swap that wasn't multiprocessor specific.
thus was born some number of examples that were applicable to
multi-threaded applications that might be running enabled for
interrupts ... independent of whether the machine was single processor
or multiprocessor.
originally in the 370 principles of operation, the examples were part
of the programming notes for the compare&swap instruction. in
subsequent versions of the principles of operation the examples were
moved to a section in the appendix.
also as part of this activity, the compare&swap double instruction was
added in addition to compare&swap. that resulted in two instructions
for 370, compare&swap along with compare&swap double ... so the
instruction mnemonics became CS and CDS (instead of CAS ... defeating
the original objective of coming up with the instruction name
compare&swap).
total topic drift ... science center was like that ... GML
http://www.garlic.com/~lynn/subtopic.html#sgml
precursor to SGML, HTML, XML, etc ... also invented at the science
center; the letters are actually the first initials of the last names
of the three inventors (and you probably thought it stood for
generalized markup language).
misc. past posts on multiprocessor support and/or compare&swap
instruction
http://www.garlic.com/~lynn/subtopic.html#smp
esa/390 principles of operation appendix for multiprogramming
(i.e. multithread) and multiprocessing
http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/DZ9AR004/A...
cs & cds appendix:
http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/DZ9AR004/A...
bypassing post and wait
http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/DZ9AR004/A...
lock/unlock:
http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/DZ9AR004/A...
free pool manipulation
http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/DZ9AR004/A...
and more recent z/architecture (64-bit) principles of operation
multiprogramming and multiprocessing examples appendix
http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/DZ9ZR003/A...
note that the above also includes discussion of the newer PLO ... perform
lock operation instruction
--
Anne & Lynn Wheeler | http://www.garlic.com/~lynn/
- conversation with Andy...
"Andy Glew" <first.l...@employer.domain> wrote in message
news:peyp8xnzl9g2.fsf@PXPL8591.amr.corp.intel.com...
>> > Are there a rule of thumb as to when LL/SC starts showing
>> > scalability issues?
>> IMHO there is no rule of thumb. Scalability issues are not typically
>> induced by LL/SC itself.
> I disagree.
> Although maybe it's not really scalability issues. LL/SC starts
> introducing deadlock / forward progress issues the moment you have
> more than one processor on any interconnect that is not a simple
> snoopy bus, and they just get worse as you add more, and have more
> complicated interconnects.
Something like this?:
http://groups.google.com/group/comp.arch/msg/d40cf092b679169d
Coarse reservation granularity can cause livelock in LL/SC via simple
false sharing. I really do believe that HTM is going to suffer from the same
thing... Multiple processors can livelock if they keep interfering with the
reservations that track the TM.
Humm... Kind of like the livelock that is inherent in any obstruction-free
algorithm; they need to use "backoff-and-retry" algorithms to keep up
forward progress...
A good response post from Intel architect Andy Glew:
> Humm... Kind of like the livelock that is inherent in any obstruction-free
> algorithm; they need to use "backoff-and-retry" algorithms to keep up
> forward progress...
Yeah, I was thinking of correcting my older post to try to get to this.
It is not fundamentally LL/SC vs. CAS and other atomic RMWs.
It is more that LL/SC was originally proposed and implemented via an
"optimistic concurrency" type of address "link", whereas CAS and other
atomic RMWs are traditionally implemented via a "lock" - whether bus
lock, cache lock, or address lock.
If other people try to access the lock while the RMW is in flight they
are stalled or told to go away, come back later. This guarantees
forward progress.
Whereas if other people access the link between the LL and SC, it is
the SC that fails. In the presence of an adversary doing ordinary
stores, the LL/SC might never complete.
But these are just implementation artifacts. E.g. you can implement
CAS or an atomic RMW with LL/SC, whether in user code or microcode -
and such an RMW would have the same forward progress problems as
LL/SC. You can implement a hybrid - try the optimistic concurrency
approach first, and then try the lock approach.
Similarly, you can implement LL/SC by acquiring a lock, guaranteeing
that the SC will finish. But then you need some way of backing out,
protecting yourself against a malicious user who does the LL but never
the SC. E.g. the Intel 860 released such a lock after 32 instructions.
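To illustrate the retry structure described above, here is a hedged C# sketch of an atomic increment built from CAS; a failed CompareExchange plays the role of a failed SC:

using System.Threading;

static class Atomics
{
    public static int Increment(ref int target)
    {
        while (true)
        {
            int observed = target;  // like LL: note the current value
            // Like SC: publish only if nobody changed it since we looked.
            if (Interlocked.CompareExchange(ref target, observed + 1, observed) == observed)
                return observed + 1;
            // Another thread interfered; reload and retry. As with LL/SC, an
            // adversary doing ordinary stores could starve this loop forever.
        }
    }
}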
Chris Thomasson wrote:
> "Joe Seigh" <jseigh...@xemaps.com> wrote
>>What's with the Sparc architects?
> IMHO, I believe the major reason 32-bit architectures support DWCAS was to
> be "64-bit ready"; DWCAS gets converted to normal CAS on 64-bit systems. So,
> if your 32-bit applications used DWCAS, then they can run on a 64-bit arch
> without changes... Of course there is a caveat...
> A 32-bit application cannot use DWCAS to manipulate a pointer along with
> another contiguous variable...
> struct dosent_work_on_64_bit_s {
> void *a;
> whatever b;
> };
> IMO, the hardware architects did NOT have lock-free algorithms in mind when
> they decided to put DWCAS in their 32-bit instruction sets. The fact that
> 128-bit CAS is not reliably supported seems to support my opinion...
IBM S370 had double-wide compare and swap long before 64-bit support
became an issue.
>>There's a bit of a disconnect here I think.
> Perhaps they are designing the lock-free algorithms for an upcoming processor
> design. Maybe one that has hardware transactional memory... That could be
> the reason they designed KCSS:
They've done a bit on STM (software transactional memory). If they did
come out with hardware-based transactional memory, it would be after the
fact of 64-bit Sparc and wouldn't be generally available. So it would
have to be hidden behind some system or library API with alternate
implementations on non-TM platforms. That would limit its applicability.
For example, you couldn't have atomically thread-safe refcounted smart
pointers. Well, not without RCU and/or SMR. If they go with STM to
complement hw TM, it's only going to work at a lower contention level
if the STM algorithm is only obstruction-free.
It's a big problem. You can design efficient synchronization mechanisms
that are portable if you ignore Sparc. If you want portability to
Sparc you have to sacrifice performance or functionality.
If Sun has a 64-bit JVM implementation, I wonder what they do for
the AtomicMarkedReference and AtomicStampedReference implementations
for non-Sparc platforms. Cripple them so they perform as poorly as
the Sparc implementations?
--
Joe Seigh
When you get lemons, you make lemonade.
When you get hardware, you make software.
Text taken from:
http://groups.google.com/group/comp.arch/msg/e1e391055c7acecb
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/9c572b709248ae64
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/f6399b3b837b0a40
http://groups.google.com/group/comp.arch/msg/1b9e405080e93149
http://groups.google.com/group/comp.arch/msg/bbbc035cf1a8502c
http://groups.google.com/group/comp.arch/msg/11b14c4bda2d5d82
http://groups.google.com/group/comp.arch/msg/9b00fda2752966f9
http://groups.google.com/group/comp.arch/msg/335aeb22fd6fe526
http://groups.google.com/group/comp.arch/msg/1ace9400b1b16cd4
http://groups.google.com/group/comp.arch/msg/995379a16beb3b69
(simply excellent Wheeler post)
and more...
Any thoughts?
;^)
cristom wrote:For those who don't follow links, here are some important snippets from some of them:
SNIP!
"I really think you're not really trying to see the point here. It also seems that you may have a financial interest in not seeing the point. Which makes arguing with you pretty much a waste of time."
IMHO, STM is a deceptive marketing ploy. They never get into the alternatives. BTW, I mention many different ways to do things. I did not just refer you to my vZOOM algorithm; I also referred you to Joe Seigh's RCU+SMR algorithm, and some others. So you're wrong again… Unlike the video, I point out all of the caveats and explain them in detail.
"Look at the video again. They specifically say that skilled programmers can use existing techniques to write correct code - they're not claiming that it's not possible to work around the difficulties of multi threaded programming with existing techniques.
The point here, though, is that it is difficult. Your own posts is are a case in point. Why are you so proud of your inventions of various lock-free data structures if it's so easy?"
The complexity can be hidden behind a simple and portable API that anybody can use, like the pthreads API. We all know that a particular pthread implementation may be very complex, and may very well use some lock-free algorithms. But does the average user actually care about this kind of stuff? My point is, again: "The complexity can be hidden behind a simple API." Really.
Would there be bugs... Well, maybe; it depends on the implementer. There may be very subtle bugs in your pthread implementation as well...
"Because it isn't very easy, which is the whole reason for why you want STM instead."
No. STM is NOT a good solution if you're interested in excellent scalability and performance.
"Getting everything right with locks or lock-free programming techniques is difficult - if you're selling software based on this I can understand why you would argue otherwise, but I'm not going to waste my time arguing against a paycheck (everyone whose salary
isn't based on pretending it's easy will agree that it's difficult). Not impossible, but difficult. Your argument that it's possible to use other strategies is completely pointless. It's possible to write a full comercial business application using nothing
but assembly as well, but is it practical? No not really, and as concurrency becomes more common using the sophisticated and clever lock-free programming techniques will stop being practical as well."
Utter nonsense. The power of lock-free programming techniques will be in demand. Man, concurrency has been common for a very long time now. Ever heard of RCU? Heck, this stuff, including STM and HTM, has been around for at least 25 years. Don't try to act like this is something new. Lock-free has been around for ages, and it's not going to go out of practice. It's going to be in demand. That is, if you want your application to have excellent scalability and throughput characteristics.
"What if I want to read 4 elements, or none at all, from your nice lock free FIFO queue?"
Simple: you use an eventcount.
"What if I want to do that operation on two different FIFOs > (from different vendors), or none at all? What if I want to combine that with some operations of my > own, or do none at all?"
This is an example of bad programming practice. Having to force things together like that is a sign that there is a MUCH better way. The overhead involved with your request is too great. Again, I urge you to post to comp.programming.threads and comp.arch. Please.
Anyway, you can use the eventcount to add fine-grained, ultra-low-overhead conditional blocking and coordination to your lock-free collections. Too much to post here; the technique deserves its own thread. STM can't compare with this wrt scalability; it's dead in the water. The performance advantage of most alternatives over STM is usually an order of magnitude.
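A hedged sketch of that waiting pattern, reusing the toy EventCount sketched earlier in the thread (the lock-free queue and its producers, which would call ec.Signal() after enqueueing, are illustrative):

// Block until the queue appears to hold at least 4 items.
while (true)
{
    int ticket = ec.Get();          // take a ticket first...
    if (queue.Count >= 4) break;    // ...then re-check the predicate
    ec.Wait(ticket);                // sleep until a producer signals
}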
"STM composes because you can create new concurrency abstractions from existing ones without having to have access to the source code of the original abstractions."
I am not willing to sacrifice my server applications' performance. Sorry, not sold.
C is fine. You can create an STM implementation in C/assembly. Just follow the rules:
http://groups-beta.google.com/group/comp.programming.threads/msg/423df394a0370fa6
Okay… If I had to use STM, I would use it in C, not Haskell. Humm, I have created experimental STMs that try to get around some of the overheads. I'm still not sure whether I want to patent them or not. Perhaps I should patent one and sell it as a much faster alternative to most of the STM implementations. Humm. A couple of implementations of mine beat Haskell by a wide margin. Indeed.
Okay… If I can't convince you that STM is not a good solution at all… How would you feel about using an STM algorithm with a pure C interface? I have several solutions. I created them way back when I was experimenting with, and inventing, my vZOOM algorithm; I needed something to test it against. I was very disappointed with every STM implementation out there, so I invented a couple of new ways to implement STM… I guess if STM is going to be rammed down programmers' throats, I might as well dig out the experimental prototypes and post them here…
I have two word-based STM implementations. One is written in about 75% IA-32 assembly language, and the other in about 90% SPARC V9 assembly language… They both have a C API/ABI with a standard transaction interface (e.g., begin, load, store, commit, abort, validate), and they both outperform Haskell.
Interested?
cristom wrote:
I am not willing to sacrifice my server applications performance. Sorry, not sold.
cristom wrote:C is fine. You can create an STM implementation in C/assembly.
...
Interested?
"And you don't have to. Writing a web server probably doesn't require STM at all."
I am mostly talking about database servers that make heavy use of shared memory for fast-pathing database queries. I can traverse mutable data-structures with atomic operations and/or memory barriers. Again, this is impossible with STM. Using STM would make that fast-path run very slowly because of the vast amount of coherency traffic that the TM logic generates.
ACK!!!
"I am mostly talking about database servers the make heavy use of shared memory for fast-pathing database queries. I can traverse mutable data-structures with atomic operations and/or memory barriers. "
Is suppose to read as:
I am mostly talking about database servers the make heavy use of shared memory for fast-pathing database queries. I can traverse mutable data-structures "WITHOUT" atomic operations and/or memory barriers.
Sorry for any confusion.
"Actually I'm really not. I've posted twice why I don't think this makes sense in an imperative language."
I think we have to agree to disagree on this. I have no problem with using C for creating high-end multi-threaded applications.
"I would recommend you post it and point it in the direction of Simpon Peyton Jones and Tim Harris though. If your implementation really is as good as you say, and it does what the Haskell one does, I'm sure they will be willing to do it your way instead."
It's definitely as good as I say it is; however, it doesn't have some of the features that Haskell has. You should think of my STM as a "lean-and-mean" minimalist version. I was going for extreme speed in the STM implementation itself; some of the stuff that Haskell supports makes this impossible.
;^(…
Anyway… I will think this over. To patent, or not to patent. I wonder how popular a fairly speedy C/C++ STM alternative to Haskell's would actually be. I personally would not even want to use my STM in any of my application infrastructures. But if most programmers go for STM, I could do well by providing a top-notch minimalist alternative in C/C++.
Humm..
cristom wrote:ACK!!!
"I am mostly talking about database servers the make heavy use of shared memory for fast-pathing database queries. I can traverse mutable data-structures with atomic operations and/or memory barriers. "
Is suppose to read as:I am mostly talking about database servers the make heavy use of shared memory for fast-pathing database queries. I can traverse mutable data-structures "WITHOUT" atomic operations and/or memory barriers.
Sorry for any confusion.
One other MAJOR point* about TM... STM or HTM, my implementation or not, both "suffer" from a not-so-well-known phenomenon: inconsistent data can, and will, be accessed inside a transactional atomic block! A user's algorithm in a transaction-based system lives entirely inside an atomic block, e.g.,
atomic { // start transactional atomic block
    users_algorithm {
        // Access transactional data-structures
    }
} // try to commit transactional atomic block
This means that the user's algorithm can actually "operate" on inconsistent data... No kidding, not at all... Therefore, the algorithm's behavior is basically undefined, whether the programmer likes it or not... Think about it for a minute... If your algorithm is fed "not-so-coherent" data, then it is subject to undefined behavior by default...
Wow... The implementations that I have seen actually have to catch any and all fatal exceptions that can, and WILL, arise when a user algorithm is subjected to inconsistent data... Sometimes your algorithm crashes right off the bat and gets intercepted by the STM, where the transaction will be aborted and retried... Sometimes your algorithm gets stuck in an infinite loop inside the atomic block (the most common problem, IMHO)... The STM has to watch out for this kind of stuff... IMHO, this is a MAJOR drawback and a downright dangerous aspect of transactional memory... This behavior is inherent in basically any TM implementation, hardware or software...
I need to think some more on this troubling subject... Now you know just some of the reasons why I don't like TM...
Yikes!
:O
* I can't believe I forgot to mention this; it totally slipped my mind. Right when I remembered it, I went over to Wikipedia to see if anybody had brought this fact up; somebody did... IMO, it's vital that anybody who thinks about using TM completely understands this particular caveat!
cristom wrote:One other MAJOR point* about TM... STM or HTM, my implementation or not, both "suffer" from a not-so-well-known phenomenon: inconsistent data can, and will, be accessed inside a transactional atomic block! A user's algorithm in a transaction-based system lives entirely inside an atomic block, e.g.,
atomic { // start transactional atomic block
    users_algorithm {
        // Access transactional data-structures
    }
} // try to commit transactional atomic block
This means that the user's algorithm can actually "operate" on inconsistent data... No kidding, not at all... Therefore, the algorithm's behavior is basically undefined, whether the programmer likes it or not... Think about it for a minute... If your algorithm is fed "not-so-coherent" data, then it is subject to undefined behavior by default... Wow... The implementations that I have seen actually have to catch any and all fatal exceptions that can, and WILL, arise when a user algorithm is subjected to inconsistent data... Sometimes your algorithm crashes right off the bat and gets intercepted by the STM, where the transaction will be aborted and retried... Sometimes your algorithm gets stuck in an infinite loop inside the atomic block (the most common problem, IMHO)... The STM has to watch out for this kind of stuff... IMHO, this is a MAJOR drawback and a downright dangerous aspect of transactional memory... This behavior is inherent in basically any TM implementation, hardware or software...
I need to think some more on this troubling subject... Now you know just some of the reasons why I don't like TM...
Yikes!
:O
* I can't believe I forgot to mention this; it totally slipped my mind. Right when I remembered it, I went over to Wikipedia to see if anybody had brought this fact up; somebody did... IMO, it's vital that anybody who thinks about using TM completely understands this particular caveat!
atomically $ do
  v1 <- readTVar t1
  v2 <- readTVar t2
  writeTVar t3 (willDivergeIfInconsistent v1 v2)
t1 and t2 will be validated before t3 is written to, and the write to t3 is the only thing that forces willDivergeIfInconsistent to be evaluated; at that point v1 and v2 are guaranteed to be valid, so nothing will ever be evaluated with inconsistent data.
This holds for every atomic computation, even with interleaved reads and writes, since all the data used by a function will be validated before it can be retrieved, and thus must be at least consistent enough to execute the call properly.
One problem with implementing software transactional memory is that it's possible for a transaction to read inconsistent state (that is, to read a mixture of old and new values written by another transaction). Such a transaction is doomed to abort if it ever tries to commit, but it's possible for inconsistent state to cause a transaction to trigger a fatal exceptional condition such as a segmentation fault or even enter an endless loop, as in the following contrived example from Figure 4 of "Language Support for Lightweight Transactions":
// Transaction A
atomic { if (x != y) while (true) { } }

// Transaction B
atomic { x++; y++; }

Provided x == y initially, neither transaction alters this invariant, but it's possible that transaction A will read x after transaction B updates it, yet read y before transaction B updates it, causing it to enter an infinite loop. The usual strategy for dealing with this is to intercept any fatal exceptions and periodically check whether each transaction is valid; if not, it can be aborted immediately, since it's doomed to fail in any case.
“E.g. let's say you have a game world, and 10000 game objects that interact with it in complex and data dependent ways (i.e. there is no way to partition the world into lockable regions). Each of these objects interact with about half a dozen of the other objects, and they do so concurrently, and the interactions must be atomic. How do you solve that with lock free data structures? The game world consists of multiple different data structures nested in complex ways (oct-trees, portal cell graphs, BSP trees, etc.). How do you solve it with locks? You really don't, unless you stick a big lock on the entire world, or use some global locking order or something equally detrimental to composability and productivity. STM in this case "Just Works".”
If you don't know how to write a high-end multi-threaded game engine without using STM, that's your problem. Game engines benefit HEAVILY from assembly language and other assorted low-level tricks and techniques. Low-overhead lock-free algorithms are a must. STM in a game engine? No way. I want my game to be very fast. STM would prevent the game engine from realizing the full potential of today's high-end architectures…
BTW, how would STM work on the Cell Processor? I don’t think it’s even possible for STM to even being to make use of the Cells SPU’s. Ha…. You want to program games? Do you know assembly language inside in out? Do you know how to use DMA with the Cells AltiVec SPU to shunt from shared memory to local SPU memory? Do you know the PPC instruction set? Do you know distributed/network programming paradigm for the Cell? What about the XBox 360? Why bog down its PowerPC with all of the coherency traffic generated from the crappy TM?
You want to get into game engines? I suggest posting something overing comp.programming.threads and/or comp.arch. BTW, have you ever programmer the IBM Cell Processor or have you done any lock-free synchronization algorithms that use the PPC instruction set? Do you know the AltiVec instruction set? I have experience with all of these. I would not want you on the design team if alls you have to say to synchronization issues in the game engine is STM… No way no how!
How skilled are you wrt high-end game engine synchronization issues? They are basically the same issues that exist in the fast-pathed part of a database server… Anyway, please follow-ups in comp.programming.threads and/or comp.arch. This forum seems to get too crowed...
Humm…
cristom wrote:One problem with implementing software transactional memory is that it's possible for a transaction to read inconsistent state (that is, to read a mixture of old and new values written by another transaction). Such a transaction is doomed to abort if it ever tries to commit, but it's possible for inconsistent state to cause a transaction to trigger a fatal exceptional condition such as a segmentation fault or even enter an endless loop, as in the following contrived example from Figure 4 of "Language Support for Lightweight Transactions":
Transaction A: atomic { if (x != y) while (true) { } }
Transaction B: atomic { x++; y++; }
Provided x=y initially, neither transaction above alters this invariant, but it's possible transaction A will read x after transaction B updates it but read y before transaction B updates it, causing it to enter an infinite loop. The usual strategy for dealing with this is to intercept any fatal exceptions and periodically check if each transaction is valid; if not, it can be aborted immediately, since it's doomed to fail in any case.
You use the splendid "you must just be stupid" argument, which is just great. So game engines are not difficult to parallelize if you're smart enough, and thus using any technology which would make this easier is an admission of stupidity. Well, that's just great.
Have fun with that.
You do know that the founder and technical director of Epic Games (Tim Sweeney), which recently released Gears of War for the XBox 360, is a heavy advocate of purely functional programming and STM for games, right? If not, check out "The Next Mainstream Programming Language: A Game Developer's Perspective". But I'm sure you know more than both him and me.
Also, I'm just going to stop arguing with you right here. I dislike your argumentation style, and you don't seem interested in learning anything new. Though I do think it's funny that you pretend to be an expert on game engines, and then proceed to share your wisdom on the subject to educate poor me in case I ever want to work on games (for the record, I actually work on game engine technology for a living, for the XBox 360).
Why would you use STM in a game engine? And how would you use STM on the Cell? Perhaps I, or somebody else at comp.programming.threads, can help you get a MUCH better design for any engine you may be working on.
"You do know that the founder and technical director (Tim Sweeney)of Epic games (recently released Gears of War for the XBox 360) is a heavy advocate for purely functional programming and STM for games, right? If not, check out "The Next Mainstream Programing Language, A Game Developer's Perspective". But I'm sure you know more than both him and me."
Yeah. Well, if Tim says so... I happen to completely disagree with him. BTW... Sadly, a lot of game programmers are not talented multi-threaded programmers:
http://groups.google.com/group/comp.programming.threads/msg/2c477272953fc5e0
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/6fe2dd51631afee5
http://groups.google.com/group/comp.arch/search?hl=en&group=comp.arch&q=games+multi-core&qt_g=1
That’s the sad truth. STM is not going to help them learn how to leverage lock-free programming against their very high-end game engines’ synchronization needs. Does Tim forbid his programmers from using any Assembly Language, too? I doubt it. I also doubt that he would choose an STM-based engine over a lock-free engine that outperforms it by an order of magnitude. He would have to be crazy to do something like that.
IMHO, game programming is all about squeezing all of the horsepower you can out of the architecture. STM prevents this because it will be flooding the cache coherency system with boatloads of unneeded traffic. Humm… Assembly language and game engines… Do they go together? Yes. Big time. It’s going to be interesting to see how the gaming industry programs the Cell with STM… Wow. It’s basically impossible. How would you even address the SPUs with STM?
As for me not wanting to learn anything new… Well, you’re wrong. I actually implemented several versions of STM; I know all of the issues. I have to make sure that I let everybody know what’s up, and NOT let them get misled by the marketing for STM.
"Again, if you validate x and y before computing the if statement in A, it can't go wrong. This is not an unavoidable problem."
Okay, well that makes STM doubly expensive. You have to see if you can commit before you try to commit? What if your validation has to cover thousands of reads? How do you address very large shared linked data structures in STM without livelock and explicit contention management?
I almost forgot: the video mentions that they have to use a garbage collector. I have an STM that does not depend on a GC at all. Funny… I wonder why Simon and Tim depend on a GC? It’s clearly not needed… A lightweight PDR scheme can implement STM…
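To unpack that claim a bit: "PDR" is the comp.programming.threads shorthand for partial-copy-on-write deferred-reclamation schemes (RCU-style quiescent-state techniques, hazard pointers, proxy collectors, and the like). The common idea is that memory retired by a writer is only freed after every thread has passed a quiescent point, so readers never touch freed memory and no tracing GC is needed. A minimal quiescent-state sketch, assuming a fixed thread count and with all names illustrative:

#include <stdatomic.h>
#include <stdlib.h>

#define NTHREADS 4

/* Global epoch, bumped at each retirement; every thread republishes
   the global value whenever it holds no shared pointers. */
static atomic_ulong global_epoch = 1;
static atomic_ulong local_epoch[NTHREADS];

/* Called by thread `tid` at a quiescent point (no shared refs held). */
void quiescent(int tid) {
    atomic_store(&local_epoch[tid], atomic_load(&global_epoch));
}

/* Free `p` only after every thread has passed a quiescent point newer
   than the retirement, so no reader can still hold a reference.
   (A real scheme queues retirements and frees in batches; this
   sketch just blocks.) */
void retire(int self, void *p) {
    unsigned long retired_in = atomic_fetch_add(&global_epoch, 1) + 1;
    quiescent(self);                  /* the retiring thread is clean */
    for (int t = 0; t < NTHREADS; t++)
        while (atomic_load(&local_epoch[t]) < retired_in)
            ;                         /* wait for thread t to pass by */
    free(p);
}

An STM's old object versions can be retired through something like this instead of leaning on a tracing collector; whether it is lighter in practice depends on how often threads announce quiescent points.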
Here are the introductory papers I used for my vZOOM CoolThreads project...
http://appcore.home.comcast.net/vzoom/
http://appcore.home.comcast.net/vzoom/round-1.pdf
http://appcore.home.comcast.net/vzoom/round-2.pdf
http://appcore.home.comcast.net/vzdoc/atomic/static-init/
This is a major alternative to STM. It outperforms it by orders of magnitude. The video should have mentioned other ways of doing things.
Any thoughts?
-- P.S. this is in regard to the SUN CoolThreads contest:
https://coolthreads.dev.java.net/
When the STM detects a long transaction that conflicts, it immediately switches to a more pessimistic approach (which works a lot like reader/writer locks).
For such cases the STM employs a contention manager (CM) that detects conflicts and decides, according to a pre-defined policy, which read-mode technique to use. By default, when a transaction starts, the CM attempts to use the optimistic read technique; once it detects a conflict, it considers switching to a more pessimistic approach (using reader/writer locks) depending on the transaction size (defined by the read count) and the re-execution count.
As you mentioned, a naive optimistic read technique with an automatic re-execution scheme can lead to starvation in cases where a large transaction conflicts with smaller transactions that execute periodically.
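A sketch of that policy, with made-up thresholds (the read count and retry count that trigger the switch are illustrative, not from any particular STM):

#include <stddef.h>

enum read_mode { READ_OPTIMISTIC, READ_PESSIMISTIC };

struct txn_stats {
    size_t   read_count;   /* transaction size, as defined above */
    unsigned retries;      /* how many times it has re-executed  */
};

/* Contention-manager policy: start optimistic; once a transaction
   has conflicted, fall back to rwlock-style reads if it is large or
   keeps getting re-executed, so it cannot be aborted yet again. */
enum read_mode choose_read_mode(const struct txn_stats *t) {
    if (t->retries > 0 && (t->read_count > 1024 || t->retries >= 3))
        return READ_PESSIMISTIC;   /* take reader locks up front */
    return READ_OPTIMISTIC;        /* validate versions at commit */
}

This is one way to get the starvation-freedom described above: the big transaction eventually stops racing the small ones and simply locks its read set.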