Let’s face it: multi-threading has created some pains in the… well, as Forrest would say, butt-tox. You used to be able to write code assuming you had your own nice little sandbox, especially in a more protective language like Java, which allows you to be reasonably sure that some other schmo’s pointer won’t come weaseling its way onto your turf. But now you can have multiple threads that may want to get to the same pieces of data, so the threads have to learn how to play nice together, and that’s not always easy.
The issues all build on a few key concepts that at first blush seem so simple and straightforward. Then again, if they were, we wouldn’t be having this little chat, now, would we? So we’ll start at the start, or at least close to it. What’s the problem with multiple threads sharing data? Well, there’s the obvious first problem that, physically, only one entity can have access to the data at a time. An arbiter can generally handle this, granting memory access in turn to requesters. Yes, theoretically there’s some extremely tiny chance that two requests could arrive at exactly the same attosecond, causing some metastability-like confusion for the arbiter (even if you build in some kind of tie-breaking bias, you can always come up with some similarly unlikely scenario), but – and this is key – that can generally be recovered from pretty easily. That’s not the big problem.
Race conditions and deadlocks
The big problem relates to series of operations we want to have handled “atomically.” An atomic operation is akin to what database folks would call a transaction: either the whole series of steps takes place, or none of them does. And it’s important that they take place uninterrupted. For example, if you’re transferring money from one of your bank accounts to another, there are two steps: remove money from account 1, and add money to account 2. If this process gets interrupted, and only the first step occurs, you’re not gonna be happy. Or let’s say you want to take your entire balance from one account and move it to another. This human-plus-computer sequence goes as follows: you check the balance to see what it is, then subtract that amount from the first account and deposit it into the second. If, between checking the balance and removing the money, an outstanding check clears, the balance you checked is no longer valid, and you’re going to subtract more money than you have.
This is the situation referred to as a “race condition” (or, when it boils down to unsynchronized access to the same piece of memory, a “data race”). The outcome of a race condition is, by definition, unpredictable, and it can leave the system in an undefined state. Not a good idea. And unlike the simultaneous-access problem above, this isn’t easy to recover from.
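To make that concrete, here’s a minimal Java sketch of that check-then-act trap. (The Account class and its methods are invented purely for illustration, and this is the deliberately unsafe version.)

```java
// A deliberately unsafe sketch of the check-then-act race described above.
// The Account class and its methods are hypothetical, just for illustration.
class Account {
    private long balance;   // balance in cents, say

    Account(long initialBalance) { this.balance = initialBalance; }

    long getBalance()          { return balance; }
    void withdraw(long amount) { balance -= amount; }
    void deposit(long amount)  { balance += amount; }
}

class Transfer {
    // "Move the entire balance" -- looks reasonable, but it's a race condition:
    // between getBalance() and withdraw(), another thread (a clearing check,
    // say) can change the balance, and we withdraw more than is actually there.
    static void moveEverything(Account from, Account to) {
        long amount = from.getBalance();  // step 1: check
        from.withdraw(amount);            // step 2: act -- the balance may have changed
        to.deposit(amount);
    }
}
```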
In order for your bank transaction to work reliably, you want all the steps to happen atomically: you get exclusive access to the resources – in this case, the bank accounts – and perform all the operations, and then let go so that someone else can get in there. There are two important concepts here: that of a series of steps that must be atomic, and that of protecting resources so that no one else gets them. These are independent (but related) items and both may have to be addressed.
The concept of an atomic series of steps gives rise to the so-called “critical section” of code. Only one thread is allowed to be executing a critical section of code at a time. How could more than one thread execute the same code? Consider a function that is called by two different threads. They could execute concurrently, messing up each other’s data, which is why they have to be restricted. Now, how could they really be executing concurrently? On a single-core processor, they wouldn’t be exactly concurrent, but the execution of one thread could be stopped to allow the other thread to execute, and that interruption could happen at any line of code. On a multicore processor, if the threads were assigned to different cores, then the execution could be truly concurrent.
A critical region, by design, executes atomically – if you ignore other, unrelated accesses to the shared resources (more on that in a minute). But as a result, when one thread enters the critical section, any other thread that comes up to it has to wait until the first thread is done before it can enter, and so it stalls. That’s why you want your critical sections to be small and crisp: to minimize the number of other threads hanging around doing nothing while they await their turn.
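In Java terms, the usual way to carve out a critical section is the synchronized keyword. Here’s a minimal sketch – reusing the hypothetical Account from above, with the class and lock names again invented for illustration – where the whole check-withdraw-deposit sequence runs as one atomic unit with respect to any other thread using the same lock:

```java
class SafeTransfer {
    private final Object lock = new Object();  // one lock covering both accounts

    // Only one thread at a time can be inside the synchronized block, so the
    // check and both updates happen as one atomic unit (with respect to this lock).
    void moveEverything(Account from, Account to) {
        synchronized (lock) {
            long amount = from.getBalance();
            from.withdraw(amount);
            to.deposit(amount);
        }
        // Keep the critical section small: anything that doesn't need the lock
        // (logging, notifications) belongs outside it.
        // Caveat: this only helps if every piece of code that touches these
        // accounts synchronizes on this same lock.
    }
}
```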
There still remains the problem that other unrelated code could access the shared data, and defining a critical region doesn’t help that. The general idea is that a piece of data can be locked, and that any code that wants to access the data has to acquire the lock first. If someone else has the lock, then the requesting code has to wait until the lock is released (and perhaps stand in line).
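One way to make that data/lock pairing explicit yourself is to give the shared data its own lock object and route every access through it. A minimal sketch, with invented names, using Java’s ReentrantLock:

```java
import java.util.concurrent.locks.ReentrantLock;

class SharedCounter {
    private final ReentrantLock lock = new ReentrantLock();  // the lock for 'count'
    private long count;                                      // the data it guards

    void increment() {
        lock.lock();        // stand in line here if someone else holds the lock
        try {
            count++;        // only touch the data while holding the lock
        } finally {
            lock.unlock();  // always release, even if the body throws
        }
    }

    long get() {
        lock.lock();
        try {
            return count;
        } finally {
            lock.unlock();
        }
    }
}
```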
Even the process of getting a lock has to be done atomically. The steps are 1) checking to see if the lock is available, and 2) if so, grabbing the lock. But what happens if, after you see that the lock is available, your code is pre-empted by some other code that also sees that the lock is available and takes it? Now your code, which still thinks the lock is available, can’t grab the lock. This problem exists right down to the level of the machine instruction, which is why processors provide test-and-set instructions designed specifically to test a bit for 1 and, if it isn’t a 1, set it to 1 – all as a single atomic operation, with no one able to get in there and mess things up midstream.
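Java surfaces that same atomic test-and-set through its atomic classes. Here’s a bare-bones spinlock sketch built on AtomicBoolean – purely illustrative; in real code you’d reach for the library’s existing locks rather than rolling your own:

```java
import java.util.concurrent.atomic.AtomicBoolean;

class SpinLock {
    private final AtomicBoolean taken = new AtomicBoolean(false);

    void lock() {
        // compareAndSet does the "test for false and, if so, set to true" as a
        // single atomic step, so no other thread can sneak in between the test
        // and the set.
        while (!taken.compareAndSet(false, true)) {
            Thread.onSpinWait();  // (Java 9+) busy-wait until the holder lets go
        }
    }

    void unlock() {
        taken.set(false);
    }
}
```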
Of course, you can get into trouble with locks. Let’s say you and I both want to gain access to data A and data B. For whatever reason, I take out a lock for data A just as you’re taking out a lock for data B. I then try to get a lock for data B, but I can’t because you have it; you try to get a lock for data A, and you can’t because I have it. And we’re both going to sit here waiting for the other to finish with the lock we want, unaware that we’re waiting for each other. And we’ll wait a long, long time. It’s the classic deadlock, and while it’s pretty simple in this scenario, it can actually involve chains of entities waiting for each other in ways that are less obvious.
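Here’s what that classic two-lock deadlock looks like in a minimal Java sketch (names invented for illustration):

```java
class DeadlockDemo {
    private final Object lockA = new Object();  // guards data A
    private final Object lockB = new Object();  // guards data B

    // Thread 1 calls this: grabs A, then waits for B...
    void methodOne() {
        synchronized (lockA) {
            synchronized (lockB) {
                // use data A and data B
            }
        }
    }

    // ...while thread 2 calls this: grabs B, then waits for A. If each thread
    // gets its first lock before the other gets its second, both wait forever.
    void methodTwo() {
        synchronized (lockB) {
            synchronized (lockA) {
                // use data A and data B
            }
        }
    }
    // The usual defense is a global lock ordering: every thread always takes
    // lockA before lockB, so this interleaving can't happen.
}
```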
One of the insidious problems with race conditions and deadlocks is that they are not guaranteed to be a problem; they are only a potential problem, and they may exist benignly for millions of hours before striking. The huge Northeast blackout of 2003, for example, has been traced in part to a race condition that lurked through three million hours of operation before darkening the lives of fifty million people.
The Java approach
In Java, a “synchronized” block or method uses an object’s intrinsic lock – either the enclosing object’s own lock or that of a specifically chosen lock object – to set up critical regions and protect data. The semantics aren’t quite as transparent as the way things are described above, however. For instance, there is no required explicit association between locks and the data or critical regions they protect. There is a “@GuardedBy” annotation that allows such an association to be stated, but it’s not part of the core language, it’s not required, and it’s apparently not often used. So, not being a well-schooled Java programmer, I find this indirect approach to locking data and code frankly leaves my head spinning faster than a spinlock. Looking at various descriptions and solutions on the web, couched in more or less abstruse language, you end up feeling like this whole topic might be the realm of specialists. Which is a problem, since more and more programmers have to deal with thread-interaction issues.
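For the record, here’s roughly what that association looks like when someone does write it down – a minimal sketch using the @GuardedBy annotation, which lives in the net.jcip.annotations package popularized by Java Concurrency in Practice (it’s not part of the core JDK, and it documents intent for humans and tools rather than being enforced by the compiler):

```java
// Requires the jcip-annotations library (net.jcip.annotations) on the classpath.
import net.jcip.annotations.GuardedBy;

class Inventory {
    private final Object stockLock = new Object();

    // Declares -- for human readers and analysis tools -- which lock protects
    // this field. The compiler doesn't enforce it; it's documentation of intent.
    @GuardedBy("stockLock")
    private int itemsInStock;

    void addStock(int n) {
        synchronized (stockLock) {   // honor the declared association
            itemsInStock += n;
        }
    }

    int available() {
        synchronized (stockLock) {
            return itemsInStock;
        }
    }
}
```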
And the evidence that this stuff is hard is the vast amount of work that has gone into static analysis programs designed to locate race conditions and deadlocks in code. Such programs can work in different ways, but all of them attempt to ensure that access to shared data occurs only through correct lock usage. Because these validation tools use static analysis, they comb through the code without executing it. The accuracy of any such analysis for Java can’t be perfect, however, because the data/lock associations aren’t explicit, so the tool has to guess to some degree. You could have the programmers provide that information as input, but that’s pretty impractical for huge programs, and even so, you would get only the associations the programmer remembered – you wouldn’t test the ones he or she forgot, which are exactly the ones most likely to cause problems.
One of the companies addressing this space, Coverity, has just announced a dynamic analysis program to detect race conditions and deadlocks in Java programs. Unlike the static analysis tools, this checks for race conditions as the code is running, and it is apparently the first program to do so for Java. Dynamic testing is usually done by instrumenting the source code (that is, temporarily adding extra code purely for the purposes of analysis) to track and report on behavior as the program runs. A critical point that Coverity makes with respect to their coverage is that the analysis doesn’t wait for race conditions to actually occur – which might be never – but rather looks for potential race conditions so that they can be identified before (or whether or not) they ever strike.
Combining static and dynamic
There are pros and cons to static and dynamic analysis when run by themselves. Dynamic analysis can demonstrate actual behavior as it occurs, but only in portions of the code that actually execute. Static analysis can completely cover all the code, but depending on what it’s testing, its ability to infer behavior may be more limited. Coverity actually makes a static analysis tool as well, and they recommend the use of both, allowing them to interact.
One way to do this is to start with a static analysis run, which will infer some shared data/lock associations. The ability to infer shared data is also the ability to infer non-shared data, so the static analysis tool ends up with data it is confident is shared, which it tests for race conditions, and data it is confident is not shared, which it can report as such. There is some gray region in between where the inference isn’t strong; that’s where a user may want to do some tuning on the sensitivity to avoid too many false positives – the bane of this kind of testing.
The report on non-shared data can be sent to the dynamic analysis tool, which can then skip doing any instrumentation for that data. This can speed up the execution, since instrumentation is bound to slow things down some (and minimizing that impact is usually a key design goal – Coverity claims a 2X runtime overhead, as compared to 10X or even 100X for other dynamic tools). The dynamic analysis tool gets to observe the actual acquiring of locks in association with data accesses, providing yet more confidence in the data/lock associations. Once this is done, those associations can be fed back to the static analysis tool for one more run, which, now armed with better information on data/lock pairs, can potentially provide a more precise result than the first time, eliminating more of that gray area where inference was uncertain.
By including such tools in their validation suites, programmers can get more visibility into where shared data needs better protection. It may feel like yet more work, but the payoff is that you can identify extremely elusive flaws ahead of time, improving the quality and robustness of the code before it ships. It’s far less effort than trying to debug race conditions as they happen.