Monday, May 23, 2011

On concurrency issues

Concurrency issues — race conditions and the like — are the worst category of bugs. These are the bugs that cannot readily be proven absent by unit tests; these are the kind of bugs that hide away, biding their time until the most inopportune moment, then rearing their ugly, non-deterministic heads on your production system when it is at its busiest. And even then, they can continue to exist unlocated for quite a while, despite many hours being put into tracking them down. Elusive, hardly reproducible, yet ultimately expensive; I've seen it happen more than once.

What follows are some thoughts on the anatomy of the typical issues.

Your basic multi-threading bug
As any course on multi-threaded programming will tell you, when multiple threads of execution are to run concurrently, care needs to be taken or there will be a risk of data corruption and/or unintended results.

More specifically, the threat (a "race condition" or "data race") is present when:
  1. one thread modifies the state of an object
  2. at the same time as
  3. another thread accesses the object.
That is: it takes a coincidence, a conjunction of three conditions.
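To make the conjunction concrete, here is a minimal Java sketch (class and field names are mine, purely for illustration): two threads both perform a read-modify-write on a plain int field, so updates can be lost, while an AtomicInteger survives the same treatment unharmed.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RaceDemo {
    static int unsafeCounter = 0;  // shared, mutable, unprotected: all three conditions can coincide
    static final AtomicInteger safeCounter = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                unsafeCounter++;               // non-atomic read-modify-write: a data race
                safeCounter.incrementAndGet(); // atomic update: no race
            }
        };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
        // safeCounter always ends at 200000; unsafeCounter will often fall short.
        System.out.println("unsafe: " + unsafeCounter + ", safe: " + safeCounter.get());
    }
}
```

Note the non-determinism already: on a lightly loaded machine the unsafe counter may occasionally come out right, which is exactly what makes this class of bug so treacherous.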
Let's analyze it...:

Analysis

Another way of stating the above is obtained by reversing the statement:
We can avoid concurrency issues by always making sure that any object either
  1. is never modified; or
  2. is never accessed by two threads simultaneously (or stronger: never written to while accessed otherwise); or
  3. can only ever be accessed by one thread.
Such objects are known as, respectively,
  1. Immutable objects.
  2. Objects with state protected by a mutex (synchronization lock; monitor)
and the one used less often:
  3. Single-thread objects — or even stronger: objects of linear type.
The first two options are well-known; personally, I'm increasingly coming to consider the strong third option — objects with enforced linear lifecycle — to be rather overlooked, language design space-wise.
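In Java terms, the first two categories might be sketched as follows (the class names are my own illustrations, not from any particular library):

```java
import java.util.ArrayList;
import java.util.List;

// Category 1: immutable -- all fields final, no mutators, no leaked mutable state.
final class Point {
    private final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }
    int x() { return x; }
    int y() { return y; }
    Point translate(int dx, int dy) { return new Point(x + dx, y + dy); } // returns a new value
}

// Category 2: mutable, but all state guarded by one mutex (here: the object's own monitor).
final class Journal {
    private final List<String> entries = new ArrayList<>();
    synchronized void add(String entry) { entries.add(entry); }
    synchronized List<String> snapshot() { return new ArrayList<>(entries); } // a copy, not the internal list
}
```

The third category has no direct language support in Java: ThreadLocal can give each thread its own instance, but nothing prevents a reference from escaping to another thread.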

"Concurrency-ready" metric
One possible metric for how well a programming language is designed to express complex concurrent systems is, then, how easy it is to enforce that all objects fall into one of the three categories.

Languages and concurrency

Let's apply this view to a few programming languages. The two languages I've used most recently are Java and Erlang:
Erlang
In Erlang, there are the following kinds of objects: terms, message queues, process dictionaries and other process metadata, ETS tables, ports (files and drivers).
  • Terms are immutable values. They may contain handles of other kinds of objects, but the handles themselves are also immutable.
  • Certain objects — private ETS tables, and to some extent ports — are single-thread objects.
  • The rest — message queues, public ETS tables, process metadata etc. — are mutex-protected.
Single-thread objects are enforced not to be used by threads other than the owning one.
In Erlang, low-level data races only occur if there's a bug in the Erlang run-time system, or if you write your own driver and include one.

Java
In Java, you could argue both that there are fewer kinds of object — just one, really, the class-file-defined kind — and that there is a much wider range of object kinds. There are certainly thousands of classes, all defined by one well-defined scheme, which includes just a few mechanisms relevant to concurrency.
But the problem here is the number of classes — because it is at the class level that thread-safety of the corresponding objects must be ensured. A well-designed and well-implemented class may be thread-safe, in that it is either immutable or uses appropriate inter-thread synchronization (and encapsulation) to ensure thread-safety. A less well-written class may rely implicitly on details of the context in which it is used, and really be single-thread-use only, or multi-thread usable only under certain unstated conditions.
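A classic concrete case of the latter is java.text.SimpleDateFormat, which is documented as not thread-safe. A wrapper like the following sketch (the class is my own illustration) looks harmless, yet is single-thread-use only, and nothing in its interface says so:

```java
import java.text.SimpleDateFormat;
import java.util.Date;

// Looks innocent, but SimpleDateFormat keeps internal mutable state,
// so concurrent stamp() calls can silently corrupt each other's results.
class Timestamper {
    private final SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");

    String stamp(Date d) {
        return fmt.format(d);  // safe only when called from a single thread
    }
}
```

The invariant ("one thread at a time, please") lives in an implementation detail; the caller has no way to see it from the type.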
Java is one of the few languages actually designed for portable multi-threaded programs; in particular, it has explicitly stated semantics wrt. multi-threaded execution. However, as the above comparison indicates, it has its shortcomings. For a language to claim good concurrency support, it should provide mechanisms for good, usable guarantees to be derived from local context (e.g. "we know that this-and-this property always holds, because these few lines here ensure that it does").
I've expounded earlier on the importance of supporting local reasoning. As it happens, Java did also then get some criticism (sorry, Java, but you're the modern main-stream language I happen to know the best...).

A Java exercise
Imagine that you have in front of you the source code of a Java class.
A quick inspection reveals that all methods are declared "synchronized".
What kinds of thread-safety issues might the class yet have? Try to list at least three ways in which the class may be non-thread-safe.
I'll present my list at the end of this article.

Does it matter?

But building safeguards against these issues into a language is not exactly trivial, you might argue. Are stricter language rules, additional compiler analysis, and/or costly run-time support, just for preventing concurrency issues, not overkill?
Sure, the trend is towards increasing parallelism and so on, but we have done quite fine without such extra measures so far, haven't we?
And bondage-and-discipline languages have been out of fashion for a while. Dynamically typed languages are as popular as ever!

The difference is this:

You can easily live and work with the relative uncertainties of dynamic typing — but then, you can unit test and get some confidence that the types match. If something is broken, or breaks later, then there's a good chance it'll be caught.
For many concurrency issues, unit tests are not likely to catch any errors. Furthermore, the necessary invariants aren't local — they are often widely dispersed in the code. To convince yourself that the code is correct, you more or less need to keep it all in your head at once. That, combined with missing language support for documenting and/or enforcing vital non-local invariants, means that they will perhaps not be communicated to whoever makes the next change in the code, who will therefore not have the full picture necessary to keep things correct.

Rather than being caught at the next full test suite run, or at least quite soon after the rubber hits the road, here's what happens to a race-condition bug:
  • The program may appear to work most of the time.
  • The issue will tend to manifest itself at the most inopportune moment: Not during development or testing (unless explicit and considerable effort is taken to stress-test against such issues), but in production, when your servers are at their busiest, or on the desktop of your busiest client.
  • Often, what clues you have amount to little besides "There almost certainly is an issue, it presumably is a software bug, the issue is probably in our code — and it occurs seldom, so it's likely to be a concurrency issue of some kind."
  • Replicating the issue may be difficult; under the exact same circumstances, the program may run just fine.
    Indeed, if the problem does manifest itself during development, it'll appear to have gone at the next run.
    The issue may even be technically impossible to reproduce on some machine architectures, because it requires multiple physical processors of the right kinds to manifest itself (and your servers or other production environment is less likely than the development machines to be thus bug-resistant).
  • Eliminating a tricky concurrency bug can be a drawn-out experience in all phases — detecting it, reproducing it, tracking down its root cause, verifying that it has gone — all steps tend to be markedly more difficult than for deterministic bugs.
  • With any kind of bug, tracking it down once is one thing; there may even be a feeling of gratification once you've done it. Tracking the same bug down twice is another matter — it is deeply frustrating to realize that there's a reason for your feeling of déjà vu. This is one of the reasons for regression testing: bug hunting may be stimulating in its own way, but it'd better lead to a different bug each time.
    This goes doubly (well, even more than that) for the elusive non-deterministic bugs.
    Sadly, because of their nature it is at best difficult, and often near impossible, to write regression tests for these kinds of issues.

Conclusion

Languages differ in concurrency support. Nothing new about that, of course, but I think it likely that many developers using one of the mainstream languages which have a relatively good level of support in that area may not know that there are alternatives which are significantly more concurrency-ready.
In the beginning of this text, the prerequisites for a race condition were broken down, and three conditions for avoiding race conditions were derived; based on this, I suggested a qualitative metric for a language's level of support for concurrent programming. I hope to have demonstrated that it may be a useful way of looking at both languages and concrete programs or program designs.

In the absence of a programming language providing strong, local guarantees with respect to thread-safety, as a developer you need to be alert whenever there's a chance that two threads may be executing your code concurrently. The best way of doing this is probably through discipline — for instance, by clearly constructing each class so that it falls into one of the above categories: Immutable, thread-safe through synchronization, or single-thread use only — and then using them strictly according to this. That is one way of trying to restore local reasoning.

Do you write programs involving multiple threads? If so, are you familiar with which consistency guarantees your platform actually provides (e.g., the Java Memory Model)?
Do you regularly stress-test your program on a system with multiple physical processors? I hope you do.
Dealing with concurrent programs in general requires good global, combinatorial-temporal reasoning abilities. Probably not your best-developed cognitive mode... :-)
The solution? Keep things as simple as possible. Find rules that work, and follow them. Encapsulate the issues, so that you can deal with just one question at a time. If possible, let the rules be checked mechanically.


Answer to Java exercise
Some ways in which a Java class may be unsafe even though all methods are declared "synchronized":
  1. A field is public, and either
    1. Non-final (and non-volatile) or
    2. Referring to a non-thread-safe object.
  2. The superclass is non-thread-safe, and the current class does not or cannot override the causes.
  3. An instance method modifies the value of a static variable without proper synchronization.
  4. An instance method reads the value of a static variable which is also modified by a static method, without proper synchronization.
  5. A method exposes a reference to a non-thread-safe object referred to directly or indirectly by the current object.
    1. By returning such a reference.
    2. By passing such a reference to some method which stores the reference somewhere accessible to another thread.
    3. By starting a new thread and giving it such a reference.
    (I.e., a non-thread-safe object becomes shared.)
  6. A field (which may be private) refers to a non-thread-safe object which may be shared because
    1. It originated as (or was extracted from) a parameter to a constructor
    2. It originated as (or was extracted from) a parameter to a method
    3. It was returned from a call
    (I.e., a non-thread-safe object is already unexpectedly shared.)
  7. An inner class accesses an instance variable of the surrounding class, but fails to synchronize on the right this.
  8. An instance method locks on another object which may be of the same class (this may result in a deadlock)
    1. Implicitly, by calling a (synchronized) method on the other object
    2. Explicitly, using "synchronized"
    (A plausible example of this is a synchronized equals() method.)
  9. A constructor leaks this to some place where it can be accessed by another thread, and the object has at least one (final) variable which is accessed without synchronization.
    (This is because the special rule for final fields, which allows them to be accessed without synchronization, only applies after the constructor has completed.)
  10. A method exposes a reference to a thread-safe object referred to directly or indirectly by the current object, and that thread-safe object allows mutation (of itself or of a contained object) in a way that the current class is not prepared for.
This list is quite likely not complete; it is not the result of a systematic analysis of the matter.
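To make item 5 concrete, here is a sketch (the class is my own illustration) of a class whose every method is synchronized, yet which leaks its guarded internal list, so callers can mutate it with no lock held:

```java
import java.util.ArrayList;
import java.util.List;

class LeakyRegistry {
    private final List<String> names = new ArrayList<>();

    synchronized void register(String name) { names.add(name); }

    // Every method is synchronized -- and yet: this hands out the guarded object itself.
    synchronized List<String> getNames() { return names; }
}
```

Any caller can now call add() or remove() on the returned list concurrently with register(), racing on the unsynchronized ArrayList underneath; returning a defensive copy instead closes the hole.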

2 comments:

  1. One important thread-safety problem with your fully synchronised structure may be non-atomic updates causing race conditions.

    For instance, consider a Collections.synchronizedMap(m) where two client threads try to get something and then call put because the object wasn't there on the read. Two value objects are then created and, depending on the client code, both may now be in use. This is worse if the client code reads the value and then writes back a new value (say, an Integer count), as the second put will over-write the first, causing lost updates.

  2. @jed: Indeed - This kind of problem is not covered by the basic data race dissection, but it is exactly what I tried to describe in answer #10. Just because all individual variables are protected in some way doesn't mean that the whole might not break...
    There may be invariants on all levels of a system, and that is difficult to get around.
    One partial solution -- where it is possible -- is not to expose mutable objects at all, where the state of these objects is part of the current object's invariants.

    Basically, there are no silver bullets for concurrent systems; I've experienced an application-level race condition in Erlang too. But it matters a great deal at which abstraction level your platform's guarantees stop, and where you start to have to be very careful yourself.
