Reading notes for Java Concurrency in Practice

Perhaps surprisingly, concurrent programming isn’t so much about threads or locks, any more than civil engineering is about rivets and I-beams. Of course, building bridges that don’t fall down requires the correct use of a lot of rivets and I-beams, just as building concurrent programs requires the correct use of threads and locks. But these are just mechanisms, means to an end. Writing thread-safe code is, at its core, about managing access to state, and in particular to shared, mutable state.

Part I: Thread safety

A class is thread-safe if it behaves correctly when accessed from multiple threads, regardless of the scheduling or interleaving of the execution of those threads by the runtime environment, and with no additional synchronization or other coordination on the part of the calling code.

Every Java object can implicitly act as a lock for purposes of synchronization; these built-in locks are called intrinsic locks or monitor locks. The lock is automatically acquired by the executing thread before entering a synchronized block and automatically released when control exits the synchronized block. Intrinsic locks in Java act as mutexes.

A synchronized block has 2 parts: a reference to an object that will serve as the lock, and a block of code to be guarded by that lock. Static synchronized methods use the Class object for the lock.
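For example, a minimal sketch (names are illustrative):

public class Counter {
    private final Object lock = new Object(); // part 1: the lock object
    private int count;

    public void increment() {
        synchronized (lock) { // part 2: the block guarded by that lock
            count++;
        }
    }

    // a static synchronized method uses Counter.class as the lock
    public static synchronized void doStaticWork() { /* ... */ }
}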

Reentrancy: Intrinsic locks are reentrant. If a thread tries to acquire a lock that it already holds, the request succeeds. Reentrancy means that locks are acquired on a per-thread rather than per-invocation basis. This differs from the default locking behaviour for pthreads (POSIX threads) mutexes, which are granted on a per-invocation basis.
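JCIP illustrates why this matters with a subclass that calls a synchronized superclass method; without reentrancy, the call below would deadlock with itself:

class Widget {
    public synchronized void doSomething() { /* ... */ }
}

class LoggingWidget extends Widget {
    public synchronized void doSomething() {
        System.out.println(toString() + ": calling doSomething");
        // The thread already holds this object's lock; because intrinsic
        // locks are reentrant, this call succeeds instead of blocking forever.
        super.doSomething();
    }
}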

For each mutable state variable that may be accessed by more than one thread, all accesses to that variable must be performed with the same lock held. In this case we say that the variable is guarded by that lock.

Sharing Objects

Ways to share objects safely:

  • Thread-confined
  • Shared read-only
  • Shared thread-safe
  • Guarded

Memory visibility:

Locking is not just about mutual exclusion; it is also about memory visibility. To ensure that all threads see the most up-to-date values of shared mutable variables, the reading and writing threads must synchronize on a common lock.

In general, there is no guarantee that the reading thread will see a value written by another thread on a timely basis, or even at all.

In the JVM, there is no guarantee that operations in one thread will be performed in the order given by the program, as long as the reordering is not detectable from within that thread - even if the reordering is apparent to other threads. This is meant to allow JVMs to take full advantage of the performance of modern multiprocessor hardware, e.g. in the absence of synchronization, the Java Memory Model permits the compiler to reorder operations and cache values in registers, and permits CPUs to reorder operations and cache values in processor-specific caches.

Out-of-thin-air safety: The guarantee that when a thread reads a variable without synchronization, it may see a stale value, but at least it sees a value that was actually placed there by some thread rather than some random value. This applies to all variables except 64-bit numeric variables (double and long). Out-of-thin-air safety does not apply to 64-bit numeric variables that are not declared volatile because the JVM is permitted to treat a 64-bit read or write as two separate 32-bit operations (i.e. fetch and store operations are not atomic). So if the reads and writes occur in different threads, it is possible to read a nonvolatile long and get back the high 32 bits of one value and the low 32 bits of another.

Intrinsic locking can be used to guarantee that one thread sees the effects of another in a predictable manner. Everything A did in or prior to a synchronized block is visible to B when it executes a synchronized block guarded by the same lock. Without synchronization, there is no such guarantee.

Volatile:

Volatile variables are a weaker form of synchronization than synchronized. When a field is declared volatile, the compiler and runtime are put on notice that this variable is shared and that operations on it should not be reordered with other memory operations. Volatile variables are not cached in registers or in caches where they are hidden from other processors, so a read of a volatile variable always returns the most recent write by any thread.

Accessing a volatile variable performs no locking and cannot cause the executing thread to block, making it a lighter-weight synchronization mechanism than synchronized.

The most common use for volatile variables is as a completion, interruption, or status flag (see the sketch after this list). Use volatile variables only when all the following criteria are met:

  • writes to the variable do not depend on its current value, or you can ensure that only a single thread ever updates the value
  • the variable does not participate in invariants with other state variables
  • locking is not required for any other reason while the variable is being accessed
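
A minimal status-flag sketch (names illustrative): one thread sets the flag, workers poll it, and volatile guarantees the write is visible to them.

class TaskLoop {
    private volatile boolean shutdownRequested;

    public void shutdown() { shutdownRequested = true; }

    public void doWork() {
        while (!shutdownRequested) {
            // ... do one unit of work
        }
    }
}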

Locking can guarantee both visibility and atomicity; volatile variables can only guarantee visibility.

Publication and escape:

Publishing an object means making it available to code outside of its current scope. Do not let the this reference escape during construction. When an object creates a thread from its constructor, it almost always shares its this reference with the new thread, and the new thread might be able to see the owning object before it is fully constructed. You can create a thread in a constructor, but it is best not to start it there. If you want the thread started as part of object creation, you can avoid improper construction by using a private constructor and a public factory method that starts the thread after construction completes (JCIP P42).
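A sketch of that factory-method idiom (names are illustrative, not JCIP's):

class Worker {
    private final Thread thread;

    private Worker() {
        // Creating the thread here is safe: `this` is captured, but the
        // thread does not run (and so cannot observe a partially
        // constructed object) until it is started.
        thread = new Thread(this::run);
    }

    public static Worker newInstance() {
        Worker w = new Worker();
        w.thread.start(); // started only after the constructor has returned
        return w;
    }

    private void run() { /* ... */ }
}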

Thread confinement:

Confine objects to a thread, so that we don’t share mutable state.

  • Ad-hoc thread confinement: describes when the responsibility for maintaining thread confinement falls entirely on the implementation.

  • Stack confinement: An object can only be reached through local variables, which are intrinsically confined to the executing thread; they exist on the executing thread’s stack, which is not accessible to other threads.

    • For primitively typed local variables, you cannot violate stack confinement even if you tried. There is no way to obtain a reference to a primitive variable, so Java ensures they are always stack confined.
  • ThreadLocal: Allows you to associate a per-thread value with a value-holding object, i.e. this value is local to the thread.
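
A minimal ThreadLocal sketch: SimpleDateFormat is not thread-safe, so give each thread its own instance (DateFormats is an illustrative name):

import java.text.SimpleDateFormat;
import java.util.Date;

class DateFormats {
    private static final ThreadLocal<SimpleDateFormat> FORMAT =
        ThreadLocal.withInitial(() -> new SimpleDateFormat("yyyy-MM-dd"));

    static String format(Date d) {
        return FORMAT.get().format(d); // each thread gets its own copy
    }
}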

Immutable objects are always thread-safe. Immutability is not equivalent to simply declaring all fields of an object final, since final fields can hold references to mutable objects. An object is immutable if:

  • its state cannot be modified after construction
  • all its fields are final
  • it is properly constructed (the this reference does not escape during construction)

Marking a field final means it cannot be reassigned (although the object it refers to can still be modified if it is mutable). It is good practice to make all fields final unless they need to be mutable. There is a difference between an object being immutable and the reference to it being immutable. Program state stored in immutable objects can still be updated by “replacing” immutable objects with a new instance holding new state. There is no need to fear that this approach will create performance problems - allocation is cheaper than you might think, and immutable objects offer additional performance advantages such as reduced need for locking or defensive copies and reduced impact on generational garbage collection.

When a group of related data items must be acted on atomically, consider creating an immutable holder class for them (JCIP P48-49).
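JCIP's example is roughly the following immutable holder, which lets a number and its factors be read and replaced as a unit (a volatile field of type OneValueCache is then swapped wholesale on each update):

@Immutable
class OneValueCache {
    private final BigInteger lastNumber;
    private final BigInteger[] lastFactors;

    public OneValueCache(BigInteger i, BigInteger[] factors) {
        lastNumber = i;
        lastFactors = Arrays.copyOf(factors, factors.length); // defensive copy
    }

    public BigInteger[] getFactors(BigInteger i) {
        if (lastNumber == null || !lastNumber.equals(i))
            return null;
        return Arrays.copyOf(lastFactors, lastFactors.length);
    }
}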

Safe publication:

This is unsafe:

// Unsafe publication
public Holder holder;

public void initialize() {
    holder = new Holder(42);
}

Improper publication can allow another thread to observe a partially constructed object. (How? Without a happens-before relationship between the threads, the write of the holder reference can be reordered with the writes to the Holder’s fields, so another thread may see the new reference while the fields still hold default or partial values.)

Immutable objects can be used safely by any thread without additional synchronization, even when synchronization is not used to publish them.

To publish an object safely, both the reference to the object and the object’s state must be made visible to other threads at the same time. A properly constructed object can be safely published by:

  • initializing an object reference from a static initializer
  • storing a reference to it into a volatile field or AtomicReference
  • storing a reference to it into a final field of a properly constructed object
  • storing a reference to it into a field that is properly guarded by a lock

e.g. we can place the object in a thread-safe collection, such as a Vector or a synchronizedList. Or we can use a static initializer (static initializers are executed by the JVM at class initialization time; because of internal synchronization in the JVM, this mechanism is guaranteed to safely publish any objects initialized this way):

public static Holder holder = new Holder(42);

Composing Objects

Encapsulating data within an object confines access to the data to the object’s methods, making it easier to ensure that the data is always accessed with the appropriate lock held (instance confinement, often just called confinement). Confinement makes it easier to build thread-safe classes because a class that confines its state can be analyzed for thread safety without having to examine the whole program.

class X {

    private int a;
    private int b;

    public synchronized void addA(){
        a++;
    }

    public synchronized void addB(){
        b++;
    }

}

Synchronizing on the method is functionally equivalent to wrapping the body of the method in synchronized(this). The “this” object doesn’t itself get “locked”; rather, “this” is used as the mutex, and the body is prevented from executing concurrently with other code sections also synchronized on “this”, i.e. you cannot run addA and addB at the same time. It has no effect on other fields/methods of “this” that aren’t synchronized.

Java monitor pattern: An object following the Java monitor pattern encapsulates all its mutable state and guards it with the object’s own intrinsic lock.
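class X above follows this pattern, guarding a and b with its own intrinsic lock via synchronized methods. A common variant (JCIP's "private lock" idiom) guards state with a private lock object instead, so client code cannot acquire or interfere with the lock:

class PrivateLock {
    private final Object myLock = new Object();
    private int state; // guarded by myLock

    void someMethod() {
        synchronized (myLock) {
            // access or modify state
        }
    }
}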

If a class is composed of multiple independent thread-safe state variables and has no operations that have any invalid state transitions, then it can delegate thread safety to the underlying state variables.

To add functionality to existing thread-safe classes, you can extend the class, do client-side locking, or do composition.

Client-side locking means extending the functionality of the class without extending the class itself, by placing the extension code in a “helper” class. When doing this, remember to use the correct lock (don’t lock on the helper class instance itself). Lock on the original thread-safe object:

@ThreadSafe
public class ListHelper<E> {
    public List<E> list =
        Collections.synchronizedList(new ArrayList<E>());
    ...
    public boolean putIfAbsent(E x) {
        synchronized (list) {
            boolean absent = !list.contains(x);
            if (absent)
                list.add(x);
            return absent;
        }
    }
}

In composition, the wrapper adds an additional layer of locking using its own intrinsic lock, e.g.

@ThreadSafe
public class ImprovedList<T> implements List<T> {
    private final List<T> list;
    public ImprovedList(List<T> list) { this.list = list; }

    public synchronized boolean putIfAbsent(T x) {
        boolean contains = list.contains(x);
        if (!contains)
            list.add(x);
        return !contains;
    }
    public synchronized void clear() { list.clear(); }
    // ... similarly delegate other List methods
}

Composition is less fragile because it provides thread safety even if the List is not thread-safe or changes its locking implementation (in contrast to extending the class and doing client-side locking).

Building Blocks

Blocking queues lend themselves to the producer-consumer pattern.
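A minimal producer-consumer sketch with a bounded queue (names illustrative):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class ProducerConsumerDemo {
    public static void main(String[] args) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(10); // bounded

        new Thread(() -> { // producer
            try {
                queue.put("work item"); // blocks while the queue is full
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }).start();

        new Thread(() -> { // consumer
            try {
                System.out.println("consumed: " + queue.take()); // blocks while empty
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }).start();
    }
}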

Deques (pronounced “deck”) lend themselves to the work stealing pattern: every consumer has its own deque. If a consumer exhausts the work in its own deque, it can steal work from the tail of someone else’s deque. Work stealing can be more scalable than a traditional producer-consumer design because workers don’t contend for a shared work queue; most of the time they access only their own deque, reducing contention. When a worker has to access another’s queue, it does so from the tail rather than the head, further reducing contention. Work stealing is suited to problems in which consumers are also producers, for example, processing a page in a web crawler usually results in the identification of new pages to be crawled. When a worker identifies a new unit of work, it places it at the end of its own deque (or, in a work sharing design, on that of another worker).

A synchronizer is any object that coordinates the control flow of threads based on its state. Blocking queues can act as synchronizers; other types of synchronizers include semaphores, barriers, and latches.

A latch is a synchronizer that can delay the progress of threads until it reaches its terminal state.
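e.g. a driver thread can use a latch to wait for n workers to finish (in the spirit of JCIP's TestHarness):

import java.util.concurrent.CountDownLatch;

class LatchDemo {
    public static void main(String[] args) throws InterruptedException {
        int n = 4;
        CountDownLatch done = new CountDownLatch(n);
        for (int i = 0; i < n; i++) {
            new Thread(() -> {
                try {
                    // ... do this worker's share of the job
                } finally {
                    done.countDown(); // signal that this worker is finished
                }
            }).start();
        }
        done.await(); // blocks until the count reaches zero
        System.out.println("all workers finished");
    }
}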

A barrier blocks a group of threads until some event has occurred. The difference from a latch is that latches are for waiting for events; barriers are for waiting for other threads. Barriers are often used in simulations, where the work to calculate one step can be done in parallel but all the work associated with a given step must complete before advancing to the next step. e.g. in n-body particle simulations, each step calculates an update to the position of each particle based on the locations and other attributes of the other particles. Waiting on a barrier between each update ensures that all updates for step k have completed before moving on to step k+1.
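A minimal sketch of that step-synchronization idea (names illustrative):

import java.util.concurrent.CyclicBarrier;

class BarrierDemo {
    public static void main(String[] args) {
        int parties = 4;
        // The optional barrier action runs once each time all parties arrive.
        CyclicBarrier barrier = new CyclicBarrier(parties,
                () -> System.out.println("step complete"));
        for (int i = 0; i < parties; i++) {
            new Thread(() -> {
                try {
                    for (int step = 0; step < 3; step++) {
                        // ... compute this thread's share of the step
                        barrier.await(); // wait for all threads before advancing
                    }
                } catch (Exception e) { // InterruptedException, BrokenBarrierException
                    Thread.currentThread().interrupt();
                }
            }).start();
        }
    }
}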

Another form of barrier is Exchanger, a two-party barrier in which the parties exchange data at the barrier point.

Part II: Task Execution

Disadvantages of unbounded thread creation:

  • Thread lifecycle overhead: thread creation and teardown are not free.
  • Resource consumption: active threads consume system resources, especially memory. Idle threads can tie up a lot of memory.
  • Stability: there is a limit on how many threads can be created.

Cancellation and Shutdown

Threads are divided into two types: normal threads and daemon threads. When the JVM starts up, all the threads it creates (such as garbage collector and other housekeeping threads) are daemon threads, except the main thread. When a new thread is created, it inherits the daemon status of the thread that created it, so by default any threads created by the main thread are also normal threads.

Normal threads and daemon threads differ only in what happens when they exit. When there are no more normal threads (even though there may be daemon threads running), the JVM initiates an orderly shutdown. When the JVM halts, any remaining daemon threads are abandoned - finally blocks are not executed, stacks are not unwound - the JVM just exits. Daemon threads should be used sparingly since most threads need some kind of cleanup, especially tasks that perform I/O. Daemon threads are best saved for “housekeeping” tasks, such as a background thread that periodically removes expired entries from an in-memory cache.
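Daemon status must be set before the thread is started; a minimal sketch:

Thread cleaner = new Thread(() -> {
    // ... periodically remove expired cache entries
});
cleaner.setDaemon(true); // must precede start(), else IllegalThreadStateException
cleaner.start();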

Avoid finalizers. (What are finalizers? A finalizer is an object’s finalize() method, which the garbage collector may call before reclaiming the object; there is no guarantee about when, or even whether, it runs, and finalizers carry a significant performance cost. JCIP P165.)

Applying Thread Pools

The Executor framework decouples task submission from task execution. However, some kinds of tasks require special execution policies:

  • Dependent tasks: tasks that depend on the timing, results, or side effects of other tasks. This creates constraints on the execution policy that must be carefully managed to avoid liveness problems.
  • Tasks that exploit thread confinement: tasks that are not thread-safe, so they require their executor to be single-threaded, e.g. newSingleThreadExecutor. If in this case you use a thread pool, then thread-safety could be lost.
  • Response-time-sensitive tasks: e.g. GUI applications. Submitting a long-running task to a single-threaded executor, or submitting several long-running tasks to a thread pool with a small number of threads, may impair the responsiveness of the service managed by that Executor.
  • Tasks that use ThreadLocal: ThreadLocal allows each thread to have its own private “version” of a variable. However, executors are free to reuse threads as they see fit. ThreadLocal makes sense to use in pool threads only if the thread-local value has a lifetime that is bounded by that of a task; ThreadLocal should not be used in pool threads to communicate values between tasks.

Thread pools work best when tasks are homogeneous and independent. Mixing long-running and short-running tasks risks “clogging” the pool unless it is very large; submitting tasks that depend on other tasks risks deadlock unless the pool is unbounded.

Thread starvation deadlock can occur whenever a pool task initiates an unbounded blocking wait for some resource or condition that can succeed only through the action of another pool task, such as waiting for the return value or side effect of another task, unless you can guarantee that the pool is large enough. e.g. in a single-threaded executor, a task that submits another task to the same executor and waits for its result will always deadlock.
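A minimal sketch of the single-threaded case (in the spirit of JCIP's ThreadDeadlock listing):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class StarvationDeadlock {
    static final ExecutorService exec = Executors.newSingleThreadExecutor();

    public static void main(String[] args) throws Exception {
        Future<String> outer = exec.submit(() -> {
            // The inner task can never run: the pool's only thread is
            // stuck right here waiting for its result.
            Future<String> inner = exec.submit(() -> "result");
            return inner.get(); // deadlocks
        });
        System.out.println(outer.get()); // never returns
    }
}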

Sizing thread pools: For compute-intensive tasks, an N_cpu-processor system usually achieves optimum utilization with a thread pool of N_cpu + 1 threads.

Optimal pool size is N_threads = N_cpu * U_cpu * (1 + W/C), where N_cpu is the number of CPUs, U_cpu is the target CPU utilization (between 0 and 1), and W/C is the ratio of wait time to compute time.
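e.g. on an 8-CPU machine, targeting full utilization (U_cpu = 1) with tasks that spend twice as long waiting as computing (W/C = 2), the formula gives N_threads = 8 * 1 * (1 + 2) = 24 threads.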

Loop parallelization can also be applied to some recursive designs; there are often sequential loops within the recursive algorithm that can be parallelized, e.g. the below example does a depth-first traversal of a tree, performing a calculation on each node and placing the result in a collection.

public <T> void sequentialRecursive(List<Node<T>> nodes,
                                    Collection<T> results) {
    for (Node<T> n : nodes) {
        results.add(n.compute());
        sequentialRecursive(n.getChildren(), results);
    }
}

// results must be a thread-safe collection, since pool threads
// add to it concurrently
public <T> void parallelRecursive(final Executor exec,
                                  List<Node<T>> nodes,
                                  final Collection<T> results) {
    for (final Node<T> n : nodes) {
        exec.execute(new Runnable() {
            public void run() {
                results.add(n.compute());
            }
        });
        parallelRecursive(exec, n.getChildren(), results);
    }
}

When parallelRecursive returns, each node in the tree has been visited (the traversal is still sequential: only the calls to compute are executed in parallel) and the computation for each node has been queued to the Executor.

GUI Applications

Nearly all GUI toolkits are single-threaded. Multithreaded GUI frameworks tend to be particularly susceptible to deadlock, partially because of the interaction between the OS and the object-oriented modeling of GUI components. Actions initiated by the user tend to “bubble up” from the OS to the application–a mouse click is detected by the OS, is turned into a “mouse click” event by the toolkit, and is eventually delivered to an application listener as a higher level event such as a “button pressed” event. On the other hand, application-initiated actions “bubble down” from the application to the OS–changing the background color of a component originates in the application and is dispatched to a specific component class and eventually into the OS for rendering. Combining this tendency for activities to access the same GUI objects in the opposite order with the requirement of making each object thread-safe yields a recipe for inconsistent lock ordering, which leads to deadlock.

Part III: Avoiding Liveness Hazards

A program will be free of lock-ordering deadlocks if all threads acquire the locks they need in a fixed global order.
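When the order cannot be fixed statically (e.g. locking two accounts passed in as arguments), an order can be induced, as JCIP does with System.identityHashCode; a simplified sketch (tie-breaking for equal hash codes omitted):

void withBothLocks(Object lock1, Object lock2, Runnable action) {
    // Order the two locks consistently so that threads locking the same
    // pair in opposite argument order cannot deadlock.
    Object first = lock1, second = lock2;
    if (System.identityHashCode(lock1) > System.identityHashCode(lock2)) {
        first = lock2;
        second = lock1;
    }
    synchronized (first) {
        synchronized (second) {
            action.run();
        }
    }
}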

Invoking an alien method with a lock held is asking for liveness trouble. The alien method might acquire other locks (risking deadlock) or block for an unexpectedly long time, stalling other threads that need the lock you hold. Calling a method with no locks held is called an open call, and classes that rely on open calls are more well-behaved and composable than classes that make calls with locks held.

Restructuring a synchronized block to allow open calls can sometimes have undesirable consequences, since it takes an operation that was atomic and makes it not atomic. If the loss of atomicity is a problem, you have to use another technique to achieve atomicity.

One such technique is to structure a concurrent object so that only one thread can execute the code path following the open call. For example, when shutting down a service, you may want to wait for in-progress operations to complete and then release resources used by the service. Holding the service lock while waiting for operations to complete is inherently deadlock-prone, but releasing the service lock before the service is shut down may let other threads start new operations. The solution is to hold the lock long enough to update the service state to “shutting down” so that other threads wanting to start new operations–including shutting down the service–see that the service is unavailable, and do not try. You can then wait for shutdown to complete, knowing that only the shutdown thread has access to the service state after the open call completes. Thus, rather than using locking to keep the other threads out of a critical section of code, this technique relies on constructing protocols so that other threads don’t try to get in.

Liveness hazards: deadlock, livelock, starvation, missed signals

Performance and Scalability

Amdahl’s law: it describes how much a program can theoretically be sped up by additional computing resources, based on the proportion of parallelizable and serial components. Amdahl’s law says that we can achieve a speedup of at most:

Speedup <= 1/(F + (1-F)/N)

where N is the number of processors on the machine, and F is the fraction of the calculation that must be executed serially.

e.g. a program in which half of the processing must be executed serially can be sped up only by a factor of two, regardless of the number of processors. A program in which 10% must be executed serially can be sped up by at most a factor of 10.

Testing Concurrent Programs

  • Measure throughput
  • Sometimes we also want to measure the variance of service time

Nonfair semaphores provide much better throughput and fair semaphores provide (typo in JCIP lol) lower variance, unless threads are continually blocking anyway because of tight synchronization requirements (e.g. a bounded buffer with buffer size of one). Because results are dramatically different, Semaphore forces its clients to decide which of the two factors to optimize for.

Java is a dynamically compiled language, compared to statically compiled like C or C++. The HotSpot JVM (and other modern JVMs) uses a combination of bytecode interpretation and dynamic compilation. When a class is first loaded, the JVM executes it by interpreting the bytecode. At some point, if a method is run often enough, the dynamic compiler kicks in and converts it to machine code; when the compilation completes, it switches from interpretation to direct execution.

(from https://www.ibm.com/developerworks/java/library/j-jtp12214/) JIT compiler: Just-in-time compiler. Converts all bytecodes into machine code before execution, but does so in a lazy fashion: The JIT only compiles a code path when it knows that code path is about to be executed. This approach allows the program to start up more quickly, as a lengthy compilation phase is not needed before any execution can begin. While technically, a JIT-based virtual machine compiles each bytecode before it is executed, the term JIT is often used to refer to any dynamic compilation of bytecode into machine code–even those that can also interpret bytecodes.

HotSpot dynamic compilation: Combines interpretation, profiling, and dynamic compilation. Rather than convert all bytecodes into machine code before they are executed, HotSpot first runs as an interpreter and only compiles the “hot” code–the code executed most frequently. As it executes, it gathers profiling data, used to decide which code sections are being executed frequently enough to merit compilation. Only compiling code that is executed frequently has several performance advantages: No time is wasted compiling code that will execute infrequently, and the compiler can, therefore, spend more time on optimization of hot code paths because it knows that the time will be well spent. Furthermore, by deferring compilation, the compiler has access to profiling data, which can be used to improve optimization decisions, such as whether to inline a particular method call.

One way to prevent compilation from biasing your results is to run your program for a long time (at least several minutes) so that compilation and interpreted execution represent a small fraction of the total run time.

One of the challenges of writing good benchmarks is that the optimizing compilers are adept at spotting and eliminating dead code–code that has no effect on the outcome. Since benchmarks often don’t compute anything, they are an easy target for the optimizer.

Explicit locks

Building Custom Synchronizers

Atomic Variables and Nonblocking Synchronization

Nonblocking algorithms use low-level atomic machine instructions such as compare-and-swap instead of locks to ensure data integrity under concurrent access. Nonblocking algorithms are used extensively in operating systems and JVMs for thread and process scheduling, garbage collection, and to implement locks and other concurrent data structures.

They don’t block when multiple threads contend for the same data. They are immune to deadlock and other liveness problems. In lock-based algorithms, other threads cannot make progress if a thread goes to sleep or spins while holding a lock, whereas nonblocking algorithms are impervious to individual thread failures.

Suspending and resuming a thread has a lot of overhead and generally entails lengthy interruption. e.g. when a thread is resumed, it may have to wait for other threads to finish their scheduling quanta before it is actually scheduled.

Exclusive locking is a pessimistic technique. For fine-grained operations, there is an alternate optimistic approach. It uses collision detection to determine if there has been interference from other parties during the update, in which case the operation fails and can be retried (or not).

Compare-and-swap (CAS) has 3 operands:

  • a memory location V
  • the expected old value A
  • the new value B

It means: update V to the new value B, but only if the value in V matches the expected old value A. The typical pattern for using compare-and-swap (CAS) is to read the value A currently in V, derive the new value B from it, and then use CAS to atomically change V from A to B, as long as no other thread has changed V to another value in the meantime.
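e.g. a nonblocking counter built on this pattern (in the spirit of JCIP's CasCounter, here using AtomicInteger's CAS):

import java.util.concurrent.atomic.AtomicInteger;

class CasCounter {
    private final AtomicInteger value = new AtomicInteger();

    public int get() { return value.get(); }

    public int increment() {
        int v;
        do {
            v = value.get();                      // read the current value
        } while (!value.compareAndSet(v, v + 1)); // retry if another thread raced us
        return v + 1;
    }
}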

The atomic variable classes, e.g. AtomicInteger, provide get and set methods with the same memory semantics as reads and writes to a volatile int.

The atomic array classes are arrays whose elements can be updated atomically. The atomic array classes provide volatile access semantics to the elements of the array, but a volatile array has volatile semantics only for the array reference, not for its elements.

The key to creating nonblocking algorithms is figuring out how to limit the scope of atomic changes to a single variable while maintaining data consistency.

The Java Memory Model

Suppose one thread assigns a value to aVariable:

aVariable = 3;

A memory model addresses the question “Under what conditions does a thread that reads aVariable see the value 3?” In the absence of synchronization, there are a few reasons why a thread might not immediately or ever see the results of an operation in another thread:

  • compilers may generate instructions in a different order than the “obvious” one suggested by the source code
  • compilers may store variables in registers instead of in memory
  • processors may execute instructions in parallel or out of order
  • caches may vary the order in which writes to variables are committed to main memory
  • values stored in processor-local caches may not be visible to other processors

In a shared-memory multiprocessor architecture, each processor has its own cache that is periodically reconciled with main memory. Processor architectures provide varying degrees of cache coherence; some provide minimal guarantees that allow different processors to see different values for the same memory location at virtually any time. An architecture’s memory model tells programs what guarantees they can expect from the memory system, and specifies the special instructions required (called memory barriers or fences) to get the additional memory coordination guarantees required when sharing data. Java provides its own memory model to shield the Java developer from the differences between memory models across architectures.

This is not safe:

@NotThreadSafe
public class UnsafeLazyInitialization {
    private static Resource resource;
    public static Resource getInstance() {
        if (resource == null)
            resource = new Resource(); // unsafe publication
        return resource;
    }
}

This is unsafe not only because thread B can see a stale value of resource, but also because there is no happens-before ordering between the write of resource in thread A and the read of resource in thread B. If A instantiates a new Resource, B could see A’s actions in a different order than A performed them, i.e. B could see the write to resource as occurring before the writes to the fields of the Resource, i.e. B could see a partially constructed Resource that may be in an invalid state and may unexpectedly change later.

Static initializers are fine: the JVM acquires a lock during class initialization, and this lock is acquired by each thread at least once to ensure that the class has been loaded, so memory writes made during static initialization are automatically visible to all threads. i.e. this is ok:

@ThreadSafe
public class EagerInitialization {
    private static Resource resource = new Resource();
    public static Resource getResource() { return resource; }
}

You can also use the lazy initialization holder class idiom (the first call to getResource by any thread causes ResourceHolder to be loaded and initialized by the static initializer):

@ThreadSafe
public class ResourceFactory {
    private static class ResourceHolder {
        public static Resource resource = new Resource();
    }
    public static Resource getResource() {
        return ResourceHolder.resource;
    }
}

Initializing with final fields is also ok, because all writes to final fields made by the constructor, as well as to any variables reachable through those fields, become “frozen” when the constructor completes, and any thread that obtains a reference to that object is guaranteed to see a value that is at least as up to date as the frozen value. i.e. this is ok because states is final:

@ThreadSafe
public class SafeStates {
    private final Map<String, String> states;

    public SafeStates() {
        states = new HashMap<String, String>();
        states.put("alaska", "AK");
        states.put("alabama", "AL");
        ...
        states.put("wyoming", "WY");
    }
    public String getAbbreviation(String s) {
        return states.get(s);
    }
}