April 26, 2012

Transactional Memory: a Possible Solution for Multicore Systems?

Community

As multicore systems become more prevalent, system architects are hard at work trying to provide functionality that allows applications to use those cores effectively. One proposal that is moving from the theoretical stage to the practical stage is known by the terminology "transactional memory" (TM), or "Hardware Transactional Memory (HTM).

Recently, Intel have announced support for transactional memory in their new "Haswell" line of chip architectures, and they've additionally provided some examples of how the new features work, as well as a complete and detailed specification (see chapter 8, in particular).

Intel describes the new functionality in the specification as follows:

Intel Transactional Synchronization Extensions (Intel TSX) allow the processor to determine dynamically whether threads need to serialize through lock-protected critical sections, and to perform serialization only when required. This lets the processor to expose and exploit concurrency hidden in an application due to dynamically unnecessary synchronization.

With Intel TSX, programmer-specified code regions (also referred to as transactional regions) are executed transactionally. If the transactional execution completes successfully, then all memory operations performed within the transactional region will appear to have occurred instantaneously when viewed from other logical processors. A processor makes architectural updates performed within the region visible to other logical processors only on a successful commit, a process referred to as an atomic commit.

As Greg Pfister notes in his blog article on TSX

The term "transaction" comes from contract law, was picked up by banking, and from there went into database systems. It refers to a collection of actions that all happen as a unit; they cannot be divided.

...

The notion of transactional memory is much the same: a collection of changes to memory is made all-or-nothing: Either all of them happen, as seen by every thread, process, processor, or whatever; or none of them happen. So, for example, when some program plays with the pointers in a linked list to insert or delete some list member, nobody can get in there when the update is partially done and follow some pointer to oblivion.

In his programming example, Coarse-grained locks and Transactional Synchronization explained, James Reinders of Intel describes how the transactional memory features enable improved concurrency:

The two operations that map to the same hash table entry will need to be staggered. This will occur even if we are unlucky enough to have them happen at the same time. In such a case, the Intel TSX will detect that the lock was indeed needed and some locking overhead will be incurred. What would actually happen in such a case, is that the colliding tasks will proceed into the protected code until the processor detects the conflict. As such a point, both updates will abort their protected code (also called the transaction). The most common solution then is to have each task proceed but actually enforce the lock on the second try. This means that one task will win, and delay the other, until the operation is complete. The precise decision on how to handle the collision is either up to the processor implementation with HLE, or the programmer with RTM. The processor implementation for HLE will also be fairly simple and conservative, in order to preserve the semantics of the original lock and hence compatibility with processors that lack Intel TSX.

Over on the Real World Technologies blog, David Kanter has dug into the TSX extensions in detail in his article: Analysis of Haswell's Transactional Memory. As Kanter notes, the study of transactional memory starts with the original academic work of Herlihy and Moss nearly two decades ago: Transactional memory: architectural support for lock-free data structures. Since that time, system builders have experimented with various implementations: Kanter describes the evolution through Transmeta's Crusoe processor, Sun Microsystems's Rock processor, AMD's Advanced Synchronization Facility, IBM's Blue Gene/Q processor, the Azul system, and more.

After describing the high-level overview of the Intel Haswell XBEGIN, XEND, XABORT, XACQUIRE, and XRELEASE instructions, Kanter indulges in some informed speculation about how the new features might work, observing that they're likely to build upon the memory management functionality of the system's multi-level memory caches:

Haswell's transactional memory is most likely a deferred update system using the per-core caches for transactional data and register checkpoints. Specifically, the tags for each cache line in the L1D and L2 probably contain bits that indicate whether the line belongs to the read-set (RS) or write-set (WS) for the two threads that can execute on a Haswell core. A store during a transaction will simply write to the cache line, and the shared L3 cache holds the original data (which may require an update to the L3 for the first write in a transaction to a Modified line). This is consistent with the shared and inclusive L3 cache that has been used since Nehalem. To avoid data corruption, any transactional data (i.e. lines in the RS or WS) must stay in the L1D or L2 and not be evicted to the L3 or memory. The architectural registers are checkpointed and stored on-chip, and the transaction can read and write from registers freely.

Although a lot of work has happened in the transactional memory field over the past two decades, it is still a rapidly developing field and there is much that is not yet known about how transactional memory works on real systems. In this area, it is interesting to look at the work of the MetaTM team led by Professor Emmett Witchel of the University of Texas.

The MetaTM project has been active for a number of years, and has explored a number of different areas:

  • Building a sophisticated hardware simulator capable of realistically evaluating different aspects of any particular transactional memory implementation
  • Modifying a large and realistic body of software (specifically, the Linux kernel) to use transactional memory in its implementation
  • Benchmarking different strategies and implementations to advance our knowledge of how transactional memory might work in practice
  • Evaluating the ease-of-use, complexity, and maintainability of transactional memory programming compared to other styles of programming.

You can find many interesting results from the MetaTM team linked from Christopher Rossbach's home page; Rossbach was one of the principal researchers on the MetaTM project. I was particularly fond of these papers:

  • Maximum Benefit from a Minimal HTM. In this paper, the authors conduct a detailed examination of the full Linux kernel running with transactional memory support. They find that:
    even a restricted version of HTM can address the synchronization problems of large, complex code bases by bringing the performance of fine-grained locking to coarse critical regions.

    A small runtime system can provide a simple, unbounded programming model for user-level transactions. Unbounded transactions proceed with the speed of HTM for most transactions and gracefully degrade in performance for those that overflow hardware resources.

  • Is Transactional Programming Actually Easier?. In this paper, which is more a "psychology of computer programming" approach, the authors investigate how programmers adapt to using transactional memory as a programming primitive. They observed that:
    students found transactions harder to use than coarse-grain locks, but slightly easier to use than fine-grained locks.

    Detailed examination of synchronization errors in the students' code tells a rather different story. Overwhelmingly, the number and types of programming errors the students made was much lower for transactions than for locks. On a similar programming problem, over 70% of students made errors with fine-grained locking, while less than 10% made errors with transactions.

When it comes to concurrency, ease of programming, and, more crucially, ease of correct programming is paramount, so it's great that the MetaTM team have been evaluating these impacts.

What sorts of programming projects are likely to arise from the increasing availability of transactional memory? Jeremy Manson, in an article on his blog, speculates that:

There is also likely to be a renaissance in lock-free data structures. Currently, lock-free data structures are pretty much all built on top of one non-blocking atomic instruction: the compare-and-swap (LOCK: CMPXCHG on x86, or AtomicReference.compareAndSet() to Java programmers). Trying to shoehorn everything you might want to do atomically into a single instruction that operates on one word has led to some very interesting data structures over the last 20 years or so, but is really the wrong way to solve the problem.

Meanwhile, we should start to see an increasing amount of hard evidence about how much benefit transactional memory can provide, and when. A few years ago, Cliff Click of Azul presented some of their early experiences. One of the Azul findings, discussed on Click's blog, was that implementations that depended on cache size were problematic, although most algorithms could be revised in small ways to significantly improve performance:

Users don't write "TM-friendly" code. Neither do library writers. For example the "modcount" field in Hashtable is bumped for every update in the Hashtable. For a large Hashtable we can expect most mods to be in unrelated portions of the Hashtable. i.e., without the modcount field update we would expect the HTM to allow parallel 'put' operations. With the modcount field update, 'put' operations always have a true data conflict

...

Many times a small rewrite to remove the conflict makes the HTM useful. But this blows the "dusty deck" code - people just want their old code to run faster. The hard part here is getting customers to accept that a code rewrite is needed. Once they are over that mental hump, once a code rewrite is "on the table" - then the customers go whole-hog. Why make the code XTN-friendly when they can make it lock-friendly as well, and have it run fine on all gear (and not just HTM-enabled gear)?

Multicore systems are here to stay, and the software field is racing furiously to understand how to use these systems effectively. The stakes are high: the systems that are best able to take advantage of modern multicore systems are the ones that will be used and will endure and succeed. It's not a certainty that transactional memory is the best approach, but it is a powerful and compelling technique, and as it starts to find widespread deployment, understanding what it is and how it works is quite interesting.