SDL: Adding Atomic Operations to SDL 1.3, part #3 Redesign
In part #2 I looked at the built in atomic operations in QNX, Windows, Max OS X, and Linux. I found that they are all different. Some, especially QNX, are more different than others. Part #2 did a much more complete comparison of the different sets of atomic operations than I did when I designed the atomic ops for SDL 1.3. Sad, but true, I do a better job when I am writing for others than when I know no one will question my thought process. That is why public review is so important.
What have I learned? Well, first off, there is no point in supporting 8 and 16 bit atomic ops. By and large they are not available on Windows, Mac OS X, or QNX. So, why bother. OK, I just ripped them out of the code.
What next?
Let's identify some goals for the redesign. The original design assumed that it was reasonable to have operations that work on specific sizes of operations. The trouble is that if you write code to manipulate pointers you need to know the size of the pointer. Pointers can be several different sizes depending on the architecture of the computer. And, history has taught us that pointers need not be something that can be cast to any type of int. So, one goal is to evaluate the need for atomic operations on pointers.
I also want to reduce the amount of emulation as much as possible. That means that if I can't figure out how to implement an operation on the four operating systems I have looked at, I'm not going to include it at all.
Most importantly I want to look at how the operations might be used and make sure we have a set of operations that supports those uses. Yeah, I'm going to guess how people might use these things... a very dangerous task. Since I can't know what other people will want to do I am going to take a look at what I want to do with atomic operations. So what are the tasks I envision for atomic ops?
-
Reference counts: I am very fond of using reference counts for storage management. In a multithreaded environment reference counts must be incremented and decremented atomically.
-
Spin locks: Locks that depend on TestThenSet and Clear to implement simple mutexes.
-
Queues: Probably the oldest interprocess communication system ever devised, queues need to be implemented efficiently. I want to look at queues with N readers and N writers as well as the case where there is a single reader and a single writer.
-
Read and Write locks: Atomic ops will be used to solve the multiple reader, multiple writer problem on structures such as binary space partitioning (BSP) trees and octrees. I expect that atomic ops will be used to share complex data structures with locks at the lowest level of the structures similar to record level locks in databases.
Let's take a look at how each of these examples are implemented using atomic operations.
Reference Counts
Reference counts are, literally, a count of the number of pointers that point to a location in memory. When the count drops to zero, the memory can be freed. I like reference counts because they automatically clean up dynamically allocated memory while avoiding the long breaks in execution that can result from poorly implemented garbage collectors. They do cost a bit in execution time, but they result in the consistent execution times needed by interactive software.
In a multiprocessor system reference counting code must be atomic if you want the counts to be correct. Think about what happens if a processor is incrementing a count while another processor is decrementing it. Each processor loads the value into a register, then one processor increments the value and one decrements it. Then both processors store the value back into memory. After one increment and one decrement the count should be the same as before either operation occurred. But, in a multiprocessor environment the thread that stores last determines the final value. That value will be either one too high, or one too low. We must make sure that no processor can fetch a reference count until after any other processor has changed and stored it value.
The result of incorrect reference counts are memory leaks when the count is erroneously set too high and references to unallocated memory when the count is set too low.
Let's look at how reference counting works by looking at a bit a pseudo C++ code:
Thing * a = new(Thing);
Thing is a reference counted type. We have declared a of type pointer to Thing and assigned it a pointer to a new(Thing). At This point the new Thing has a reference count of 1. Notice that reference counts are associated with the Thing and not the pointer.
Thing * b = new(Thing);
B is also a pointer to Thing and its current value is the new Thing. The new Thing has a reference count of 1.
b = a;
The contents of a are copied into b. *A's reference count is incremented and *b's reference count is decremented. The order the operations is very important. *A's reference count must be incremented before *b's reference count is decremented. If you do them in the other order, and they both point to the same Thing, then that Thing's reference count will drop to zero and it will be deleted just before it's reference count is incremented. That is a nasty bug.
After the assignment is finished *a's Thing will have a reference count of 2 and b's original Thing will have a reference count of 0. The Thing that b used to point to will have been deleted.
a = 0;
A now points to nothing and the reference count of the remaining Thing gets decremented to 1.
b = 0;
B now points to nothing and the reference count of the remaining Thing gets decremented to 0. That Thing is deleted.
From looking at this example we know that we need an atomic increment and decrement. We also know that we must preserve the order of the atomic operations. The increment must complete before the decrement begins. We also have to test the value of the reference count after it is decremented. We must test because when the count is zero the memory is deleted.
If the decrement is atomic, and the fetch is not, we can get a nasty situation. The decrements are both atomic so they must complete without interference from another processor. But, if both processors are allowed to fetch the value at the same time, we can wind up with both processors deleting the memory. That is usually a crash bug.
-
Time
Processor #1
Processor #2
Count Value
0
2
1
AtomicDecrement
1
2
AtomicDecrement
0
3
Fetch
Fetch
0
4
Delete if count == 0
Delete if count == 0
0
It seems clear that we must have a DecrementThenFetch operation that combines decrementing the value and then fetching the value as a single atomic operation. But, it turns out that we can make due with a FetchThenDecrement operation and test for a count of 1 instead of 0. That works because we know that 1 – 1 = 0.
It should be noted that for addition and subtraction pre-fetch operation can be implement in terms of post-fetch operations and vice versa by doing a little extra arithmetic. And, of course, increment and decrement can be implemented in terms of addition and subtraction.
To do reference counting I need an atomic add or increment operation and a decrement or subtraction operation that combines a fetch with the decrement operation. It appears that all of the required operations are available on all four operating systems.
Spin Locks
The spin lock is an old old way of implementing a mutex. It is called a spin lock because if a thread comes along and finds the mutex is already locked, it sits in a tight loop (spinning) waiting for the mutex to be unlocked. Spin locks are simple to implement if you have the right atomic operations. On the one hand, processors can waste a lot of time waiting if the lock is kept locked for long periods of time. OTOH, spin locks never send a waiting thread off to the operating system to be rescheduled.
A spin lock value is just a word in memory that can be set to 0 or 1, or more formally to Locked, and Unlocked.
To lock the mutex we need an atomic operation that first gets the current value of the lock and then sets it to a new value. The new value is always set to Locked so that after the operation has completed the mutex is locked. The old value may be Locked, if the mutex is locked. Or, it may be Unlocked, if the mutex was unlocked. If the mutex was already locked then setting its value to Locked does not change its value and if it was not locked, then setting its value to Locked locks the mutex.
There are many ways to implement the locking operations. Some machines combine fetching, testing, and setting the mutex value into a single atomic instruction called TestAndSet. (All through these articles I have called test-and-set by the name TestThenSet. I do that because the operation tests and then sets a value. It doesn't do them at the same time.) Other machines provide an atomic exchange operation. Even an atomic FetchThenOr(1) can be used to implement a spin lock.
To unlock the mutex all we have to do is set its value to Unlocked. The clear operation can be a special atomic Clear, an atomic And(0), or an atomic exchange operation. All it has to do is set the mutex to Unlocked. No testing is needed.
Here is some pseudo C++ code that implements a spin lock using an atomic exchange operation:
SpinLock lock = Unlocked; // create and initialize a spin lock
void Lock(SpinLock &m)
{
SpinLock old = Locked;
AtomicExchange(old, m);
while(old == Locked)
{
AtomicExchange(old, m);
}
}
void Unlock(SpinLock * m)
{
SpinLock new = Unlocked;
AtomicExchange(new, m);
}Here is pseudo code for a spin lock implemented using a TestThenSet operation that returns true when it successfully locks the mutex and false otherwise and a Clear that unlocks the mutex:
SpinLock lock = Unlocked; // create and initialize a spin lock
void Lock(SpinLock &m)
{
while(!TestThenSet(m))
{
}
}
void Unlock(SpinLock * m)
{
Clear(m);
}All the operating systems provide the tools needed to implement spin locks. They do not all provide an atomic exchange. The question is do we provided TestThenSet and Clear operations, do we provide spin locks, or do we do both?
One thing that bothers me about spin locks is that the wait loop doesn't end until some other thread does something. That wait loop ties up a hardware processor for an indefinite time. What if the thread we are waiting on has lost its processor? If that happens we could be waiting for a long time doing absolutely nothing useful. It would be nice to make that processor available to do useful work instead of just sitting there. So, how do we gracefully give up the processor?
Once you start thinking like that it is easy to wind up coding a cooperative threading package on top of the existing thread package :-) Which is fun to try and even more fun to do, but... you might just consider putting a call to SDL_Delay(0) inside the wait loop. Not nice, in fact I've called doing that a “sin” in the past. But, it just might be the right thing to do in this case.
Queues
I've only seen two ways to implement queues. You can implement them using a linked list, or using a fixed length vector. My examples are going to use fixed length queues because the pointer manipulation needed to implement linked list queues require very specialized atomic operations or they require the use of a lock.
Queues have one rule, the first thing in is the first thing out. This property is also known as the FIFO rule. This is important because believing in the FIFO rule can get you in trouble. People tend to believe that the first thing out will be the first thing processed. That isn't necessarily true when dealing with multiple processors. If the first item pulled off the queue takes longer to process than the second item and they are being processed by two different threads then they will appear to have been processed out of order because they finished out of order. This problem can be a real brain twister when you are trying to figure out what your code really does. I lost of a lot of sleep the first time I ran into this problem.
In multiprocessor programming queues are used in four different ways:
-
Single Writer
Single Reader
Single Writer
Multiple Readers
Multiple Writers
Single Reader
Multiple Writers
Multiple Readers
There is a well known, so called “lockless” algorithm for creating fixed length queues with a single reader and a single writer. I say “so called” because it is not truly lockless. The algorithm depends on the fact that reads and writes of certain properly aligned chunks of memory are actually atomic operations. On x86 machines properly aligned reads and writes of length 1, 2, 4, 8, and 16 bytes are guaranteed to be atomic. The algorithm is also sensitive to operation order and there are reports of it failing as the result of optimization.
The queue is implemented as a fixed length buffer with a head index and tail index. The algorithm works because only the writer thread is allowed to change the head index and only the reader thread is allowed to change the tail index.
There are some rules you need to know to understand how this queue works:
-
If head == tail then the queue is empty.
-
If ((head + 1) % size) == tail then the queue is full.
-
Otherwise, the queue is neither empty nor full.
-
Head always points to an empty slot in the queue. That means that there is always an empty slot in the buffer.
Here is C pseudo code for the algorithm:
volatile int head = 0;
volatile int tail = 0;
const int size = capacity + 1; // the length of the buffer, see rule #4
Item buffer[size]; // the storage for the queuePush an item on the queue. Return true if successful, false otherwise. It should only fail if the queue is full.
bool push(Item t)
{
int ourHead = head; // atomic fetch
int next = (ourHead + 1) % size;
if (next == tail) // the queue is full
{
return false;
}
else
{
buffer[ourHead] = t;
head = next; // atomic store
return true;
}
}Pop an item off the queue. Return true if successful, false otherwise. Should only fail if the queue is empty.
bool pop(Item &t)
{
int ourTail = tail; // atomic fetch
if (ourTail != head) // the queue is not empty
{
t = buffer[ourTail];
ourTail = (ourTail + 1) % size;
tail = ourTail; // atomic store
return true.
}
else
{
return false
}
}The interesting thing about this implementation of a queue isn't that it works in the single reader/single writer case it's how easy it is to make it work in the multiple readers/writers cases. It turns out that this code works just fine for multiple readers and multiple writers so long as no more than one reader and one writer is active at the same time. That means that all we have to do is guard push() and pop() with their own mutexes and we have a solution that works for all cases.
Here is pseudo code for a simple way to wrap push() and pop() to create thread safe tsPush() and tsPop().
Mutex pushLock = Unlocked;
Mutex writeLock = Unlocked;
bool tsPush(Item t)
{
bool result;
Lock(pushLock);
result = push(t);
Unlock(pushLock);
return result;
}
bool tsPop(Item &t)
{
bool result;
Lock(popLock);
result = pop(t);
Unlock(popLock);
return result;
}Both push() and pop() return a boolean for a reason; Both operations can fail. Push() will fail if the queue is full and pop() will fail if the queue is empty. That means your code must call these functions inside a loop that acts just like the loop inside a spin lock. This is another case where you might want to use operating system provided mutex rather than rolling your own.
It looks like we can support queues without the need for any new atomic operations.
Readers/Writers Problem
Think about a data base with many threads reading and writing it at the same time. If you try to read a record and write the same record at the same time the reader might get half of the information from what was in the record before the write started and half from after the write started. The same kind of nastiness can happen if two threads try to write the record at the same time. OTOH, I can have hundreds of threads reading the record all at the same time with no trouble because they do not change any of the information in the record. That means we can allow any number of readers to be active at a time, but only one writer can be active at a time. You also want to give writes first chance at a record. A writer is carrying new information. Any reader that comes along after the writer needs to wait until all writers are complete so that the reader gets the most up to date information. This is a well known problem and was solved in the early 1970s.
The following pseudo code shows the standard solution to the problem worked out way back in 1971:
int readCount = 0;
int writeCount = 0;
Mutex readCountLock = Unlocked;
Mutex writeCountLock = Unlocked;
Mutex readEntryLock = Unlocked;
Mutex writeLock = Unlocked;
Mutex readLock = Unlocked;
READER()
Lock(readEntryLock)
Lock(readLock);
Lock(readCountLock);
readCount++;
if(readCount == 1) Lock(writeLock);
Unlock(readCountLock);
Unlock(readLock);
Unlock(readEntryLock)
Read Data Here
Lock(readCountLock);
readCount--;
if(readCount == 0) Unlock(writeLock);
Unlock(readCountLock);
WRITER()
Lock(writeCountLock);
writeCount++;
if(writeCount == 1) Lock(readLock);
Unlock(writeCountLock);
Lock(writeLock);
Write Data Here
Unlock(writeLock);
Lock(writeCountLock);
writeCount--;
if(writeCount == 0) Unlock(readLock);
Unlock(writeCountLock);For a much better description of the problem and other solutions please see the original paper 'Communications of the ACM: Concurrent Control with "Readers" and "Writers"' P.J. Courtois,* F. H, 1971 Communications of the ACM:Concurrent Control with "Readers" and "Writers" P.J. Courtois,* F. H, 1971
The code cleanly breaks down into four pieces, a readers lock and unlock and a writers lock and unlock. Some thread packages provide exactly those locks.
The first time you look at the code it seems pretty complex, but notice that blocks like:
Lock(writeCountLock); writeCount--; if(writeCount == 0) Unlock(readLock); Unlock(writeCountLock);
if (DecrementThenFetch(writeCount) == 0) Unlock(readLock);That means two of the mutexes can be eliminated using atomic operations. And, all the locks can be replaced by spin locks. So, we don't need any new operations to solve the readers/writers problems.
Conclusions
This is where I am supposed to enumerate the vast wisdom I gained from analyzing these four problems. I'm afraid I didn't learn as much as I hoped. Or, maybe I learned that my original opinion was correct. Not sure which. In either case it looks like the current set of operations is pretty good.
I did learn that atomic exchange has to be removed. It isn't supported on all systems. I want get rid of emulation when ever I can so I'm just going to get rid of atomic exchange. The atomic compare-then-set operation has to go too. It is not widely supported so it is not portable.
A reader recently pointed out that my current emulation code is totally and completely broken. It has a nasty race condition in it. That validates my decision to reduce or eliminate emulated operations.
I did learn that fetch-then-operate functions are mostly useful for implementing various kinds of locks.
I learned that FetchThenOr and AndThenFetch are handy for implementing atomic operations on sets. (I actually learned that from looking at the functions provided by the different operating systems and the section on spin locks.) As far as I can tell all the systems provide either the ability to implement atomic set operations or they provided them directly. OTOH, they all provide something different.
After looking at the above examples I gained a lot of respect for the set of atomic operations supplied by QNX. It is small, but it can handle all the situations I ran into in my examples.
I've also spent way to much time thinking about operand sizes. I didn't see one case where I would gain anything by having atomic operations on pointers, so that idea gets tossed. No atomic ops on pointers.
I was very tempted to get rid of the operate-then-fetch operations because they are not available on QNX. But, they can be emulated using the fetch-then-operate functions plus a little bit of extra math. For example:
op1 = AtomicAddThenFetch(*op2, 22);is the same as:
op1 = AtomicFetchThenAdd(*op2, 22) + 22;But, it is not the same as:
AtomicAdd(*op2, 22); op1 = op2;
because the value of op2 could be changed after the atomic add and before its value was assigned to op1. Doing the extra math is a hack, not elegant, but acceptable. Also, it does not work for logical operations such as and.
Most of the redesign consists of removing functions and coming up with a correct way to do emulation. My current, incorrect, emulation code is based on using an SDL_mutex which must be created at runtime. In my current code the way the mutex is created contains a huge race condition. I can only think of one way around the problem and that is to use a spin lock. The nice thing about a spin lock is that the data part of the lock is just a single int that can be created and initialized at compile time. Creating the lock at compile time eliminates my race condition. But, it means that the machine must be able to implement a spin lock or else it can not emulate anything. That seems like a reasonable requirement to me and that is what I am going to do.
References
Here is a list of web sites that document the atomic operations provided by GCC, Mac OS X, Windows, and QNX:
GCC:
http://gcc.gnu.org/onlinedocs/gcc-4.4.0/gcc/Atomic-Builtins.html
Mac OS X:
http://developer.apple.com/documentation/Cocoa/Conceptual/Multithreading/ThreadSafety/ThreadSafety.html#//apple_ref/doc/uid/10000057i-CH8-SW14
http://developer.apple.com/documentation/Darwin/Reference/ManPages/man3/OSAtomicIncrement32Barrier.3.html
Windows:
http://msdn.microsoft.com/en-us/library/ms686353(VS.85).aspx
http://msdn.microsoft.com/en-us/library/5704bbxw.aspx
QNX:
http://www.qnx.com/developers/docs/6.3.0SP3/multicore_en/user_guide/how_to.html
If you find more or better links, or links for other operating systems, please let me know so that I can add them to the list.