SDL: Adding Atomic Operations to SDL 1.3, part #2 All the information I can find

Well, this isn't what I thought I was going to be writing as part #2, but the replies I got to part #1 were all from people who wanted to help, but who did not feel they knew enough to help. So, part #2 is going to be a very basic introduction to atomic operations that builds up to a pretty detailed discussion of intrinsic atomic operations in GCC, Windows, and Mac OS X. Lets start with a glossary. You can argue with me about the definitions, please do, but this is the way I define these terms.

Glossary

Processor: AKA the CPU. The processor interprets (executes) machine language instructions. In the bad old days one processor could execute one sequence of instructions at a time. Now days a processor can have several “cores” that each executes one or more sequences of instructions. Not only that, but hyperthreading cores can execute more than one sequence of instructions at a time. A modern processor can have many cores and each core can execute several sequences of instructions at a time. Each individual sequence of instructions being run by a core is called a hardware thread.

Hardware Thread: A sequence of instructions being run by a processor core. Most cores run one thread at a time. Hyperthreading cores can run two or more hardware threads at a time. The number of hardware threads is determined by the hardware. That number can not be changed without changing the hardware.

Software Thread: An execution context. A software thread is a block of memory with space for the thread's stack pointer, instruction pointer, all machine registers, every thing the hardware needs to know to be able to get a hardware thread doing work for a software thread. When a software thread is active all its information is in real hardware registers. When a software thread is suspended the information is copied into the memory. When a software thread is started up the information is copied back into real registers. The number of software threads is limited only by the amount of memory needed to store them.

Mutex: Mutual Exclusion. An set of operations for coordinating cooperation and communication between software threads. These were invented way back in the early days of computing when software threads were plentiful and hardware threads were scarce. They are more concerned with maximizing use of processors than with performance of individual software threads.

Atomic Operations: The word “atomic” means indivisible. If you have ever seen two children grab the same cookie you know why we need atomic operations. If grabbing a cookie were an atomic operation then it would be impossible for two children to grab the same cookie. Atomic cookie grabbing would make sure that no two kids can grab a cookie at the same time and make parents much happier. Atomic operations running in one hardware thread are guaranteed to run without interference by instructions running in another hardware thread. If an atomic operation is going to read, or write a variable it is guaranteed that no other operation, atomic or otherwise, running on any thread in any core in any processor, can read, or write that variable until the current atomic operation has completed.

Barrier: AKA read/write barrier and load/store barrier. Both compilers and processors feel free to reorganize code to make it run faster. This is generally considered to be a good thing. Compilers recognize that sometimes the order that you wrote your code doesn't matter and that by reordering it your code can run faster. Processors also reorder code. Processors often have three or more instructions executing at the same time. If your processor has a register named A and three sequences of instructions in row all use A, but none of them use the value of A used by any of the other three, your processor might play a game of make believe and make believe that it has three A registers so that it can execute all three sequences of instructions at the same time. Processors play even stranger games than that. Look up “speculative execution” if you don't believe me.

Reordering causes serious problems for atomic operations. Let's say you have code that initializes a structure and then you use an atomic operation to set a flag telling other threads that the structure is ready for use. Now imaging the compiler or the processor moves the atomic operation so that it executes before the initialization is done. What happens? You get what is called a race condition. Does the initialization finish before another thread starts using the structure? If no, then you have trouble. If, yes then things work just fine. The result is a bug that is nearly impossible to find.

A “barrier” is recognized by both compilers and processors as something that must not be crossed. Code written before the barrier always runs before the barrier is reached. Code written after the barrier always runs after the barrier is processed. Barriers can slow down your code. But, they are required for your code to work correctly. Therefore, all SDL atomic operations are barriers. There are lots of different kinds of barriers. All I'm going to say is that SDL atomic operations are barriers of a kind needed to make sure the operation works as you would expect it to work, no surprises. Most are full barriers, but some are partial barriers.

Volatile: Volatile is a C/C++ storage class. That is it modifies a type like int so you can say that a variable is a “volatile int” just like you can say that a variable is an “unsigned int”. Volatile is most strange. It tells the compiler that the value of a variable might change when no one is looking. If your car had a volatile color you would never know what color your car is when you look for it in the parking lot at the mall. The color could just change to any color at any time.

Programmers like to think of variables as being somewhere in memory. Compilers like to move variables into registers. Compilers can do that because they know every place in a program that uses the variable so it always knows where the value is. But, what if an atomic operation running in another thread comes along and changes the value in memory while the compiler thinks the value is in a register? Well, you get the wrong value. Even worse, entire chunks of code may be removed from your program. Consider:

int flag = 0;

Do some stuff...

while(flag)

{

Do some other stuff...

}

 

The compiler sees that flag is always zero so it can safely remove the entire while loop from the program. The compiler doesn't know that code in another thread is going to change flag so that sometimes the while loop does get executed. On the other hand, if we change the declaration of flag to look like:

int volatile flag = 0;

Then the compiler knows that it can not make any assumptions about the value of flag and it will leave the while loop alone.

References

Here is a list of web sites that document the atomic operations provided by GCC, Mac OS X, Windows, and QNX:

GCC:

http://gcc.gnu.org/onlinedocs/gcc-4.4.0/gcc/Atomic-Builtins.html

 

Mac OS X:

http://developer.apple.com/documentation/Cocoa/Conceptual/Multithreading/ThreadSafety/ThreadSafety.html#//apple_ref/doc/uid/10000057i-CH8-SW14
http://developer.apple.com/documentation/Darwin/Reference/ManPages/man3/OSAtomicIncrement32Barrier.3.html

Windows:

http://msdn.microsoft.com/en-us/library/ms686353(VS.85).aspx

http://msdn.microsoft.com/en-us/library/5704bbxw.aspx

QNX:

http://www.qnx.com/developers/docs/6.3.0SP3/multicore_en/user_guide/how_to.html

If you find more or better links, or links for other operating systems, please let me know so that I can add them to the list.

Thinking Out Loud

It is very tempting to try to “out think” everyone in the world and try to come up with a high level set of atomic operations. Fortunately, I'm smart enough to know that 1) there are a lot of people in the world who are smarter than I am (hard to believe, but true nonetheless :-), and 2) even if I were the smartest person in the world I only have so much time to think about the problem. I do not have time in my life to come up with all the ways that people might use atomic operations. So, it makes sense to grab a set of existing low level operations and make them part of SDL.

From studying the lists I see several rough classes of operations:

  1. Fetch-Then-Operate: Fetch the value of a variable and then perform and operation on it. Somewhat like C's i++.

  2. Operate-Then-Fetch: Perform and operation on a variable and then fetch its new value. Somewhat like C's ++i.

  3. Barrier: Stand alone barrier operations. Since all of the SDL atomic operations will be barriers I'm going to ignore these. They may be useful for implementing SDL atomic operations, but they will not be part of the SDL atomic operations.

  4. Exchange: Swap the value of two variables.

  5. Swap-Then-Compare: Oddly, these are usually call by a name like test-and-set. These operations swap two values and then test the old value against a known value. They can be implemented using Exchange.

  6. Compare-Then-Swap: First compare a known value against the value of a variable and if they are equal set the variable to a new value.

  7. Clear: Set a variable to a known value, usually, but not always, zero.

Not all kinds of processors support all these operations and they do not support them on all sizes of operands.

What follows are the atomic operations from GCC, Mac OS X, and Windows that meet my requirement of being a barrier. It is important to note that even when the operations have the same, or similar, names they are not necessarily the same operation.

GCC:

The GCC atomic operations work on 8, 16, 32, and 64, bit operations, but not on all machines. For example, on the x86 architecture 64 bit operands are supported starting with the Pentium, but not for any previous processors. All but test-and-set are full barriers and test-and-set is a partial barrier sufficient to ensure that it works as expected.

Fetch-Then-Operate:

type __sync_fetch_and_add (type *ptr, type value, ...)

type __sync_fetch_and_sub (type *ptr, type value, ...)

type __sync_fetch_and_or (type *ptr, type value, ...)

type __sync_fetch_and_and (type *ptr, type value, ...)

type __sync_fetch_and_xor (type *ptr, type value, ...)

type __sync_fetch_and_nand (type *ptr, type value, …)

 

Operate-Then-Fetch:

type __sync_add_and_fetch (type *ptr, type value, ...)

type __sync_sub_and_fetch (type *ptr, type value, ...)

type __sync_or_and_fetch (type *ptr, type value, ...)

type __sync_and_and_fetch (type *ptr, type value, ...)

type __sync_xor_and_fetch (type *ptr, type value, ...)

type __sync_nand_and_fetch (type *ptr, type value, …)

 

Barrier:

__sync_synchronize (…)

 

Exchange:

type __sync_lock_test_and_set (type *ptr, type value, …)

(on some processors)

 

Swap-Then-Compare

type __sync_lock_test_and_set (type *ptr, type value, …)

 

Compare-Then-Swap

bool __sync_bool_compare_and_swap (type *ptr, type oldval type newval, ...)

type __sync_val_compare_and_swap (type *ptr, type oldval type newval, …)

 

Clear

void __sync_lock_release (type *ptr, ...)

 

Mac OS X:

The Mac provides a fairly small set of atomic operations. Its version of test-and-set allows testing and setting specific bits in a value which is very different from the GCC version of test-and-set. Remarkably, there is no exchange operation.

Fetch-Then-Operate:

int32_t OSAtomicOr32OrigBarrier(uint32_t theMask, volatile uint32_t *theValue)

int32_t OSAtomicAnd32OrigBarrier(uint32_t theMask, volatile uint32_t *theValue)

int32_t OSAtomicXor32OrigBarrier(uint32_t theMask, volatile uint32_t *theValue)

 

Operate-Then-Fetch:

int32_t OSAtomicAdd32Barrier(int32_t theAmount, volatile int32_t *theValue)

int32_t OSAtomicIncrement32Barrier(volatile int32_t *theValue)

int32_t OSAtomicDecrement32Barrier(volatile int32_t *theValue)

int32_t OSAtomicOr32Barrier(uint32_t theMask, volatile uint32_t *theValue)

int32_t OSAtomicAnd32Barrier(uint32_t theMask, volatile uint32_t *theValue)

int32_t OSAtomicXor32Barrier(uint32_t theMask, volatile uint32_t *theValue)

int64_t OSAtomicAdd64Barrier(int64_t theAmount, volatile int64_t *theValue)

int64_t OSAtomicIncrement64Barrier(volatile int64_t *theValue)

int64_t OSAtomicDecrement64Barrier(volatile int64_t *theValue)

 

Barrier:

void OSMemoryBarrier(void);

 

Exchange:

 

Swap-Then-Compare

bool OSAtomicTestAndSetBarrier(uint32_t n, volatile void *theAddress)

(this is a bit set and test operation)

 

Compare-Then-Swap

bool OSAtomicCompareAndSwapIntBarrier(int oldValue, int newValue, volatile int *theValue)

bool OSAtomicCompareAndSwapLongBarrier(long oldValue, long newValue, volatile long *theValue)

bool OSAtomicCompareAndSwapPtrBarrier(void* oldValue, void* newValue, void* volatile *theValue)

bool OSAtomicCompareAndSwap32Barrier(int32_t oldValue, int32_t newValue, volatile int32_t *theValue)

bool OSAtomicCompareAndSwap64Barrier(int64_t oldValue, int64_t newValue, volatile int64_t *theValue)

bool OSAtomicCompareAndSwapPtrBarrier(void* oldValue, void* newValue, void* volatile *theValue)

 

Clear

bool OSAtomicTestAndClearBarrier(uint32_t n, volatile void *theAddress);

(this is a bit set and test operation)

 

Windows:

Windows has a rich set of atomic operations. Not as complete as the set provided by GCC, but sufficient for what I want to do.

Fetch-Then-Operate:

long _InterlockedAnd(long volatile * value, long mask);

char _InterlockedAnd8(char volatile * value, char mask);

short _InterlockedAnd16(short volatile * value, short mask);

__int64 _InterlockedAnd64(__int64 volatile* value, __int64 mask);

long _InterlockedOr(long volatile * Value, long Mask);

char _InterlockedOr8(char volatile * Value, long Mask);

short _InterlockedOr16(short volatile * Value, short Mask);

__int64 _InterlockedOr64(__int64 volatile * Value, __int64 Mask);

long _InterlockedXor(long volatile * Value, long Mask);

char _InterlockedXor8(char volatile * Value, char Mask);

short _InterlockedXor16(short volatile * Value, short Mask);

__int64 _InterlockedXor64(__int64 volatile * Value, __int64 Mask);

long _InterlockedExchangeAdd(long volatile * Addend, long Value);

__int64 _InterlockedExchangeAdd64(__int64 volatile * Addend, __int64 Value);

 

Operate-Then-Fetch:

long _InterlockedAdd(long volatile * Addend, long Value);

__int64 _InterlockedAdd64(__int64 volatile * Addend, __int64 Value);

long _InterlockedIncrement(long * lpAddend);

short _InterlockedIncrement16(short * lpAddend);

__int64 _InterlockedIncrement64(__int64 * lpAddend);

long _InterlockedDecrement(long * lpAddend);

short _InterlockedDecrement16(short * lpAddend);

__int64 _InterlockedDecrement64(__int64 * lpAddend);

 

Barrier:

void _ReadWriteBarrier(void);

 

Exchange:

long _InterlockedExchange(long * Target, long Value)

__int64 _InterlockedExchange64(__int64 * Target, __int64 Value)

void * _InterlockedExchangePointer(void * volatile * Target, void * Value)

 

Swap-Then-Compare

unsigned char _interlockedbittestandset(long *a, long b)

unsigned char _interlockedbittestandset64(__int64 *a, __int64 b)

 

Compare-Then-Swap

long _InterlockedCompareExchange(long volatile * Destination, long Exchange, long Comperand)

short _InterlockedCompareExchange16(short volatile * Destination, short Exchange, short Comperand)

__int64 _InterlockedCompareExchange64(__int64 volatile * Destination, __int64 Exchange, __int64 Comperand)

void * _InterlockedCompareExchangePointer(void * volatile * Destination, void * Exchange, void * Comperand)

 

Clear

unsigned char _interlockedbittestandreset(long *a, long b)

unsigned char _interlockedbittestandreset64(__int64 *a, __int64 b)

 

QNX:

None of these are listed as being memory barriers in the QNX documentation I have been able to find. But, I include them because they are another example of atomic operations. I find this set to be very interesting because the set is so small and yet so complete.

Fetch-Then-Operate:

unsigned atomic_add_value(volatile unsigned * loc, unsigned incr);

unsigned atomic_clr_value(volatile unsigned * loc, unsigned bits);

unsigned atomic_set_value(volatile unsigned * loc, unsigned bits);

unsigned atomic_sub_value(volatile unsigned * loc, unsigned decr);

unsigned atomic_toggle_value(volatile unsigned * loc, unsigned bits);

 

Operate-Then-Fetch:

 

These are not really operate-then-fetch operations because they do not return a value.

 

void atomic_add(volatile unsigned * loc, unsigned incr);

void atomic_set(volatile unsigned * loc, unsigned bits);

void atomic_sub(volatile unsigned * loc, unsigned decr);

void atomic_toggle(volatile unsigned * loc, unsigned bits);

 

Barrier:

There seems to be some barrier macros, but they are not well documented.

 

Exchange:

 

Swap-Then-Compare

 

Compare-Then-Swap

 

Clear

unsigned atomic_clr_value(volatile unsigned * loc, unsigned bits);

void atomic_clr(volatile unsigned * loc, unsigned bits);

 

That is what we have to work with. What can we implement with all this? Here are some observations:

  • Mac OS X has no atomic exchange operation.

  • QNX has no atomic exchange operation.

  • QNX has not swap-then-compare or compare-then-swap operations.

  • Mac OS X has no support for 8 and 16 bit operations.

  • QNX seems to only support 32 bit operations.

  • Windows has limited support for 8 and 16 bit operations.

  • Mac OS X has no arithmetic operations in the Fetch-Then-Operate class.

  • QNX has no real Operate-Then-Fetch operations.

  • All the numeric operations on Mac and Windows are signed, all of QNX's are unsigned, and GCC allows any type of argument.

  • Window and Mac provide operations for pointers, not just integers.

  • Test-and-set/Clear are quite different between the three operating systems.

My original design for SDL atomic operations is in conflict with the Windows, Mac OS X, and QNX in several places:

  1. I support 8 and 16 bit operations. Only GCC supports these. Three other systems do not.

  2. All my numeric operations are on unsigned values.

  3. I do not provide pointer versions of the operations. I realize now that that decision forces the programmer to know the pointer size which makes it much harder to write portable code.

  4. I have an exchange operation. Several of the others do not.

  5. Operate-then-fetch is not possible on QNX.

  6. I specify a size for the lock variable for test-and-set/clear.

  7. Windows, Mac OS X, and GCC all support boolean atomic operations that I do not support.

  8. QNX supports an “interesting” set of operations that are not the standard boolean operations.

It looks like a redesign is needed. We'll take that up in part #3.

 

 

 

Who's online

There are currently 0 users and 3 guests online.