I normally avoid this sort of low-level stuff like the plague, but: since I do lot of multi-threaded programming with large, dynamic data structures this has turned out to matter…
I wound up doing C++ in Windows again, and this time decided to look at InterlockedIncrement and friends. I also looked at atomic.h in my preferred operating system, Linux.
The size and performance difference between mutex lock/unlock as compared to an atomic bus-locked operations is:
On my 64-bit Ubuntu system, it takes a 40-byte mutex to synchronize access to an 8-byte integer or pointer. While I prefer the mutex for portability and features, it’s not practical for wrapping very simple counter or pointer swap operations when there are billions of objects.
An atomic operation in assembler consists of placing the “lock” prefix before certan operations, e.g., compare and exchange, increment, decrement, various ors and ands. The lock prefix causes the system to lock the memory bus for the location of the variable you’re operating on.
Each mutex *lock* runs about 20 assembler instructions, of which *two* are atomic instructions. After that’s done, you can do your increment or swap. Finally, you call pthread_mutex_unlock, for another dozen or two instructions, including two more atomic operations. The fact that the pthread_mutex_lock/unlock are function calls adds some overhead, but that’s a drop in the bucket after everything else is considered.
Finally, you can expect a several-fold increase in speed when using the atomic operations. A single atomic operation is about three or four assembler instructions, of which one is the increment, decrement, or whatever that you’re interested in.
Atomic operations require no initialization . If you’re creating and destroying thousands of objects per second then a pthread mutex or that other system’s CriticalSection add more burden.
A call to pthread_mutex_lock requires 19 assembly instructions… and then there’s still the unlock.
'>
0x00002abee47cd2f0 <pthread_mutex_lock+0> push %rbp
0x00002abee47cd2f1 <pthread_mutex_lock+1> mov %rdi,%r8
0×00002abee47cd2f4 <pthread_mutex_lock+4> push %rbx
0×00002abee47cd2f5 <pthread_mutex_lock+5> sub $0×8,%rsp
0×00002abee47cd2f9 <pthread_mutex_lock+9> mov %fs:0×90,%r9d
0×00002abee47cd302 <pthread_mutex_lock+18> movslq 0×10(%rdi),%rax
0×00002abee47cd306 <pthread_mutex_lock+22> cmp $0×13,%rax
0×00002abee47cd30a <pthread_mutex_lock+26> ja 0×2abee47cd320 <pthread_mutex_lock+48>
0×00002abee47cd336 <pthread_mutex_lock+70> mov $0×1,%esi
0×00002abee47cd33b <pthread_mutex_lock+75> xor %eax,%eax
0×00002abee47cd33d <pthread_mutex_lock+77> lock cmpxchg %esi,(%r8)
0×00002abee47cd342 <pthread_mutex_lock+82> jne 0×2abee47cd5d9 <pthread_mutex_lock+745>
0×00002abee47cd348 <pthread_mutex_lock+88> mov 0×8(%r8),%ecx
0×00002abee47cd34c <pthread_mutex_lock+92> test %ecx,%ecx
0×00002abee47cd34e <pthread_mutex_lock+94> jne 0×2abee47cd559 <pthread_mutex_lock+617>
0×00002abee47cd354 <pthread_mutex_lock+100> addl $0×1,0xc(%r8)
0×00002abee47cd359 <pthread_mutex_lock+105> mov %r9d,0×8(%r8)
0×00002abee47cd35d <pthread_mutex_lock+109> add $0×8,%rsp
0×00002abee47cd361 <pthread_mutex_lock+113> pop %rbx
0×00002abee47cd362 <pthread_mutex_lock+114> pop %rbp
0×00002abee47cd363 <pthread_mutex_lock+115> xor %eax,%eax
0×00002abee47cd365 <pthread_mutex_lock+117> retq
The atomic increment, on the other hand, is about four assembler operations, not counting the stack setup and return. And there’s no subsequent unlock step.
0x0000000000400628 <increment2+0> push %rbp
0x0000000000400629 <increment2+1> mov %rsp,%rbp
0×000000000040062c <increment2+4> mov %rdi,0xfffffffffffffff8(%rbp)
0×0000000000400630 <increment2+8> mov 0xfffffffffffffff8(%rbp),%rdx
0×0000000000400634 <increment2+12> mov 0xfffffffffffffff8(%rbp),%rax
0×0000000000400638 <increment2+16> lock incl (%rdx)
0×000000000040063b <increment2+19> leaveq
0×000000000040063c <increment2+20> retq
If you’re created and destroying thousands of objects per second, each with its own mutex, then you’ll have to take mutex initialization and teardown into account.
The code below, source of the assembler dumps, illustrates the two approaches. The assembler code is lifted from /usr/include/asm-x86_64/atomic.h.
#include <pthread.h>
#include <stdio.h>
pthread_mutex_t gMutex = PTHREAD_MUTEX_INITIALIZER;
void mutexed_increment(unsigned long* pn)
{
pthread_mutex_lock(&gMutex);
++*pn;
pthread_mutex_unlock(&gMutex);
}
static inline atomic_increment(unsigned long* pn)
{
__asm__ __volatile__(
"lock; incl %0"
:"=m" (*pn)
:"m" (*pn));
}
int main (int argc, char **argv)
{
unsigned long i = 0;
printf( "Size of pthread_mutex_t: %u; sizeof long: %u\n",
sizeof(pthread_mutex_t),
sizeof(i));
mutex_increment(&i);
atomic_increment(&i);
printf("i=%u\n", i );
return 0;
}
Conclusion
Native synchronization code can offer significant benefits in high-performance server applications, especially where there is a very large number of objects to be synchronized, and where synchronized access occurs at a high rate.
- Several-fold memory savings; synchronized object requires no corresponding synchronization object
- Several-fold reduction in clock cycles per object access
Measurable if there is a very high rate of access to the synchronized objects
- No mutex initialization or tear-down.
Useful when synchronized objects are created and destroyed at high rates