Lockless Queue in C# slower than Locked Queue in C#
-
I've got a multi-threaded implementation of a lockless queue in C# using Interlocked.CompareExchange(), and a multi-threaded implementation of a locked queue that uses lock() for enqueueing and dequeueing items. Essentially, I use System.Threading.ThreadPool.QueueUserWorkItem() to queue 5-10 pieces of work (some function), each of which enqueues and dequeues from the lockless/locked queue several hundred thousand times.
A lot of articles have explained that using a lockless queue can improve performance when accessing a data structure, but I'm finding that the locked queue (which is simpler) ends up being faster. From a theoretical standpoint, I understand that synchronization primitives can be expensive, which is part of the reason why lockless should be faster. On the other hand, in a lockless implementation with 5 operations in flight at the same time, one operation will try and succeed, while the others have to redo their work because the underlying data may have changed. That seems like a lot of waste to me and (again, from a theoretical standpoint) not faster than a locked implementation, or even a single-threaded one.
Just to give some context: I'm only using a dumb queue. There's nothing special or intensive about what I'm doing; it just creates a node to add to the queue and takes a node out. In the locked multi-threaded implementation, only the node creation is outside the lock(). Everything else is essentially single-threaded, since each thread has to wait for the lock before adding or removing a node.
Can anyone explain why I'm getting the performance results I'm getting, and generally, what advantages lockless has aside from the one I was already expecting (faster multi-threaded queue access)?
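For reference, the locked enqueue is essentially this shape (a sketch, not my exact code; Node and syncRoot are stand-in names):
public void Enqueue(T data)
{
    // Node creation is the only work done outside the lock.
    Node<T> newNode = new Node<T>(data);

    lock (syncRoot)
    {
        tail.NextNode = newNode;
        tail = newNode;
        count++;
    }
}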
-
Hi, my best guess would be that your test code is too simple: if all the test does is add and remove queue entries, without any real calculation in between, then most CPU cycles are spent in the spin loop around Interlocked. IMO Interlocked is intended for situations where contention is rare, not ones where you create it all the time; under light contention I expect it to work more efficiently than any other synchronization scheme. To confirm this, try to come up with a test where extra CPU cycles are required to process the nodes; I would recommend adding work until you see the number of enqueue/dequeue operations go down by at least 80%, then compare the two schemes again under that computational load. :)
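Something along these lines (a sketch only; LockFreeQueue, the iteration count and the size of the artificial load are placeholders you would tune):
// Sketch: give each work item some real computation per queue operation
// so the threads are not hammering the queue head/tail non-stop.
static LockFreeQueue<int> queue = new LockFreeQueue<int>();

static void Worker(object state)
{
    for (int i = 0; i < 100000; i++)
    {
        queue.Enqueue(i);
        DoRealWork();   // grow this until throughput drops by ~80%
        queue.Dequeue();
        DoRealWork();
    }
}

// Placeholder computational load between queue operations.
static double DoRealWork()
{
    double x = 1.0;
    for (int i = 0; i < 500; i++)
        x = Math.Sqrt(x + i);
    return x;
}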
Luc Pattyn [Forum Guidelines] [My Articles]
Love, happiness and fewer bugs for 2009!
-
I put another variable in my lockless queue to measure the retry count (the number of times it has to spin again because something changed), and in my test bench it seems that an additional 50% of operations are performed, so 100 enqueues end up as something like 100 successful operations plus roughly 50 respins. I'm not sure how my numbers compare with other implementations, but that does seem like a lot. From this result it seems the lockless implementation works best with little contention, but I'm curious to know the theory behind it.
-
Cyrilix wrote:
measure the retry count (the number of times it has to spin again because something changed)
Not sure I understand what you mean. Care to share some code here?
Cyrilix wrote:
I'm curious to know the theory behind it.
You could use .NET Reflector to investigate how it has been implemented in .NET, although I expect you would also need to study the IL specification, and you would still be kept in the dark about what the native code looks like (it is very likely they use special assembly instructions in the end). :)
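That said, the semantics are well defined: CompareExchange behaves as if the following ran as one indivisible step (on x86 the JIT emits a LOCK CMPXCHG instruction for it):
// What Interlocked.CompareExchange(ref location, value, comparand) does,
// as if it executed atomically:
static int CompareExchange(ref int location, int value, int comparand)
{
    int original = location;
    if (original == comparand)
        location = value;
    return original;   // caller checks: original == comparand means success
}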
Luc Pattyn [Forum Guidelines] [My Articles]
Love, happiness and fewer bugs for 2009!
-
My internet was down for a little while, so I couldn't post right away.
public void Enqueue(T data)
{
    LockFreeQueueNode<T> tempTail = null;
    LockFreeQueueNode<T> tempTailNext = null;
    LockFreeQueueNode<T> newNode = new LockFreeQueueNode<T>(data);
    newNode.Data = data;

    do
    {
        tempTail = tail;
        tempTailNext = tempTail.NextNode;

        if (tempTail == tail)
        {
            if (tempTailNext == null)
            {
                // If the tail node we are referring to is really the last
                // node in the queue (i.e. its next node is null), then
                // try to point its next node to our new node.
                if (Interlocked.CompareExchange(ref tempTail.NextNode, newNode, tempTailNext) == tempTailNext)
                    break;
            }
            else
            {
                // We get here when we failed to update the tail's next
                // node: another thread has already appended a new node,
                // but has not yet re-pointed the tail to it. So we try
                // to re-point the tail to the next node of the current
                // tail before retrying.
                Interlocked.CompareExchange(ref tail, tempTailNext, tempTail);
            }
        }

        Interlocked.Increment(ref retryCount); // <-- added to measure retries

    } while (true);

    // We successfully changed the next node of the current tail node to
    // point to our new node, so re-point the tail to our new node too.
    Interlocked.CompareExchange(ref tail, newNode, tempTail);
    Interlocked.Increment(ref count);
}
I believe this code was originally from a CodeProject article (though I happened to find it via a different source). The important line is the one marked "added to measure retries": it increments every time we are unable to break out of the while loop and have to retry the operation.
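The matching Dequeue follows the same pattern. Roughly (this is a sketch of the standard Michael and Scott dequeue rather than the exact code I'm running; it assumes the same head/tail/NextNode fields and a dummy node at the head):
public T Dequeue()
{
    T data;
    do
    {
        LockFreeQueueNode<T> tempHead = head;
        LockFreeQueueNode<T> tempTail = tail;
        LockFreeQueueNode<T> tempHeadNext = tempHead.NextNode;

        if (tempHead == head)
        {
            if (tempHead == tempTail)
            {
                // Queue is empty, or the tail is lagging behind.
                if (tempHeadNext == null)
                    throw new InvalidOperationException("Queue is empty");

                // Help the lagging enqueuer along by advancing the tail.
                Interlocked.CompareExchange(ref tail, tempHeadNext, tempTail);
            }
            else
            {
                // Read the value before the CAS; once the head moves on,
                // another thread could be using this node.
                data = tempHeadNext.Data;
                if (Interlocked.CompareExchange(ref head, tempHeadNext, tempHead) == tempHead)
                    break;
            }
        }
    } while (true);

    Interlocked.Decrement(ref count);
    return data;
}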
-
Hi, OK, if retryCount goes up that much, it tells me not much is happening outside your queue; you and I agree that a "lockless implementation works best with little contention", and to reduce contention, a lot of work should be going on outside the queue. :)
Luc Pattyn [Forum Guidelines] [My Articles]
Love, happiness and fewer bugs for 2009!
-
Contention is always bad, but lockless code should generally be faster under contention than using locks. With lockless code, when multiple threads use CompareExchange concurrently, at least one will succeed, so you're not wasting any time compared to the locked version.
Did you include the Interlocked.Increment(ref retryCount) when timing? Like locks, interlocked operations aren't cheap, so you should avoid doing too many of them. Do you really need "count"? Currently you're doing at least 3 interlocked operations per insertion; maybe a lock really is faster in this case.
And what hardware did you test it on? How many cores? Are you using hyper-threading? You might be better off slowing down the retry: on hyper-threaded processors, retrying too often slows down the other thread running on the same physical core. In general, having all threads constantly hitting memory isn't a good idea. Try adding a small Thread.SpinWait when retrying.
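For example, the retry path could back off like this (a sketch; AtomicAdd is just a stand-in to show the pattern, and the spin counts would need tuning on real hardware):
// Sketch of a CAS retry loop with exponential backoff.
static void AtomicAdd(ref int target, int delta)
{
    int spins = 1;
    while (true)
    {
        int observed = target;
        if (Interlocked.CompareExchange(ref target, observed + delta, observed) == observed)
            return;

        // We lost the race: pause before retrying so we don't saturate
        // the memory bus or starve a hyper-threaded sibling core.
        Thread.SpinWait(spins);
        if (spins < 1024)
            spins *= 2;
    }
}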
-
Yes, I include everything when timing. I also saw that count could be eliminated with no negative effect on the algorithm itself; I just wouldn't have anything to keep track of the approximate count, which is OK. The hardware I'm running on is an Athlon64 X2 4400+ (2.2 GHz) with 2 cores and no hyper-threading.
One thing I did, instead of slowing down the retry, was slow down the enqueue/dequeue rate by adding some other intensive math in between. This reduces the contention rate significantly: before, I'd have about 50% extra operations; slowing it down to a reasonable level drops the additional operations to 4-6%. In that case, the lockless algorithm is equal to or better than the locked one.
One question, though: what do you mean when you say "With lockless code, when multiple threads use CompareExchange concurrently, at least one will succeed"? Isn't it the same with locks? When multiple threads try to enter a critical section, at least one, and only one, will succeed; the others wait their turn.
-
It's not so much the fact that it's lockless; the implementation itself is slow. Each enqueue and dequeue here requires allocating a new class instance (the node), and new is VERY expensive and takes a lock around the GC allocator. So you have removed the lock around enqueue and dequeue but hit the worse GC lock. A performant lock-free queue would pre-create and re-use its nodes, or manage its own memory with structs and unsafe code. An array-based queue would not require new nodes and would only need to allocate when growing the array, so it would offer much better base performance, but it is harder to make lock-free.
Secondly, you are making it worse with a second interlocked operation. Interlocked operations are not cheap: they stall the pipeline and lock the bus. They are cheaper than a full lock, but note that most full locks just use an interlocked operation on a variable and only block if it is already in use. Since an enqueue without a new is FAST, the lock will not really be held for long, and the locked implementation ends up being little more than an interlocked compare itself.
Adding to a queue is probably a very cheap operation, maybe 10 cycles, so there would probably be no contention with only 5 threads doing normal work. (A benchmark generates contention, but that is not a realistic scenario; a real program would spend very little time in enqueue and dequeue.)
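To illustrate the array-based idea (a sketch only; the names and growth policy are made up, and this version is still locked rather than lock-free):
// A simple locked ring-buffer queue that allocates nothing per operation;
// it only allocates when the backing array has to grow.
class RingQueue<T>
{
    private T[] items = new T[1024];
    private int head, tail, count;
    private readonly object gate = new object();

    public void Enqueue(T item)
    {
        lock (gate)
        {
            if (count == items.Length)
                Grow();
            items[tail] = item;
            tail = (tail + 1) % items.Length;
            count++;
        }
    }

    public bool TryDequeue(out T item)
    {
        lock (gate)
        {
            if (count == 0) { item = default(T); return false; }
            item = items[head];
            head = (head + 1) % items.Length;
            count--;
            return true;
        }
    }

    private void Grow()
    {
        // Copy the live items into a larger array, unwrapping the ring.
        T[] bigger = new T[items.Length * 2];
        for (int i = 0; i < count; i++)
            bigger[i] = items[(head + i) % items.Length];
        items = bigger;
        head = 0;
        tail = count;
    }
}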
Ben