Lock free algorithms
-
I've been studying lock-free algorithms, specifically the circular lock-free ring buffer. I fully understand what they're doing, but am having trouble applying that to some of my code. It seems the lock-free ring buffer is meant for situations where the queue is neither starved nor saturated for any "significant" (in CPU terms) period of time. For example, I recently had some code with multiple producers and a single consumer. When the server was running at a normal load, the lock-free ring buffer would have been an excellent solution. However, when the server load was low, the consumer would just be spinning, using quite a bit of CPU doing nothing. Contrariwise, when the server load was high, the producers would be spinning. (In the current code, when this happens, the producers wait for a certain period, then discard their data and grab the next set; losing data periodically was permissible since the next set would allow a full reconstruction, albeit with stutter.) But how long should they spin? I can't use just a count, since time matters in this situation, but grabbing the clock is expensive, and simply looping the thread consumes a lot of CPU, depriving other producers of their quanta. Am I missing something?
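For concreteness, here is a minimal sketch of the kind of buffer under discussion (single producer, single consumer; names are illustrative, not from any particular article). Note that push and pop are non-blocking and simply return false when the buffer is full or empty, which is exactly where a caller ends up spinning:

```cpp
#include <atomic>
#include <cstddef>

// Minimal single-producer/single-consumer ring buffer sketch.
// Holds at most N-1 items (one slot is sacrificed to distinguish
// "full" from "empty").
template <typename T, size_t N>
class SpscRing {
    T buf_[N];
    std::atomic<size_t> head_{0};  // next slot to pop (consumer side)
    std::atomic<size_t> tail_{0};  // next slot to push (producer side)
public:
    bool push(const T& v) {
        size_t t = tail_.load(std::memory_order_relaxed);
        size_t next = (t + 1) % N;
        if (next == head_.load(std::memory_order_acquire))
            return false;                      // full: caller must retry (spin)
        buf_[t] = v;
        tail_.store(next, std::memory_order_release);
        return true;
    }
    bool pop(T& out) {
        size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return false;                      // empty: caller must retry (spin)
        out = buf_[h];
        head_.store((h + 1) % N, std::memory_order_release);
        return true;
    }
};
```

The structure itself is lock-free; the starvation/saturation question is entirely about what the caller does with those false returns.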
-
I posted a lock-free stack a few days ago. There would be no such issues of spinning or locking up the CPU. The only "problem" with it, compared to the circular buffer you're using now, is that the stack is LIFO instead of FIFO. A Fundamental Lock-Free Building Block - The Lock-Free Stack[^] Hope it helps; let me know if you have any questions.
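For reference, a Treiber-style stack along the lines the article describes might look like this. This is a hedged sketch, not the article's actual code, and it glosses over the ABA and node-reclamation problems a production version must solve:

```cpp
#include <atomic>

// Treiber-style lock-free stack sketch (illustrative, not the
// article's code). push/pop retry a compare-and-swap on the head
// pointer; pop returns false on empty rather than blocking.
template <typename T>
class LockFreeStack {
    struct Node { T value; Node* next; };
    std::atomic<Node*> head_{nullptr};
public:
    void push(T v) {
        Node* n = new Node{v, head_.load(std::memory_order_relaxed)};
        // On CAS failure, n->next is refreshed with the current head.
        while (!head_.compare_exchange_weak(n->next, n,
                   std::memory_order_release, std::memory_order_relaxed)) {}
    }
    bool pop(T& out) {
        Node* n = head_.load(std::memory_order_acquire);
        while (n && !head_.compare_exchange_weak(n, n->next,
                   std::memory_order_acquire, std::memory_order_relaxed)) {}
        if (!n) return false;   // empty: no spinning, just "no work"
        out = n->value;
        delete n;               // naive reclamation; a real version needs
                                // hazard pointers or similar
        return true;
    }
};
```

Note there is no capacity, so push never fails; the "what do I do on false" question only exists on the pop side.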
-
Michael Gazonda wrote:
There would be no such issues of spinning/locking up the cpu
This is the part I struggle with. If the stack is full, how does the CPU not spin? Likewise, if the stack is empty, pop returns false. Then what?
-
With a stack, there is no "full". It's like a singly-linked list: each item points to the next and carries with it the space required for doing so. If the stack is empty and pop returns false, then it's up to you to handle that in whatever way is appropriate. For myself, I wrote this to handle memory allocation (where I would manually allocate on false), or as a messaging system where false just meant there was no work to do and I would wait on a condition_variable.
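The "wait on a condition_variable when pop returns false" pattern might be sketched like this. To keep the example short, a mutex-protected std::stack stands in for the lock-free stack (names are illustrative); the consumer-side pattern is the same either way: try to pop, and if that fails, park until a producer signals:

```cpp
#include <condition_variable>
#include <mutex>
#include <stack>

// Sketch of "false means no work, so sleep" handling. The consumer
// blocks with essentially zero CPU use while the container is empty.
struct WorkChannel {
    std::stack<int> items;          // placeholder for the lock-free stack
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    void push(int v) {
        { std::lock_guard<std::mutex> lk(m); items.push(v); }
        cv.notify_one();            // wake a parked consumer
    }
    bool pop_wait(int& out) {       // blocks until work exists or close()
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&]{ return !items.empty() || done; });
        if (items.empty()) return false;   // woken only because of done
        out = items.top(); items.pop();
        return true;
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_all();
    }
};
```

The fast path can still be the lock-free pop; the condition_variable is only the fallback for the empty case.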
-
Why not use one of the many open source message queues out there? This is exactly what they do. I'm assuming, of course, that you are trying to implement a circular buffer to solve a problem and not the other way around :).

* A message queue solves the producer/consumer problem for you.
* A message queue solves the distributed workload problem for you.
* A message queue can be configured to use memory, a database, files, etc. as a backing store, so you never lose a work request or response.
* Clustering, scaling, failover, etc. are all built in for you.

I used Apache ActiveMQ on a project with an arbitrary number of producers and about 400-500 consumers (VMwares) and it worked great. Producers would submit work requests and the 400-500 consumers would just cycle through the various queues looking for work they knew how to do. The cool thing about ActiveMQ is that it supports push notifications, so you don't use any CPU time during idle time.
-
Because I'm trying to understand lock-free algorithms. (A former colleague evaluated ActiveMQ and found it didn't scale well. Neither did RabbitMQ. For the previous app I worked on, both would have been completely inadequate.)
-
Yeah, that's why I was asking if you were trying to reinvent the wheel :). Just as an aside on ActiveMQ: it actually scales EXTREMELY well; you just need to configure it exactly right and use it exactly right. They have a TON of configuration options which are pretty poorly documented, IMO. Mis-setting just one of them can kill your performance.

One complaint I do have about it is that it doesn't handle a heavy onslaught of large messages very well. My original implementation tried to send 1MB to 2MB messages and it would crash often (probably a configuration issue, as we wanted to slam messages through and didn't care if they got lost). I then started compressing the messages with LZ4; they went down to about 400k to 500k, the system got pretty stable for only having one server and 400 to 500 clients, and performance was much better. They are actually tweaking the system right now to send really small messages that point to file shares to pick up the workload.

And nothing against you or your ability, but I'd be surprised if you are able to create a production system that can top one of the widely used message queues out there. At least not in a reasonable amount of time. I don't think I could either. That's all those guys do. If you are doing it as a learning process, well, that's a different story :). On the other hand, if you have to deliver a system that's going to scale to 1000 clients and millions of messages / day in a few weeks, well... :).
-
SledgeHammer01 wrote:
One complaint I do have about it is that it doesn't handle a heavy onslaught of large messages very well
Yup.
SledgeHammer01 wrote:
On the other hand, if you have to deliver a system that's going to scale to 1000 clients and millions of messages / day in a few weeks, well...
What, again, was your definition of scalable? :)
SledgeHammer01 wrote:
And nothing against you or your ability, but I'd be surprised if you are able to create a production system that can top one of the widely used message queues out there. At least not in a reasonable amount of time. I don't think I could either. That's all those guys do.
Actually, I have, though the use was very narrow and specialized. Again, our testing found that all the commercial solutions had performance bottlenecks or just plain fell apart when trying to use them with massive amounts of data. BTW, in one app, switching from ZLib to LZ4 was a significant performance gain that overwhelmed any decrease due to slightly larger message sizes.
-
Joe Woodbury wrote:
What, again, was your definition of scalable? :)
I meant if your task was to deliver that kind of scale from a home grown solution in a few weeks, that would be tough. If you managed to do that, that's very impressive.
Joe Woodbury wrote:
Actually, I have, though the use was very narrow and specialized. Again, our testing found that all the commercial solutions had performance bottlenecks or just plain fell apart when trying to use them with massive amounts of data. BTW, in one app, switching from ZLib to LZ4 was a significant performance gain that overwhelmed any decrease due to slightly larger message sizes.
Yes, there are performance penalties due to all the configuration options.

A slight aside on this topic: we found with our testing that sending large payloads through a message queue (and we originally had a home grown solution as well) is very inefficient. Say, for example, your payload is 1MB files. Your producer needs to:

1) load the file / deserialize it (or just load it as a binary blob)
2) possibly serialize it back to a binary blob and compress it
3) post it to a queue (whether your own or a commercial one)
4) possibly do some preprocessing on the blob before sending it over the wire

The client needs to:

5) get the large blob
6) decompress it
7) deserialize it
8) do the work

It is more efficient to put the payload on a file share and just send the UNC path to the file share in the message. This way you get rid of the whole compression / decompression step and the multiple serialize / deserialize steps. In our project, we have to process 4 BILLION 1MB payloads, so...

ANYWAYS... back to your original question. Whatever you use, the producer "send" should be instant fire-and-forget, and this includes if it needs to get a response; it should do that in an async callback (push notification). There should be pretty much 0% CPU usage when the producer is idle, and it should never be blocked: if it has something to submit, it should be able to at any time. The server needs to be able to deal with a flood. Unless your server has a ton of memory and you are guaranteed the clients can keep up at any load, you'd need a backing store of some kind for when memory fills up. As far as the client side goes, it sounds like you are polling your circular queue for work. NOTHING in your design should be polling. Polling is a big no-no. Everything should be async push notifications. Sleep / poll is not a good design. Should be push / WaitForSingleObject. WaitForSingleObject does not consume any CPU resources.
-
SledgeHammer01 wrote:
As far as the client side goes, it sounds like you are polling your circular queue for work. NOTHING in your design should be polling. Polling is a big no-no. Everything should be async push notifications. Sleep / poll is not a good design. Should be push / WaitForSingleObject. WaitForSingleObject does not consume any CPU resources.
That's the point of my question. When doing a lock-free algorithm, how do you avoid polling (due to starvation or saturation)? In the discussions of the well-known circular queue, this is never actually answered--it assumes that a natural balance will always be present. For the record, my various solutions have used locks.
-
Joe Woodbury wrote:
That's the point of my question. When doing a lock-free algorithm, how do you avoid polling (due to starvation or saturation)? In the discussions of the well-known circular queue, this is never actually answered--it assumes that a natural balance will always be present. For the record, my various solutions have used locks.

I'm not familiar with your requirements, but a circular buffer wouldn't be my first choice for this (unless your requirement is to discard data when the buffer is full); I would use something more like a queue. Right now your code probably looks like this:

while (true)
{
    someClass theTask;
    if (buffer.GetWork(out theTask))
    {
        DoWork(theTask);
    }
}

or something of that nature. That is a *very* tight loop when no work is available, and it will kill the CPU. A hack in the polling design is to add a sleep to alleviate the pressure on the CPU. You'll still have some CPU usage, but it'll be much less. The drawback, of course, is that your client can't do any work during the sleep time, which is a huge waste of time: if you slept for 1000ms and a message arrived 10ms in, your client wouldn't do anything for another 990ms.

A push / no-polling design would be one where your buffer class exposes a waitable handle, such as an event. Now your client code looks like:

while (true)
{
    WaitForSingleObject(buffer.theEvent);
    DoWork(buffer.GetWork());
}

or something like that. The WaitForSingleObject() will block with 0% CPU usage; it will only return when the event is set. So your buffer class would set the event whenever data is inserted into the queue (in the insert method), and the GetWork() method would reset the event when there are no remaining items (idle time). Not sure how you are sending over the network, but sockets support async push notifications, so you'd have to make sure you use that as well. In the socket async callback, you'd insert into the queue, which would set the event.
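WaitForSingleObject is Win32-specific, but the same push design can be sketched portably with a small semaphore built from standard primitives (class and member names here are illustrative). Producers signal once per inserted item; the consumer's wait blocks with no CPU use until work exists, which is the role the waitable handle plays above:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

// Counting semaphore built from a mutex + condition_variable
// (portable stand-in for a Win32 event/semaphore handle).
class Semaphore {
    std::mutex m;
    std::condition_variable cv;
    unsigned count = 0;
public:
    void signal() {
        { std::lock_guard<std::mutex> lk(m); ++count; }
        cv.notify_one();
    }
    void wait() {                           // blocks instead of spinning
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&]{ return count > 0; });
        --count;
    }
};

template <typename T>
class WorkQueue {
    std::queue<T> q;                        // stand-in for the real buffer
    std::mutex m;
    Semaphore sem;
public:
    void insert(T v) {                      // producer side: fire and forget
        { std::lock_guard<std::mutex> lk(m); q.push(v); }
        sem.signal();                       // one signal per item inserted
    }
    T get_work() {                          // consumer side: 0% CPU when idle
        sem.wait();                         // returns once per inserted item
        std::lock_guard<std::mutex> lk(m);
        T v = q.front(); q.pop();
        return v;
    }
};
```

Counting one signal per item avoids the set/reset bookkeeping entirely: the consumer never has to decide whether the "work available" flag should be cleared.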
-
No offense, but I already know what you're writing--I've written several very high performance algorithms to solve this. My question is purely about how various lock-free algorithms handle starvation/saturation. Various articles on lock-free algorithms make the bold statement that the obvious polling condition, with its accompanying CPU saturation, will never occur, but never explain why not.
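(For reference, one common pattern that addresses exactly this, sketched with illustrative names and constants: keep the lock-free operation as the fast path, but bound the spin. After a fixed number of failed tries the consumer publishes a "sleeper" flag and parks on a condition_variable; producers pay for a notification only when that flag is set, so the lock-free path stays lock-free in the balanced case and degrades to blocking under starvation.)

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>

// Bounded-spin-then-park helper. try_pop is any non-blocking,
// lock-free attempt that returns true on success.
class ParkingLot {
    std::mutex m;
    std::condition_variable cv;
    std::atomic<bool> sleeper{false};
public:
    // Consumer side: spin on the lock-free fast path, then sleep.
    template <typename TryPop>
    bool pop_or_park(TryPop try_pop, int spin_limit = 100) {
        for (int i = 0; i < spin_limit; ++i)
            if (try_pop()) return true;       // lock-free fast path
        std::unique_lock<std::mutex> lk(m);
        sleeper.store(true);
        while (!try_pop()) cv.wait(lk);       // ~0% CPU while starved
        sleeper.store(false);
        return true;
    }
    // Producer side: call after a successful push. Usually a cheap
    // atomic load; the lock + notify happens only when someone sleeps.
    void wake() {
        if (sleeper.load()) {
            std::lock_guard<std::mutex> lk(m);  // serialize vs. the waiter
            cv.notify_one();
        }
    }
};
```

The same idea, mirrored, bounds producer spinning on a full buffer, which also answers the "how long do they spin" question with "a fixed, cheap count, then a real wait" rather than clock reads.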