I must be doing something wrong...
-
Running a process on one thread -- seven minutes. Splitting the process onto two threads -- six minutes each. Splitting the process onto four threads -- five minutes each. Most of the time spent seems to be the threads waiting on a lock on a shared resource. :sigh: Too much for a Friday afternoon, I'll get back on it on Monday. It must be something really stupid.
-
A study mate of mine went to a job to implement, or rather port, a database system on the VAX. What they had for development and testing was a 730, probably the slowest VAX ever manufactured. They put a large, green "Turtle Wax" sticker on its front panel. My good friend timed a couple of operations: blank-filling an 80-character buffer before reading input into it took, for the blank filling alone, 20 milliseconds. No, not microseconds, but milliseconds. A process switch (I am quite sure they were running VAX/VMS; I don't think any alternative was available in 1983) took 100 ms, i.e. 1/10 of a second.

This DBMS was designed as three processes with a lot of interaction between them. Or rather, with as much interaction as was possible given that process switch time. That wasn't a lot! They had to chop one of the processes into pieces and distribute it over the two other processes, so that the interacting parts were (at least to a much higher degree) running in the same process, with no process switch necessary. I don't know the details of how they did it, but I suspect they used a home-made coroutine switch, just loading the stack pointer and program counter of the communication partner to have the message processed. (The project leader was a professor of compiler construction, well versed in register and stack management.)

So the answer is sometimes not to increase the number of threads/processes, but exactly the opposite! If your task is 100% CPU bound, and only one CPU is available, then introducing the overhead of thread switching will never speed things up. On the other hand, you rarely see 100% CPU bound tasks, and nowadays, you rarely see a machine with only 1 CPU available. For the VAX 730, it seemed like you had significantly less than 1 CPU available :-)
-
I'm twenty years OpenVMS clean now. (Other than some minor dabbling, which I can't even do now that HP has killed the hobbyist program again.) A similar process I have is I/O-bound -- writing the results to disk -- so multi-threading won't help with that one. But this one is writing to the database (SQL Server, SqlBulkCopy):

* Read 1000 objects
* Prepare each object to write to the database
* Write the objects to the database
* Repeat as needed (250000+ total objects)

I timed the reading and writing portions; the total time for writing was very low. The reading part is pulling DNS data from domain controllers via LDAP, so I expected the reading to be what takes the time, and I'm astonished that I'm not seeing much time spent doing the actual reading. Anyway, it can wait for Monday.
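The loop is roughly this shape (a minimal sketch; `ReadNextPage`, `DnsRecord`, and the table layout are stand-ins I've made up, not the real code):

```csharp
using System.Collections.Generic;
using System.Data;
using Microsoft.Data.SqlClient;   // SqlBulkCopy

record DnsRecord(string Name, string Data);

class Pipeline
{
    // Stand-in for the paged LDAP search against the domain controller.
    static List<DnsRecord> ReadNextPage(int pageSize) => new();

    static void Run(string connectionString)
    {
        while (true)
        {
            // 1. Read up to 1000 objects from the domain controller.
            var page = ReadNextPage(pageSize: 1000);
            if (page.Count == 0) break;

            // 2. Prepare: flatten each object into a row.
            var table = new DataTable();
            table.Columns.Add("Name", typeof(string));
            table.Columns.Add("Data", typeof(string));
            foreach (var rec in page)
                table.Rows.Add(rec.Name, rec.Data);

            // 3. Bulk-write the page (this is the part that timed as cheap).
            using var bulk = new SqlBulkCopy(connectionString)
            {
                DestinationTableName = "dbo.DnsRecords"
            };
            bulk.WriteToServer(table);
        }
    }
}
```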
-
Obviously, your DNS server (the one you are reading from) can only handle a single request at a time. Somewhere behind the curtain, to do a read you must get hold of a semaphore. It may be so far behind the curtain that it is called a critical region or monitor, but in any case, it boils down to gaining control of a semaphore. Maybe the semaphore handling (or region/monitor entering and leaving) really takes a lot of time, but your single-thread solution doesn't notice, because it has obtained the data and is busy writing it to the database. Yet the one(s) waiting in line won't obtain the semaphore until all the behind-the-scenes release work is completed. If the waiting process has to poll for the semaphore, he won't get it immediately when it is freed. Maybe he comes a second or two later, and that time is lost. If that happens for every LDAP request, it adds up!

Note that everywhere but in the *nix community, programmers have known of queueing semaphores (so you don't have to poll), and of regions and monitors based on queueing semaphores, since the mid/late 1970s. *nix had nothing like it, except "Let us create a file, whose existence indicates 'resource reserved'!" That is an extremely resource-demanding semaphore compared to OS implementations, and it is binary (queue-less), so you have to poll it. *nix programmers grew up without knowing anything about proper synchronization. When OS-based semaphores were finally introduced to *nix, they were at first binary; you still had to do the polling. And *nix programmers were very reluctant to start using them. Even today, it seems like young programmers know very little about the proper use of semaphores, regions, and monitors, and regularly use binary semaphores as something like a voluntary Post-It: 'Please be so kind as not to touch these data while I am working on them, will you?' Both putting up the Post-It and reading it is sort of voluntary. (I learned to build non-voluntary, efficient queueing semaphores, critical regions and monitors from the 1973 Brinch Hansen book 'Operating System Principles' - it is all there, from 50 years ago. Obviously, not a single *nix programmer I have ever met has so much as heard of that book. Or any other describing the same solutions.)

So my guess is that you have come across a really poor implementation of resource protection, probably based on polling a binary semaphore. When the second process finally gets around to grabbing it, the first one is almost ready to enter his next round, a split second too late; #2 took the semaphore just ahead of him.
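To make the distinction concrete in .NET terms, here is a rough sketch (the flag, the timings, and the names are all invented for illustration):

```csharp
using System.Threading;

class SemaphoreStyles
{
    // Poll-style "binary semaphore": the waiter burns cycles re-checking,
    // and always arrives up to one polling interval late after release.
    static volatile bool held;

    static void AcquireByPolling()
    {
        while (held)
            Thread.Sleep(50);   // can lose up to 50 ms per acquisition
        held = true;            // (and this check-then-set is itself racy!)
    }

    static void ReleasePolled() => held = false;

    // Queueing-style: the OS parks the waiter and wakes it on Release(),
    // so no polling interval is lost. (SemaphoreSlim does not promise
    // strict FIFO wake-up order, but waiters are woken, not polling.)
    static readonly SemaphoreSlim gate = new(initialCount: 1);

    static void UseResourceBlocking()
    {
        gate.Wait();                        // block until the holder releases
        try { /* touch the shared data */ }
        finally { gate.Release(); }
    }
}
```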
-
trønderen wrote:
Obviously, your DNS server (the one you are reading from) can only handle a single request at a time
No, that isn't the case. The domain servers can serve concurrent requests. And anyway, this is only one request -- which returns 250000+ objects in 1000-object pages. I did expect that the bottleneck would be in retrieving each object (or page), but my tests haven't shown that -- probably invalid tests. All I'm seeing is that
lock ( thingy )
(in C#) seems to be taking a lot of time. I have even tried to reduce the number of locks by locking once and reading 1000 objects before releasing the lock (roughly as in the sketch below), but that didn't help either. Anyway, this post isn't about trying to find out what I did wrong.
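The coarse-grained variant looks roughly like this (a sketch; `ReadOneObject`, `PrepareAndWrite`, and `LdapObject` are stand-ins for the real calls):

```csharp
using System.Collections.Generic;

record LdapObject;

class CoarseGrainedReader
{
    private readonly object thingy = new();   // the shared-resource lock

    // Stand-ins for the real per-object LDAP read and the DB write.
    private LdapObject? ReadOneObject() => null;
    private void PrepareAndWrite(List<LdapObject> batch) { }

    public void ReadPage()
    {
        var batch = new List<LdapObject>(1000);

        // Take the lock once per 1000-object page instead of once per object.
        lock (thingy)
        {
            for (int i = 0; i < 1000; i++)
            {
                var obj = ReadOneObject();
                if (obj is null) break;
                batch.Add(obj);
            }
        }

        // Prepare and write outside the lock so other threads can read.
        PrepareAndWrite(batch);
    }
}
```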
-
trønderen wrote:
you rarely see 100% CPU bound tasks
Where I work, a number of us devs regularly see 100% CPU usage from the Windows Subsystem for Linux process, so we have to end that process and allow Docker to restart it - the joys of Docker...
“That which can be asserted without evidence, can be dismissed without evidence.”
― Christopher Hitchens
-
If the shared resource is something like a queue, try a [lock-free](https://en.wikipedia.org/wiki/Non-blocking_algorithm) solution. This eliminates many of the problems involved with synchronization between threads, at the cost of considerably more complicated algorithms.
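In .NET specifically, `System.Collections.Concurrent.ConcurrentQueue<T>` already gives you a non-blocking queue, so you may not have to hand-roll one. A minimal sketch (the producer/consumer split and `Process` are invented for illustration):

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

var queue = new ConcurrentQueue<string>();

// Producer: enqueue work items with no lock statement anywhere.
var producer = Task.Run(() =>
{
    for (int i = 0; i < 1000; i++)
        queue.Enqueue($"object {i}");
});

// Consumer: TryDequeue never blocks; a real consumer would back off
// or use BlockingCollection<T> instead of spinning like this.
var consumer = Task.Run(() =>
{
    while (!producer.IsCompleted || !queue.IsEmpty)
        if (queue.TryDequeue(out var item))
            Process(item);
});

await Task.WhenAll(producer, consumer);

static void Process(string item) { /* prepare + batch for the DB */ }
```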
Freedom is the freedom to say that two plus two make four. If that is granted, all else follows. -- 6079 Smith W.
-
It is. I'll have to have a look at that.
-
Here's a link to a paper about them. There may be better solutions out there... [https://www.cs.rochester.edu/~scott/papers/1996_PODC_queues.pdf](https://www.cs.rochester.edu/~scott/papers/1996_PODC_queues.pdf)
Freedom is the freedom to say that two plus two make four. If that is granted, all else follows. -- 6079 Smith W.
-
[Amdahl's law](https://en.wikipedia.org/wiki/Amdahl%27s_law) is there for a reason.
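For what it's worth, the numbers in the original post fit Amdahl's law rather well: speedup(n) = 1 / ((1 - p) + p/n), where p is the parallelizable fraction. Fitting p from the two-thread timing (7 min down to 6 min) gives p = 2/7, which then predicts about 5.5 minutes on four threads, close to the five observed. A quick sanity check (sketch):

```csharp
using System;

class AmdahlCheck
{
    // Amdahl's law: speedup with n workers when only fraction p is parallel.
    static double Speedup(double p, int n) => 1.0 / ((1.0 - p) + p / n);

    static void Main()
    {
        const double serialMinutes = 7.0;
        const double p = 2.0 / 7.0;   // fraction implied by the 2-thread timing

        foreach (int n in new[] { 1, 2, 4, 8 })
            Console.WriteLine($"{n} thread(s): ~{serialMinutes / Speedup(p, n):F1} min");
        // Prints ~7.0, ~6.0, ~5.5, ~5.3 -- so extra threads buy very little here.
    }
}
```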