I must be doing something wrong...

#1 - PIEBALDconsult

Running a process on one thread -- seven minutes. Splitting the process onto two threads -- six minutes each. Splitting the process onto four threads -- five minutes each. Most of the time spent seems to be the threads waiting on a lock on a shared resource. :sigh: Too much for a Friday afternoon, I'll get back on it on Monday. It must be something really stupid.
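
A minimal C# sketch of the symptom described above, with made-up workloads: when most of each iteration runs under one shared lock, the threads mostly queue up behind each other, so the elapsed time barely drops as threads are added.

```csharp
// Hypothetical repro: N workers funnel through one lock, so the protected
// section runs one thread at a time no matter how many threads exist.
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class LockContentionDemo
{
    static readonly object Gate = new object();

    static void Worker(int items)
    {
        for (int i = 0; i < items; i++)
        {
            lock (Gate)                     // the shared resource
            {
                Thread.SpinWait(200_000);   // stand-in for the protected work
            }
            Thread.SpinWait(20_000);        // small amount of independent work
        }
    }

    static void Main()
    {
        foreach (int threads in new[] { 1, 2, 4 })
        {
            var sw = Stopwatch.StartNew();
            var tasks = new Task[threads];
            for (int t = 0; t < threads; t++)
                tasks[t] = Task.Run(() => Worker(10_000 / threads));
            Task.WaitAll(tasks);
            Console.WriteLine($"{threads} thread(s): {sw.Elapsed}");
        }
    }
}
```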

#2 - 0x01AA (in reply to #1)

Yes, it looks like you're doing it wrong, sorry. OK, if the problem is a shared resource then you are not basically wrong. But if you are trying to use all the CPU power of your machine, don't use threads, use parallel processes.
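
A minimal sketch of that suggestion, assuming a hypothetical Worker.exe that takes its slice of the input on the command line: each worker gets its own process (and its own locks), and the parent just waits for them all.

```csharp
// Hypothetical launcher for the "parallel processes" approach.
// "Worker.exe" and its --slice/--of arguments are made-up names.
using System;
using System.Diagnostics;
using System.Linq;

class ParallelProcesses
{
    static void Main()
    {
        const int workers = 4;
        var procs = Enumerable.Range(0, workers)
            .Select(i => Process.Start(new ProcessStartInfo
            {
                FileName = "Worker.exe",
                Arguments = $"--slice {i} --of {workers}",
                UseShellExecute = false
            }))
            .ToArray();

        foreach (var p in procs)
            p.WaitForExit();

        Console.WriteLine("all worker processes finished");
    }
}
```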

#3 - trønderen (in reply to #1)

A study mate of mine went to a job to implement, or rather port, a database system on the VAX. What they had for development and testing was a 730, probably the slowest VAX ever manufactured. They put a large, green "Turtle Wax" sticker on its front panel. My good friend timed a couple of operations: blank-filling an 80 character buffer before reading an input into it took, for the blank filling alone, 20 milliseconds. No, not microseconds, but milliseconds. A process switch (I am quite sure they were running VAX VMS; I don't think any alternative was available in 1983) took 100 ms, i.e. 1/10 of a second.

This DBMS was designed as three processes with a lot of interaction between them. Or rather, as much interaction as was possible given that process switch time. That wasn't a lot! They had to chop one of the processes into pieces and distribute it over the two other processes, so that the interacting parts were (at least to a much higher degree) running in the same process, with no process switch necessary. I don't know the details of how they did it, but I suspect they used a home-made coroutine switch, just loading the stack pointer and program counter of the communication partner to have the message processed. (The project leader was a professor of compiler construction, well versed in register and stack management.)

So the answer is sometimes not to increase the number of threads/processes, but exactly the opposite! If your task is 100% CPU bound, and only 1 CPU is available, then introducing the overhead of thread switching will never speed things up. On the other hand, you rarely see 100% CPU bound tasks, and nowadays you rarely see a machine with only 1 CPU available. For the VAX 730, it seemed like you had significantly less than 1 CPU available :-)

#4 - PIEBALDconsult (in reply to #3)

I'm twenty years OpenVMS clean now. (Other than some minor dabbling, which I can't even do now that HP has killed the hobbyist program again.) A similar process I have is I/O-bound -- writing the results to disk -- so multi-threading won't help with that one. But this one is writing to the database (SQL Server, SqlBulkCopy):

* Read 1000 objects
* Prepare each object to write to the database
* Write the objects to the database
* Repeat as needed (250000+ total objects)

I timed the reading and the writing portions; the total time for writing was very low. The reading part is reading DNS data from domain controllers via LDAP, so I was expecting the reading to take the time, and I am astonished that I'm not seeing much time spent doing the actual reading. Anyway, it can wait for Monday.
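
One way to overlap those steps is a small producer/consumer pipeline; the sketch below uses hypothetical names (ReadNextPage, dbo.DnsObjects) for the existing paging code and destination table, and lets a separate task feed SqlBulkCopy while the reader keeps pulling 1000-object pages.

```csharp
// Sketch only: ReadNextPage() stands in for the LDAP paging/prepare code,
// and "dbo.DnsObjects" is a placeholder destination table.
using System.Collections.Concurrent;
using System.Data;
using System.Data.SqlClient;
using System.Threading.Tasks;

class LdapToSqlPipeline
{
    static void Run(string connectionString)
    {
        var pages = new BlockingCollection<DataTable>(boundedCapacity: 4);

        var reader = Task.Run(() =>
        {
            DataTable page;
            while ((page = ReadNextPage()) != null)   // 1000 objects per page
                pages.Add(page);                      // blocks if the writer falls behind
            pages.CompleteAdding();
        });

        var writer = Task.Run(() =>
        {
            using (var bulk = new SqlBulkCopy(connectionString) { DestinationTableName = "dbo.DnsObjects" })
            {
                foreach (var page in pages.GetConsumingEnumerable())
                    bulk.WriteToServer(page);
            }
        });

        Task.WaitAll(reader, writer);
    }

    static DataTable ReadNextPage() => null;          // placeholder for the real LDAP read
}
```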

#5 - trønderen (in reply to #4)

Obviously, your DNS server (the one you are reading from) can only handle a single request at a time. Somewhere behind the curtain, to do a read you must get hold of a semaphore. It may be so far behind the curtain that it is called a critical region or a monitor, but in any case, it boils down to gaining control of a semaphore. Maybe the semaphore handling (or region/monitor entering and leaving) really takes a lot of time, but your single-thread solution doesn't notice, because it has obtained the data and is busy writing it to the database. Yet the one(s) waiting in line won't obtain the semaphore until all the behind-the-scenes release work is completed. If the waiting process has to poll for the semaphore, he won't get it immediately when it is freed. Maybe he comes a second or two later, and that time is lost. If that happens for every LDAP request, it adds up!

Note that everywhere but in the *nix community, programmers knew of queueing semaphores (so you didn't have to poll), and of regions and monitors based on queueing semaphores, from the mid/late 1970s. *nix had nothing like it, except "Let us create a file whose existence indicates 'resource reserved'!" That is an extremely resource-demanding semaphore, compared to OS implementations, and it is binary (queue-less), so you have to poll it. *nix programmers grew up without knowing anything about proper synchronization. When OS-based semaphores were finally introduced to *nix, they were at first binary; you still had to do the polling. And *nix programmers were very reluctant to start using them.

Even today, it seems like young programmers know very little about the proper use of semaphores, regions, and monitors, and regularly use binary semaphores as something like a voluntary Post-It: 'Please be so kind not to touch these data while I am working on them, will you?' Both putting up the Post-It and reading it are sort of voluntary. (I learned to build non-voluntary, efficient queueing semaphores, critical regions and monitors from the 1973 Brinch Hansen book 'Operating System Principles' -- it is all there, from 50 years ago. Obviously, not a single *nix programmer I have ever met has as much as heard of that book. Or any other describing the same solutions.)

So my guess is that you have come across a really poor implementation of resource protection, probably based on polling a binary semaphore. When the second process finally gets around to grabbing it, the first one is almost ready to enter his next round, a split second too late; #2 took the semaphore.
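
A rough C# illustration of the contrast drawn here -- not the LDAP or database code, just the two waiting styles: polling a binary flag loses the gap between the release and the next poll, while a queueing wait (SemaphoreSlim, in .NET) hands the permit to a parked waiter as soon as Release is called.

```csharp
using System.Threading;

class SemaphoreStyles
{
    // Poll-style binary "semaphore": check a flag, sleep, try again.
    // The release can sit unnoticed for up to a full poll interval.
    static int reserved = 0;
    static void AcquireByPolling()
    {
        while (Interlocked.CompareExchange(ref reserved, 1, 0) != 0)
            Thread.Sleep(100);   // up to 100 ms lost after every release
    }
    static void ReleaseByPolling() => Interlocked.Exchange(ref reserved, 0);

    // Queueing style: the waiter blocks and is woken directly by Release().
    static readonly SemaphoreSlim gate = new SemaphoreSlim(1, 1);
    static void AcquireQueued() => gate.Wait();
    static void ReleaseQueued() => gate.Release();
}
```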

#6 - PIEBALDconsult (in reply to #5)

trønderen wrote:

> Obviously, your DNS server (the one you are reading from) can only handle a single request at a time

No, that isn't the case. The domain servers can serve concurrent requests. And anyway, this is only one request -- which returns 250000+ objects in 1000-object pages. I did expect that the bottleneck would be in retrieving each object (or page), but my tests haven't shown that -- probably invalid tests. All I'm seeing is that `lock (thingy)` (in C#) seems to be taking a lot of time. I have even tried to reduce the number of locks by locking once and reading 1000 objects before releasing the lock, but that didn't help either. Anyway, this post isn't about trying to find out what I did wrong.
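
One way to confirm where the time actually goes is to time the wait for `lock (thingy)` separately from the time spent holding it; a small sketch, assuming the per-page work sits in a method like the hypothetical ReadPage below:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class LockTiming
{
    static readonly object thingy = new object();   // the shared lock from the post
    static long waitMs, workMs;                     // accumulated across all threads

    static void ReadPage()
    {
        var sw = Stopwatch.StartNew();
        Monitor.Enter(thingy);                      // what lock (thingy) does under the hood
        try
        {
            Interlocked.Add(ref waitMs, sw.ElapsedMilliseconds);
            sw.Restart();
            // ... read the next 1000 objects here ...
            Interlocked.Add(ref workMs, sw.ElapsedMilliseconds);
        }
        finally
        {
            Monitor.Exit(thingy);
        }
    }

    static void Report() =>
        Console.WriteLine($"waiting for the lock: {waitMs} ms, holding it: {workMs} ms");
}
```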

#7 - GuyThiebaut (in reply to #3)

trønderen wrote:

> you rarely see 100% CPU bound tasks

Where I work, a number of us devs regularly see 100% CPU usage from the Windows Subsystem for Linux process, so we have to end that process and let Docker restart it - the joys of Docker...

“That which can be asserted without evidence, can be dismissed without evidence.”

― Christopher Hitchens

#8 - Daniel Pfeffer (in reply to #1)

If the shared resource is something like a queue, try a [lock-free](https://en.wikipedia.org/wiki/Non-blocking_algorithm) solution. This eliminates many of the problems involved with synchronization between threads, at the cost of much more complicated algorithms.

Freedom is the freedom to say that two plus two make four. If that is granted, all else follows. -- 6079 Smith W.
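
For a queue specifically, .NET already ships a non-blocking implementation, System.Collections.Concurrent.ConcurrentQueue<T>. A minimal sketch of using it in place of a lock around a Queue<T> (DnsRecord is a placeholder type):

```csharp
using System.Collections.Concurrent;

class DnsRecord { /* fields elided */ }

class RecordQueue
{
    // ConcurrentQueue<T> is the BCL's lock-free FIFO; no lock statement needed around it.
    readonly ConcurrentQueue<DnsRecord> queue = new ConcurrentQueue<DnsRecord>();

    public void Produce(DnsRecord record) => queue.Enqueue(record);

    // TryDequeue never blocks; it just reports whether an item was available.
    public bool TryConsume(out DnsRecord record) => queue.TryDequeue(out record);
}
```

When the consumer should sleep while the queue is empty rather than spin, BlockingCollection<T> (which wraps a ConcurrentQueue<T> by default) provides that behaviour on top of the same structure.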

#9 - PIEBALDconsult (in reply to #8)

It is. I'll have to have a look at that.

#10 - Daniel Pfeffer (in reply to #9)

Here's a link to a paper about them. There may be better solutions out there... [https://www.cs.rochester.edu/~scott/papers/1996_PODC_queues.pdf](https://www.cs.rochester.edu/~scott/papers/1996_PODC_queues.pdf)

Freedom is the freedom to say that two plus two make four. If that is granted, all else follows. -- 6079 Smith W.

#11 - Bohdan Stupak (in reply to #1)

Amdahl's law is there for a reason.
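
For the record: Amdahl's law says that if a fraction p of the work can be parallelized, n workers give a speedup of at most 1 / ((1 - p) + p / n). A quick sketch with illustrative values of p (not measurements from this thread) shows how fast that flattens out:

```csharp
using System;

class Amdahl
{
    // Maximum speedup with n workers when a fraction p of the work parallelizes.
    static double Speedup(double p, int n) => 1.0 / ((1.0 - p) + p / n);

    static void Main()
    {
        foreach (double p in new[] { 0.5, 0.75, 0.9 })
            Console.WriteLine($"p={p:0.00}: x2={Speedup(p, 2):0.00}, x4={Speedup(p, 4):0.00}, x8={Speedup(p, 8):0.00}");
    }
}
```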
