A piece of criticism on "to go fast, do less"

The Lounge
Tags: css, data-structures, architecture, tutorial, learning
41 posts, 17 posters
  OriginalGriff wrote:

    Very true: the days when you could optimise code or data just based on the actual instructions and their relative cycle times are long gone! Level 1 & 2 caches, pipelines and multiple core execution make a lot more difference than raw processor power. Ah! For the days when you could double the performance of your PC just by changing the HDD interleave... That used to be my "Friday-afternoon-look-busy" job - while I played minesweeper :-O

    Real men don't use instructions. They are only the manufacturer's opinion on how to put the thing together. Manfred R. Bihy: "Looks as if OP is learning resistant."

    David1987 wrote (#3):

    Or the days when self modifying code was a Good Thing and you had no multiply instruction :)

      OriginalGriff wrote (#4), in reply to David1987:

      No, no. I never wrote self modifying code. Not even to allow me a dynamic bit set instruction. No. Not me. I wouldn't do such a thing. Well, someone else must have put my name on the change logs.

        Quoting David1987's opening post:

        Many of you have probably heard of it... And it's false. It used to be true when CPUs were simple in-order, non-superscalar machines that did exactly what you told them, in exactly that order. These days there is still a maximum number of instructions per second, but you almost never reach the theoretical maximum throughput. Using the full capacity of a CPU is hard work.

        So in order to go fast, doing less isn't the only option - in fact, the fastest way may look like it's doing more (or, in some cases, actually do more). For example, when you're taking the sum of an array of integers, it is usually faster (it depends on the CPU micro-architecture, of course) to use more than one accumulator.

        But that's not all. In order to go fast these days, try to avoid cache misses. They're far more often the bottleneck than the CPU, and all of the above is only valid when the CPU actually is the bottleneck. Trying to avoid cache misses isn't really "doing less" either - a cache miss costs the same no matter how many bytes of the cache line you end up using, so often you can "do smarter" by reorganizing your data and/or trading cache misses for more instructions (which are far cheaper). To go fast, do smarter.
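The multiple-accumulator trick described above can be sketched in C. This is an illustrative sketch (the function names are mine); as the post says, the payoff depends on the core's add latency and issue width:

```c
#include <stddef.h>
#include <stdint.h>

/* One accumulator: each addition depends on the previous one, so the
   loop can run no faster than one element per add-latency. */
int64_t sum1(const int32_t *a, size_t n) {
    int64_t s = 0;
    for (size_t i = 0; i < n; i++) s += a[i];
    return s;
}

/* Four independent accumulators: four separate dependency chains that a
   superscalar, out-of-order core can execute in parallel. More "work"
   in the source, but usually faster. */
int64_t sum4(const int32_t *a, size_t n) {
    int64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++) s0 += a[i];  /* leftovers */
    return (s0 + s1) + (s2 + s3);
}
```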

        Gary Wheeler wrote (#5):

        "To go fast, do less" still works as a general rule if you are optimizing the performance of an algorithm. Unless you're working down on the bare metal (drivers, embedded apps, implementing a runtime), I don't think taking processor considerations into account is effective for most applications.

        Software Zen: delete this;

          David1987 wrote (#6), in reply to Gary Wheeler:

          As a general rule, sure. Doing unnecessary work isn't the way to get top performance :) Still, it turns out that QuickSort is usually faster than MergeSort even though it "shouldn't be", and, worse, Insertion Sort is faster for short lists - the reasons for that are "processor considerations", but should that be a reason to choose the slower algorithm anyway? I actually had to take processor considerations into account when working on a badminton tournament manager - and I'm sure it happens all the time in games, media transcoding, and compression applications.
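A sketch of the kind of hybrid this alludes to: quicksort that stops recursing on small partitions, with one insertion-sort pass to finish the nearly-sorted array. The cutoff of 16 and all names here are my choices, not from the post; the best cutoff is machine-dependent.

```c
#include <stddef.h>

/* Insertion sort: O(n^2), but branch-friendly and cache-friendly,
   which is why it wins on short runs despite the worse big-O. */
static void insertion_sort(int *a, size_t n) {
    for (size_t i = 1; i < n; i++) {
        int key = a[i];
        size_t j = i;
        while (j > 0 && a[j - 1] > key) { a[j] = a[j - 1]; j--; }
        a[j] = key;
    }
}

/* Quicksort over a[lo..hi] (inclusive) that leaves small partitions
   unsorted; each small run is already in its correct global position,
   so a single cheap insertion-sort pass afterwards finishes the job. */
static void quicksort_coarse(int *a, long lo, long hi) {
    if (hi - lo + 1 <= 16) return;             /* leave small runs alone */
    long mid = lo + (hi - lo) / 2;
    int t = a[mid]; a[mid] = a[hi]; a[hi] = t; /* middle element as pivot */
    int pivot = a[hi];
    long i = lo;                               /* Lomuto partition */
    for (long j = lo; j < hi; j++)
        if (a[j] < pivot) { t = a[i]; a[i] = a[j]; a[j] = t; i++; }
    t = a[i]; a[i] = a[hi]; a[hi] = t;
    quicksort_coarse(a, lo, i - 1);
    quicksort_coarse(a, i + 1, hi);
}

void hybrid_sort(int *a, size_t n) {
    if (n > 1) {
        quicksort_coarse(a, 0, (long)n - 1);
        insertion_sort(a, n);
    }
}
```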

            Guyverthree wrote (#7), in reply to David1987's opening post:

            In terms of mobile apps, such as those on Windows Mobile devices, the processor is often the bottleneck - and I say this from experience. But it's not just raw computation you can do less of: you can also use the stack to do less, for example by removing repeated pointer indirection.
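The repeated-indirection point can be sketched like this (the types and names are hypothetical, purely for illustration):

```c
typedef struct { int x, y; } Point;
typedef struct { Point *pos; } Sprite;

/* Before: sprite->pos is chased through memory on every iteration,
   unless the compiler can prove it never changes. */
void nudge_slow(Sprite *sprite, int n) {
    for (int i = 0; i < n; i++) {
        sprite->pos->x += 1;
        sprite->pos->y += 1;
    }
}

/* After: the indirection is done once; the local pointer can live in a
   register for the whole loop. */
void nudge_fast(Sprite *sprite, int n) {
    Point *p = sprite->pos;
    for (int i = 0; i < n; i++) {
        p->x += 1;
        p->y += 1;
    }
}
```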

            James Binary Warrior.

              peterchen wrote (#8), in reply to David1987's opening post:

              That was the first thing that came to my mind when reading that: the CPU is not the bottleneck anymore. However, the core idea still holds: to be fast, do less out-of-L1-cache access.
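One concrete way to "do less out-of-L1-cache access" is the data reorganization the opening post mentions: arrange data so every byte of a fetched cache line gets used. A classic sketch is array-of-structs versus struct-of-arrays (the types here are illustrative):

```c
#include <stddef.h>

/* Array-of-structs: summing just x drags the unused fields through the
   cache - only 4 of every 32 bytes fetched are useful. */
typedef struct { float x, y, z; int id; float pad[4]; } ParticleAoS;

float sum_x_aos(const ParticleAoS *p, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) s += p[i].x;
    return s;
}

/* Struct-of-arrays: all the x values are contiguous, so every byte of
   every cache line fetched is used. Same answer, fewer misses. */
typedef struct { float *x, *y, *z; int *id; } ParticlesSoA;

float sum_x_soa(const ParticlesSoA *p, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) s += p->x[i];
    return s;
}
```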

              FILETIME to time_t
              | FoldWithUs! | sighist | WhoIncludes - Analyzing C++ include file hierarchy

                Gary Wheeler wrote (#9), in reply to David1987:

                David1987 wrote:

                badminton tournament manager

                :omg: Imagine what you'd have to do for jai alai :-D.

                David1987 wrote:

                I'm sure it happens all the time in games, media transcoding, and compression applications

                It's interesting in my line of work; we make commercial ink-jet printers. We print fully variable data up to 18 inches wide, at 600x600 resolution, at up to 1000 feet of paper per minute, in full process color. If you do the math, that translates to a gegrundous amount of data per second. I'll admit, some of this is accomplished with custom hardware. We've discovered, however, that some algorithm issues are far more significant than low-level considerations. A fair number of the performance problems we've dealt with in the past were in the part of the product that communicates with the host, rather than the part that translates incoming data into bitmaps or moves bitmaps out to the hardware. You'd think the part that handled all that data would be the problem. Improving our algorithms for handling host communications (network protocols, exception conditions, multiple connections, and threading issues) has gained us significant performance benefits.
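Doing that math, under assumptions of mine that the post doesn't state (four process-color planes, i.e. CMYK, at 1 bit per dot per plane, and ignoring protocol overhead):

```c
/* Raw bitmap data rate in bits per second.
   18 in x 600 dpi              = 10,800 dots across a row
   1000 ft/min x 12 x 600 / 60  = 120,000 rows per second */
long long print_data_rate(int width_in, int dpi, int feet_per_min, int planes) {
    long long dots_across  = (long long)width_in * dpi;
    long long rows_per_sec = (long long)feet_per_min * 12 * dpi / 60;
    return dots_across * rows_per_sec * planes;
}
/* print_data_rate(18, 600, 1000, 4) = 5,184,000,000 bits/s:
   about 5.2 Gbit/s, or roughly 648 MB/s of bitmap data. */
```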

                  PIEBALDconsult wrote (#10), in reply to Gary Wheeler:

                  Gary Wheeler wrote:

                  jai alai

                  We don't have the balls for that.

                    PIEBALDconsult wrote (#11), in reply to David1987's opening post:

                    But remember that your app isn't the only one in town and should be a good neighbor.

                      Klaus Werner Konrad wrote (#12), in reply to David1987's opening post:

                      Hmm... maybe I'm not up to date, but do you remember the thread 'Why I will not hire .NET programmers'? If you have to work with such complex frameworks, there is so much you have no control over that cache misses and the like don't seem controllable. And AFAIK the .NET languages don't produce real machine code but MSIL (which is translated to machine code at runtime), so what can you optimize there?

                      The article describes clearly that (especially inside loops) you should do only the least required operations; everything else (like precautions etc.) should be performed outside the loop. Of course, today's compilers can help you in many ways by optimizing the code themselves, but if you optimize it yourself beforehand, you 1) can be (almost) sure that it is optimal, 2) document the optimization for whoever has to maintain your code, and 3) don't rely on compiler options and versions. Also, sometimes (like me now) you have to struggle with development systems like TeamDeveloper that don't utilize multiple cores, multiple threads, etc. - and then every single instruction REALLY counts.
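The rule of doing only the least required operations inside the loop, in its most common form - hoisting a loop-invariant computation by hand (a minimal sketch; the names are mine):

```c
#include <stddef.h>
#include <string.h>

/* Before: strlen(s) is an O(n) call sitting in the loop condition, so
   the whole function is O(n^2) if the compiler can't hoist it. */
size_t count_spaces_slow(const char *s) {
    size_t count = 0;
    for (size_t i = 0; i < strlen(s); i++)
        if (s[i] == ' ') count++;
    return count;
}

/* After: the invariant is computed once, outside the loop. A compiler
   may do this for you, but only if it can prove s is never modified -
   hoisting it by hand removes the doubt and documents the intent. */
size_t count_spaces_fast(const char *s) {
    size_t count = 0;
    size_t len = strlen(s);
    for (size_t i = 0; i < len; i++)
        if (s[i] == ' ') count++;
    return count;
}
```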

                        David1987 wrote (#13), in reply to Klaus Werner Konrad:

                        Ah yes, if you're using frameworks there's not much you can do. I've found that, in general, MSIL compiles to native code in a way that is often predictable (when it concerns native types such as int), and the code is generally not reordered in any way, so you can still control the fine-grained parallelism somewhat - and, for example, the multiple-accumulator trick still works fine :)

                          Klaus Werner Konrad wrote (#14), in reply to David1987:

                          "Ah yes" - ah yes, you can assume that 'in general' MSIL 'compiles' your code... BUT that's not my point - it was that you CANNOT RELY on the framework optimizing your code. And I doubt it can do ANY serious optimization, because the time spent trying to optimize would itself affect the average performance...

                            David1987 wrote (#15), in reply to Klaus Werner Konrad:

                            It does inlining and loop-invariant code motion. As for scalar optimization, well, it optimizes multiplications by small constants and divisions by constants - that's about it.
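Spelled out, those two scalar optimizations look like this (the 0xCCCCCCCD value is the standard reciprocal constant for unsigned 32-bit division by 10; the function names are mine):

```c
#include <stdint.h>

/* Multiplication by a small constant becomes shifts and adds:
   x * 9 = x * 8 + x. */
uint32_t times9(uint32_t x) {
    return (x << 3) + x;
}

/* Division by a constant becomes a multiply-high and a shift:
   n / 10 = (n * ceil(2^35 / 10)) >> 35 for any 32-bit unsigned n. */
uint32_t div10(uint32_t n) {
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
}
```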

                              mirdones wrote (#16), in reply to David1987's opening post:

                              So you are supposing everyone is using a computer architecture that has caches, multiple cores and so on. Many people actually do not work on PCs or Macs but with embedded devices, where CPU instructions still count a lot. But indeed, "to go fast, do smart" is the best advice. To do smart, though, you need to know the architecture you are working on. There is no silver bullet.

                                David1987 wrote (#17), in reply to mirdones:

                                I program for a good old z80 too sometimes, but surely that's not where the majority of programming happens?

                                  BrainiacV wrote (#18), in reply to David1987:

                                  There was never a day when self-modifying code was a good idea. Many a time I'd be tripped up by changing a branch statement and then have that branch statement be overwritten by a programmer in another module because he thought he was "Oh so clever" by replacing conditional branches with hard branches and NOPs to make it run faster. I ranked that with those that would jump into the middle of a multibyte instruction, expecting it to have the effect of a NOP and thereby save themselves 1 byte. I sorta see their logic, it was during the time when we had to jump up and down on those 1KB EPROMs to make the program fit, but I always thought the better solution was to restructure the code and be more clever than to resort to tricks that were unmaintainable.

                                  Psychosis at 10 Film at 11

                                    David1987 wrote (#19), in reply to BrainiacV:

                                    There were processors on which SMC was required in order to do certain normal things, such as returning from a subroutine. More often, I've used SMC because things such as bit-tests would only take a constant, and having 8 copies would have been an unacceptable waste of space. Somewhat less often, I've had to resort to SMC like this:

                                     ld   hl,AddressOfLock
                                     call _jphl          ; _jphl does jp (hl) - an indirect call
                                     ..

                                     AddressOfLock:
                                     or   a              ; clear carry: "lock not acquired"
                                     ld   a,$C9          ; $C9 = opcode for ret
                                     inc  hl
                                     inc  hl
                                     ld   (hl),a         ; patch the routine so later calls bail out early
                                     scf                 ; set carry: "lock acquired"
                                     ret

                                     Which is, as far as I know, the only way to do that atomically on a Z80.

                                      Old Ed wrote (#20), in reply to David1987:

                                      Are those actual instructions? They're beautiful! It's been so long... sorry, I'm having a moment!

                                        Fabio Franco wrote (#21), in reply to OriginalGriff:

                                        OriginalGriff wrote:

                                        Ah! For the days when you could double the performance of your PC just by changing the HDD interleave...

                                        Hmm, I don't remember that, but I do remember loving to push the Turbo button that made my PC's MHz display go from 33 to 66.

                                          OriginalGriff wrote (#22), in reply to Fabio Franco:

                                          I wired mine so it changed from 8 to 16. Made people think it was slower than it was, so it didn't suddenly "lose parts" overnight...
