Code Project - The Lounge
A piece of criticism on "to go fast, do less"

Tags: css, data-structures, architecture, tutorial, learning
41 Posts 17 Posters
  • D David1987

Many of you have probably heard of it... and it's false. It used to be true when CPUs were simple in-order, non-superscalar machines that did exactly what you told them, in exactly that order. These days it is still true that there is a maximum number of instructions per second, but you almost never reach the theoretical maximum throughput. Using the full capacity of a CPU is hard work. So in order to go fast, doing less isn't the only option - in fact the fastest way may look like it's doing more (or indeed, in some cases, actually do more). For example, when you're taking the sum of an array of integers, it is usually faster (it depends on the CPU micro-architecture, of course) to use more than one accumulator. But that's not all. In order to go fast these days, try to avoid cache misses. They're far more often the bottleneck than the CPU, and all of the above is only valid when your CPU actually is the bottleneck. Trying to avoid cache misses isn't really "doing less" either - a cache miss costs the same no matter how many bytes of the cache line you end up using, so often you can "do smarter" by reorganizing your data and/or trading cache misses for more instructions (which are far cheaper). To go fast, do smarter.
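A minimal C sketch of the multiple-accumulator trick described above (hypothetical illustration; the names are invented, and the actual speedup depends on the micro-architecture):

```c
#include <stddef.h>
#include <stdint.h>

/* One accumulator: each add depends on the previous one, so the loop
   runs at roughly one add-latency per element. */
int64_t sum_naive(const int32_t *a, size_t n) {
    int64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Two independent accumulators: the two add chains can execute in
   parallel on a superscalar core, often nearly doubling throughput. */
int64_t sum_two_acc(const int32_t *a, size_t n) {
    int64_t s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (i < n)          /* odd leftover element */
        s0 += a[i];
    return s0 + s1;
}
```

Both functions return the same sum; only the dependency structure differs, which is exactly the "fastest way may look like it's doing more" point.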

Klaus Werner Konrad
#12

Hmm... maybe I'm not up to date, but do you remember the thread 'Why I will not hire .Net programmers'? If you have to work with such complex frameworks, there is THIS much that you have no control over, so 'cache misses' and the like don't seem controllable. And AFAIK the .Net languages don't produce real machine code but MSIL (which is translated at runtime), so what can you optimize there? The article describes clearly that (especially inside loops) you should do only the least required operations; everything else (like precautions etc.) should be performed outside the loop. Of course, compilers today can help you in many ways by optimizing the code themselves, but if you optimize it yourself beforehand, you 1) can be (almost) sure that it is optimal, 2) document the optimization for the guy who must maintain your code, and 3) don't rely on compiler options and versions. Also, sometimes (like me now) you have to struggle with development systems like TeamDeveloper that don't utilize multiple cores, multiple threads etc. - then every single instruction REALLY counts.
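The "least required operations inside the loop" rule can be sketched like this (hypothetical C example, not from the post):

```c
#include <stddef.h>
#include <string.h>

/* Before: strlen(s) is re-evaluated on every iteration, turning a
   linear scan into quadratic work if the compiler can't hoist it. */
size_t count_x_slow(const char *s) {
    size_t count = 0;
    for (size_t i = 0; i < strlen(s); i++)
        if (s[i] == 'x') count++;
    return count;
}

/* After: the loop-invariant length is computed once, outside the loop. */
size_t count_x_fast(const char *s) {
    size_t count = 0;
    size_t len = strlen(s);
    for (size_t i = 0; i < len; i++)
        if (s[i] == 'x') count++;
    return count;
}
```

A good optimizer may hoist this itself, which is exactly the "don't rely on compiler options and versions" point: doing it by hand makes the cost explicit.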

• K Klaus Werner Konrad

David1987
#13

Ah yes, if you're using frameworks there's not much you can do... I've found that, in general, MSIL compiles to native code in a way that is often predictable (where native types such as int are concerned); generally the code is not reordered in any way, so you can still control the fine-grained parallelism somewhat - and, for example, the multiple-accumulator trick still works fine :)

• D David1987

Klaus Werner Konrad
#14

"Ah yes" - ah yes, you can assume that 'in general' MSIL 'compiles' your code... BUT that's not my point - it was that you CANNOT RELY on the framework optimizing your code - and I doubt that it can do ANY optimizations at all, because trying to optimize would affect the average performance...

• K Klaus Werner Konrad

David1987
#15

It does inlining and loop-invariant code motion; as for scalar optimization, well, it optimizes multiplications by small constants and divisions by constants - that's about it.
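The multiplication-by-small-constants rewrite mentioned here is classic strength reduction; a hand-written C sketch of the idea (hypothetical example, shown at the source level rather than in generated code):

```c
#include <stdint.h>

/* x * 10 rewritten as shifts and adds: 10*x == 8*x + 2*x.
   The shifts go through an unsigned copy so the arithmetic is
   well-defined for negative inputs too. */
int32_t mul10(int32_t x) {
    uint32_t ux = (uint32_t)x;
    return (int32_t)((ux << 3) + (ux << 1));
}
```

A compiler or JIT applies exactly this kind of rewrite because shifts and adds are cheaper than a general multiply on many cores.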

• D David1987

mirdones
#16

So you are supposing everyone is using a computer architecture that has caches, multiple cores and so on. Many people actually do not work on PCs or Macs but with embedded devices, so CPU instructions still count a lot. But indeed, "to go fast, do smart" is the best advice. To do smart, though, you need to know the architecture you are working on. There is no silver bullet.

• M mirdones

David1987
#17

I program for a good old z80 too sometimes, but surely that's not where the majority of programming happens?

• D David1987

Or the days when self modifying code was a Good Thing and you had no multiply instruction :)

BrainiacV
#18

There was never a day when self-modifying code was a good idea. Many a time I'd be tripped up by changing a branch statement, only to have it overwritten by a programmer in another module because he thought he was "oh so clever" replacing conditional branches with hard branches and NOPs to make it run faster. I ranked that with those who would jump into the middle of a multibyte instruction, expecting it to have the effect of a NOP and thereby save themselves 1 byte. I sorta see their logic, it was during the time when we had to jump up and down on those 1KB EPROMs to make the program fit, but I always thought the better solution was to restructure the code and be more clever than to resort to tricks that were unmaintainable.

Psychosis at 10 Film at 11

• B BrainiacV

David1987
#19

There were processors on which SMC was required in order to do certain normal things, such as returning from a subroutine. More often, I've used SMC because things such as bit-tests would only take a constant, and having 8 copies would have been an unacceptable waste of space. Somewhat less often, I've had to resort to SMC like this:

    ld hl,AddressOfLock
    call _jphl      ; _jphl does jp (hl)
    ..

AddressOfLock:
    or a            ; 1 byte: clear carry
    ld a,$C9        ; 2 bytes: $C9 = opcode for ret
    inc hl
    inc hl
    inc hl          ; hl -> AddressOfLock+3, i.e. the first inc hl
    ld (hl),a       ; patch a ret over it; later callers return with carry clear
    scf             ; first caller leaves with carry set: lock acquired
    ret

Which is, as far as I know, the only way to do that atomically on a z80.
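For contrast, the modern way to get the same acquire-once behavior is an atomic test-and-set rather than patching an opcode; a C11 sketch (hypothetical, assuming `<stdatomic.h>` is available):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* One flag per lock; ATOMIC_FLAG_INIT is the clear (unlocked) state. */
static atomic_flag lock = ATOMIC_FLAG_INIT;

/* Returns true for the caller that acquires the lock, false for
   everyone after that - the role the carry flag plays in the z80 trick. */
bool try_acquire(void) {
    return !atomic_flag_test_and_set(&lock);
}

void release(void) {
    atomic_flag_clear(&lock);
}
```

`atomic_flag_test_and_set` is the one operation C11 guarantees to be lock-free on every conforming implementation, which is why it maps so directly onto this kind of hand-rolled lock.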

• D David1987

Old Ed
#20

Are those actual instructions? They're beautiful! It's been so long... sorry, I'm having a moment!

• OriginalGriff

Very true: the days when you could optimise code or data just based on the actual instructions and their relative cycle times are long gone! Level 1 & 2 caches, pipelines and multi-core execution make a lot more difference than raw processor power. Ah, for the days when you could double the performance of your PC just by changing the HDD interleave... That used to be my "Friday-afternoon-look-busy" job - while I played Minesweeper :-O

Real men don't use instructions. They are only the manufacturer's opinion on how to put the thing together. Manfred R. Bihy: "Looks as if OP is learning resistant."

Fabio Franco
#21

OriginalGriff wrote:

Ah! For the days when you could double the performance of your PC just by changing the HDD interleave...

Humm, I don't remember that, but I do remember loving to push the Turbo Button that made my PC's MHz display go from 33 to 66.

• F Fabio Franco

OriginalGriff
#22

I wired mine so it changed from 8 to 16. Made people think it was slower than it was, so it didn't suddenly "lose parts" overnight...

"I have no idea what I did, but I'm taking full credit for it." - ThisOldTony
"Common sense is so rare these days, it should be classified as a super power" - Random T-shirt

• B BrainiacV

Earl Truss
#23

Self-modifying code isn't really the problem here. If one module is changing code in another one, you've got bigger problems than SMC issues. There were cases where self-modifying code was good, like when a subroutine would modify itself on first entry so the initialization code was never executed again. This could save a lot of time on succeeding calls. One example I remember (dimly, so this may not be exactly right) was on CDC machines, where the first call to SYS= would determine whether the hardware exchange instruction existed on the machine. If it did, the subroutine modified itself so it would always go to the code that executed that instruction. If it did not, the subroutine would wait in a loop for the signal that the system call had been accepted (RA+1 would go to zero).
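A portable analogue of that initialize-on-first-entry trick patches a function pointer (data) instead of an instruction; a hypothetical C sketch:

```c
static int do_work(void);
static int do_init_then_work(void);

/* The "entry point" starts at the init path and redirects itself,
   in data rather than in code, after the first call. */
static int (*get_value)(void) = do_init_then_work;

static int cached = 0;

static int do_work(void) {
    return cached;
}

static int do_init_then_work(void) {
    cached = 42;            /* stands in for expensive one-time setup */
    get_value = do_work;    /* later calls skip the initialization */
    return do_work();
}
```

Callers always go through `get_value()`; only the very first call pays for the initialization, just like the self-patching subroutine.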

• F Fabio Franco

firegryphon
#24

Why did it do that anyway? Was there ever a point to not being in turbo?

ragnaroknrol: Yes, but comparing a rabid wolverine gnawing on your face while stabbing you with a fountain pen to Vista is likely to make the wolverine look good, so it isn't exactly that big of a compliment.

• F firegryphon

Fabio Franco
#25

I'm not really sure; perhaps it had the purpose of power saving, heat reduction, or even extending the processor's lifetime... Poor days, those times...

• F Fabio Franco

firegryphon
#26

I know why some had switches in earlier models: if the machine ran too fast, your games would be all out of whack. I can't say I miss the days of having to put in individual memory chips, or aligning those SIP pins. I <3 my DIMM/SO-DIMM. I also really, really love the newer BGA processors. Gosh, it's easy to build and upgrade a computer now.

• F firegryphon

Fabio Franco
#27

firegryphon wrote:

aligning those SIP pins

Ohh... good old days. The last memory I have of that is of a small blue box on the motherboard with several white pins... I think I made my dad furious with them once... I actually miss that. I don't know if that's because I was a kid and loved to mess things up (and often break them), but now it's so simple that it's no fun to upgrade anymore...

• D David1987

ely_bob
#28

Shameless Plug: How To Avoid Cache Misses[^]

I'd blame it on the brain farts... but let's be honest, it really is more like a methane factory between my ears some days than it is anything else...
-----
"The conversations he was having with himself were becoming ominous." - ... On the radio...

• D David1987

SeattleC
#29

Doing less, that is, removing computation from inner loops, is virtually certain to improve performance on any architecture. After that, doing computation in parallel, or with concern for cache locality, may provide additional optimization on architectures with, respectively, multiple cores and multi-level memory hierarchies. However, performing these hardware-dependent optimizations on anything but already-optimized single-threaded code is almost certainly not the route to maximum performance. And it is definitely the route to more-expensive-to-maintain code, because these optimizations must be revisited for each new architecture supported. If you are writing a Windows application that will end up running on everything from a 486 to a Core i7, plus GPUs, you have a truly massive job to make your code optimal on all systems. Pretty much the only way not to fail utterly is to use a parallelization library like TBB to perform the parallelization at runtime in an architecture-dependent way.

• S SeattleC

David1987
#30

Are you sure? I give you this:

    ; ecx = index of starting node
    ; rdx -> edges

    xor eax, eax
    or r8, -1
    xor r9d, r9d
    jmp next
    .align 16
loop:
    add eax, 1
    lea r10, [r9-1]
    bsf rcx, r9
    nop     ; for alignment of 'next'
    and r9, r10
    nop     ; for alignment of 'next'
next:
    btr r8, rcx
    or r9, [rdx+rcx*8]
    and r9, r8
    jnz loop
    ret

If you remove the two nops (thereby "doing less" in the inner loop), the time taken goes up by about 5% on a Core 2. This code will never run on a Pentium because it's 64-bit. Suppose it were 32-bit and ran on an 80486: the nops would slow it down, just a little, nothing to worry about.

Member 2941392 wrote:

If you are writing a Windows application that will end up running on everything from a 486 to a Core i7

It won't. Windows XP requires at least a Pentium. Are you honestly going to support Win98 or earlier? In any case, it is usually acceptable to optimize for the most popular architecture even though that will slow it down on others. The slowdown is almost always tiny, whereas the speedup may be 2x in some cases. These days that would be a Core 2 or i7. By the way, TBB doesn't help at all on single-core architectures. Things like "branches are bad" and "align branch targets" apply to almost all architectures (except the ancient ones that you don't have to support anyway), and doing it for the ones it doesn't apply to doesn't hurt that badly.

• D David1987

Stefan_Lang
#31

For single-processor systems the line still holds, but the 'less' part refers to the number of instructions the CPU must perform, not the amount of code: it often takes a lot more code to make sure the CPU performs the least possible number of instructions. Most of the time you achieve that by branching on multiple cases (or using virtual functions for basically the same effect) and choosing the best algorithm for each case separately, rather than using one standard algorithm that works for all cases but may be inefficient for a lot of common special cases - which would be a lot less code.

For multiprocessor systems, or programs taking advantage of massive parallelism through GPGPU, the fastest code is usually neither the shortest nor the one with the least total number of instructions, since the main point is making as much of the code parallel as possible (or at least the parts that consume notable amounts of CPU/GPU time). At the very least you have to add code for forking and joining, but most of the time you'll also have to use new algorithms; many of the standard algorithms simply don't scale well on parallel systems, or cannot be parallelized at all.

I do not think cache misses or similar low-level considerations are or should be a factor. That is the task of the compiler, not the developer. If you're having trouble on that level, you're using a bad compiler, IMHO.
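For contrast with the leave-it-to-the-compiler view, the data reorganization the opening post describes is usually a source-level layout choice that compilers do not make for you; a hypothetical C sketch of array-of-structs versus struct-of-arrays:

```c
#include <stddef.h>

/* Array-of-structs: summing just `x` drags every particle's unused
   fields through the cache along with it. */
struct particle { double x, y, z, vx, vy, vz; };

double sum_x_aos(const struct particle *p, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += p[i].x;
    return s;
}

/* Struct-of-arrays: the x values are contiguous, so every byte of each
   fetched cache line is actually used. */
struct particles { double *x, *y, *z, *vx, *vy, *vz; };

double sum_x_soa(const struct particles *p, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += p->x[i];
    return s;
}
```

Both functions compute the same result; only the memory layout (and therefore the cache traffic) differs, which is the "do smarter" trade the OP is pointing at.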
