A piece of criticism on "to go fast, do less"

Tags: css, data-structures, architecture, tutorial, learning
BrainiacV wrote:

There was never a day when self-modifying code was a good idea. Many a time I'd be tripped up by changing a branch statement and then having that branch statement be overwritten by a programmer in another module because he thought he was being "oh so clever" by replacing conditional branches with hard branches and NOPs to make it run faster. I ranked that with those that would jump into the middle of a multibyte instruction, expecting it to have the effect of a NOP and thereby save themselves 1 byte. I sorta see their logic - it was during the time when we had to jump up and down on those 1KB EPROMs to make the program fit - but I always thought the better solution was to restructure the code and be more clever than to resort to tricks that were unmaintainable.

    Psychosis at 10 Film at 11

Earl Truss (#23) replied:

Self-modifying code isn't really the problem here. If one module is changing code in another one, you've got bigger problems than SMC issues. There were cases where self-modifying code was good, like when a subroutine would modify itself on first entry so the initialization code was never executed again. This could save a lot of time on succeeding calls. One example I remember (dimly, so this may not be exactly right) was on CDC machines, where the first call to SYS= would determine if the hardware exchange instruction existed on the machine. If it did, the subroutine modified itself so it would always go to the code that executed that instruction. If it did not, the subroutine would wait in a loop for the signal that the system call had been accepted (RA+1 would go to zero).
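
The same "patch yourself on first entry" idea survives in high-level code as a function pointer that replaces itself once the one-time check has run - a minimal C++ sketch with made-up names, not anything from the CDC example:

    #include <cstdio>

    // Hypothetical "system call" stub that swaps itself out after the first
    // call, so the capability check runs exactly once.
    static bool sys_fast(int req) { std::printf("fast path: %d\n", req); return true; }
    static bool sys_slow(int req) { std::printf("slow path: %d\n", req); return true; }

    static bool sys_first(int req);               // forward declaration
    static bool (*sys_call)(int) = sys_first;     // starts at the init stub

    static bool sys_first(int req)
    {
        bool has_fast_exchange = true;            // imagine a one-time hardware probe here
        sys_call = has_fast_exchange ? sys_fast : sys_slow;  // "patch" the entry point once
        return sys_call(req);                     // then take the chosen path
    }

    int main()
    {
        sys_call(1);   // first call: runs the probe, then the chosen path
        sys_call(2);   // later calls: skip straight to the chosen path
    }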

Fabio Franco wrote:

      OriginalGriff wrote:

      Ah! For the days when you could double the performance of your PC just by changing the HDD interleave...

Humm, I don't remember that, but I do remember loving to push the Turbo Button that made my PC's MHz display go from 33 to 66.

firegryphon (#24) replied:

      Why did it do that anyway?  Was there ever a point of not being in turbo?

      ragnaroknrol: Yes, but comparing a rabid wolverine gnawing on your face while stabbing you with a fountain pen to Vista is likely to make the wolverine look good, so it isn't exactly that big of a compliment.

In reply to firegryphon (#24):

Fabio Franco (#25) replied:

I'm not really sure; perhaps the purpose was power saving, heat reduction, or even extending the processor's lifetime... Those were poor times...

In reply to Fabio Franco (#25):

firegryphon (#26) replied:

I know why some had switches in earlier models: if it ran too fast, your games would be all out of whack (as no one wrote them expecting faster CPUs). I can't say I miss the days of having to put in individual memory chips, or aligning those SIP pins. I <3 my DIMM/SO-DIMM. I also really, really love the newer BGA processors. Gosh, it's easy to build and upgrade a computer now.

          ragnaroknrol: Yes, but comparing a rabid wolverine gnawing on your face while stabbing you with a fountain pen to Vista is likely to make the wolverine look good, so it isn't exactly that big of a compliment.

In reply to firegryphon (#26):

Fabio Franco (#27) replied:

            firegryphon wrote:

            aligning those SIP pins

Ohh... good old days. The last memory I have of that is of a small blue box on the motherboard with several white pins... I think I sent my dad into a fury over them once... I actually miss that. I don't know if that's because I was a kid and loved to mess things up (and often break them), but now it's so simple that it's no fun to upgrade anymore...

David1987 wrote:

Many of you have probably heard of it... And it's false. It used to be true when CPUs were simple in-order, non-superscalar machines that did exactly what you told them in exactly that order. These days, it is still true that there is a maximum number of instructions per second, but you almost never reach the theoretical maximum throughput. Using the full capacity of a CPU is hard work. So in order to go fast, doing less isn't the only option - in fact the fastest way may look like it's doing more (or indeed, in some cases, actually do more). For example, when you're taking the sum of an array of integers, it is usually faster (it depends on the CPU micro-architecture, of course) to use more than one accumulator. But that's not all. To go fast these days, try to avoid cache misses. They're far more often the bottleneck than the CPU, and all of the above is only valid when your CPU actually is the bottleneck. Trying to avoid cache misses isn't really "doing less" either - a cache miss costs the same no matter how many bytes of the cache line you end up using, so often you can "do smarter" by reorganizing your data and/or trading cache misses for more instructions (which are far cheaper). To go fast, do smarter.
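
To make the multiple-accumulator point concrete, here is a minimal C++ sketch (illustrative only - whether the second version actually wins depends on the compiler and the micro-architecture):

    #include <cstdint>
    #include <cstddef>

    // Single accumulator: each addition depends on the previous one,
    // so the loop is limited by the latency of one add per element.
    int64_t sum_one(const int32_t* a, size_t n) {
        int64_t s = 0;
        for (size_t i = 0; i < n; ++i) s += a[i];
        return s;
    }

    // Two accumulators: the two dependency chains can run in parallel
    // on a superscalar core, often improving throughput noticeably.
    int64_t sum_two(const int32_t* a, size_t n) {
        int64_t s0 = 0, s1 = 0;
        size_t i = 0;
        for (; i + 1 < n; i += 2) { s0 += a[i]; s1 += a[i + 1]; }
        if (i < n) s0 += a[i];      // handle an odd-length tail
        return s0 + s1;
    }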

ely_bob (#28) replied:

Shameless plug: How To Avoid Cache Misses

I'd blame it on the Brain farts.. But let's be honest, it really is more like a Methane factory between my ears some days than it is anything else...
              -----
              "The conversations he was having with himself were becoming ominous."-.. On the radio...

In reply to David1987's post quoted above:

SeattleC (#29) replied:

Doing less - that is, removing computation from inner loops - is virtually certain to improve performance on any architecture. After that, doing computation in parallel, or with concern for cache locality, may provide additional optimization on architectures with, respectively, multiple cores and multi-level memory architectures. However, performing these hardware-dependent optimizations on anything but optimized single-thread code is almost certainly not the route to maximizing performance. And it is definitely the route to more-expensive-to-maintain code, because these optimizations must be revisited for each new architecture supported. If you are writing a Windows application that will end up running on everything from a 486 to a Core i7, plus GPUs, you have a truly massive job to make your code optimal on all systems. Pretty much the only way not to fail utterly is to use a parallelization library like TBB to perform the parallelization at runtime in an architecture-dependent way.
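
For the "remove computation from inner loops" part, a trivial C++ sketch of hoisting a loop-invariant value (illustrative, not code from the post):

    #include <vector>
    #include <cmath>
    #include <cstddef>

    // Before: std::sqrt(scale) is recomputed on every iteration.
    double total_slow(const std::vector<double>& v, double scale) {
        double t = 0.0;
        for (size_t i = 0; i < v.size(); ++i)
            t += v[i] * std::sqrt(scale);
        return t;
    }

    // After: the loop-invariant value is hoisted out, so the inner loop does less.
    double total_fast(const std::vector<double>& v, double scale) {
        const double k = std::sqrt(scale);   // computed once
        double t = 0.0;
        for (double x : v) t += x * k;
        return t;
    }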

In reply to SeattleC (#29):

David1987 (#30) replied:

                  Are you sure? I give you this:

        ; ecx = index of starting node
        ; rdx -> edges

        xor eax, eax
        or r8, -1
        xor r9d, r9d
        jmp next
        .align 16
    loop:
        add eax, 1
        lea r10, [r9-1]
        bsf rcx, r9
        nop     ; for alignment of 'next'
        and r9, r10
        nop     ; for alignment of 'next'
    next:
        btr r8, rcx
        or r9, [rdx+rcx*8]
        and r9, r8
        jnz loop
        ret
                  

If you remove the two nops (thereby "doing less" in the inner loop), the time taken goes up by about 5% on a Core 2. This code will never run on a Pentium because it's 64-bit. Suppose it was 32-bit and ran on an 80486: the nops would slow it down - just a little, nothing to worry about.

                  Member 2941392 wrote:

                  If you are writing a Windows application that will end up running on everything from a 486 to a Core i7

It won't. Windows XP requires at least a Pentium. Are you honestly going to support Win98 or earlier? In any case, it is usually acceptable to optimize for the most popular architecture even though that will slow it down on others. The slowdown is almost always tiny, whereas the speedup may be 2x in some cases. These days that would be Core 2 or i7. By the way, TBB doesn't help at all for single-core architectures. Things like "branches are bad" and "align branch targets" apply to almost all architectures (except the ancient ones that you don't have to support anyway) - doing it for the ones it doesn't apply to doesn't hurt that badly.

In reply to David1987's post quoted above:

Stefan_Lang (#31) replied:

For single-processor systems the line still holds, but the 'less' part refers to the number of instructions the CPU must perform, not the amount of code: it often takes a lot more code to make sure the CPU performs the least possible number of instructions. Most of the time you achieve that by branching on multiple cases (or using virtual functions for basically the same effect) and choosing the best algorithm for each case separately, rather than using one standard algorithm that works for all cases but may be inefficient for a lot of common special cases - which would be a lot less code. For multiprocessor systems, or programs taking advantage of massive parallelism through GPGPU, the fastest code is usually neither the shortest nor the one with the least total number of instructions, since the main point here is making as much of the code parallel as possible (or at least the parts that consume notable amounts of CPU/GPU time). At the very least you have to add code for forking and joining, but most of the time you'll also have to use new algorithms; many of the standard algorithms simply don't scale well on parallel systems, or cannot be parallelized at all. I do not think cache misses or similar low-level considerations are or should be a factor. That is the task of the compiler, not the developer. If you're having trouble on that level, you're using a bad compiler, IMHO.
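
A tiny C++ sketch of that "more code, fewer executed instructions" idea - hypothetical names, assuming a sorted, non-empty input is the common special case:

    #include <vector>
    #include <algorithm>

    // General path: works for every input, but always pays the full selection cost.
    int median_general(std::vector<int> v) {
        std::nth_element(v.begin(), v.begin() + v.size() / 2, v.end());
        return v[v.size() / 2];
    }

    // An extra branch picks a cheaper path when the common special case applies:
    // more source code, fewer instructions executed for already-sorted input.
    int median_dispatch(const std::vector<int>& v) {
        if (std::is_sorted(v.begin(), v.end()))
            return v[v.size() / 2];      // O(1) after an O(n) check
        return median_general(v);        // fall back to the general algorithm
    }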

In reply to Stefan_Lang (#31):

David1987 (#32) replied:

Well, not quite. Single processor or not doesn't really have anything to do with it. The problem is the out-of-order superscalar nature of processors these days - further complicated by execution port clashes, register read stalls and such. Writing code that uses the full theoretical throughput of even a single core is hard and often impossible. And as for blaming cache thrashing on the compiler... then all compilers would be bad. There are some things they can do, such as loop skewing (and not many compilers do that), but designing data structures and algorithms with good cache behaviour is not something a compiler can do for you. You may not want it to be a factor, but it certainly is! A CPU these days can execute several hundred instructions in the time one cache miss takes - it is by no means an insignificant factor.
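
A typical example of a data-structure decision no compiler will make for you - a hypothetical array-of-structures versus structure-of-arrays sketch, with made-up field names:

    #include <vector>

    // Array of structures: summing x drags vx/vy/mass through the cache too.
    struct ParticleAoS { float x, y, vx, vy, mass; };

    float sum_x_aos(const std::vector<ParticleAoS>& p) {
        float s = 0.0f;
        for (const auto& q : p) s += q.x;   // uses 4 of every 20 bytes loaded
        return s;
    }

    // Structure of arrays: the same loop now streams through a dense array of x,
    // so every byte of every cache line fetched is useful.
    struct ParticlesSoA {
        std::vector<float> x, y, vx, vy, mass;
    };

    float sum_x_soa(const ParticlesSoA& p) {
        float s = 0.0f;
        for (float v : p.x) s += v;
        return s;
    }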

In reply to David1987 (#32):

Stefan_Lang (#33) replied:

You're right, I hadn't considered the way even single-processor systems already parallelize instructions. So, I guess, the fastest code is code that is designed for parallelism, even on a single-processor system. Regarding low-level stuff, that is something I stopped worrying about some 15-20 years ago, so I have to admit I have little knowledge of these effects in modern processor architectures. Still, it is often only a very small part of a program that is causing a performance bottleneck, and IME, more often than not, it is a low-quality algorithm that is to blame. Not saying that dealing with low-level aspects couldn't improve performance even further, but when you are developing an application that should run on, e.g., any PC with a certain OS, then you don't really know which processor architecture to optimize for in the first place.

David1987 wrote:

                          I program for a good old z80 too sometimes, but surely that's not where the majority of programming happens?

mirdones (#34) replied:

Actually, I program operational flight programs for fighter aircraft. CPU usage, bus load, and memory management, among other things, are all very critical for our development, as our resources are very limited and made to last for decades. It is not the same as a microcontroller, but it is also very different from a normal PC architecture.

In reply to David1987 (#30):

SeattleC (#35) replied:

                            Well I guess our opinions differ.

                            David1987 wrote:

                            Windows XP requires at least a Pentium. Are you honestly going to support Win98 or earlier?

                            You forget Windows 2000, and yeah, sometimes you have to support all the way back to Windows 2000. You don't get to choose. Your employer/customer chooses.

                            David1987 wrote:

                            If you remove the two nops (thereby "doing less" in the inner loop) the time taken goes up by about 5% on a core2.

Yes, yes, you're very manly; you can add no-ops to improve performance on a particular architecture, and cost performance on another architecture. But that was entirely the point; removing meaningful computation is always a win. Adding architecture-specific optimization is a win on that architecture, and a loss on some other one. With currently shipping x86 architectures ranging from embedded 486s through single-core Atom processors up to monstrous Core i7s, not to mention the fine product lineup from AMD, architecture-specific optimization is problematic. PCs are far less homogeneous than they were 10 years ago. The advice to optimize for an ideal, most-likely target is less valuable than it once was. With today's heterogeneous PC environment, many optimizations can only be done at run time, when you know how many cores you have, how much cache, etc. Hence the use of optimizing tools like TBB. Yeah, TBB cannot always improve performance on a single-core system, though that would depend on the balance between memory fetching and computation. But it can dynamically take advantage of the number of cores available, so you don't wire a limitation into the code. So my advice would be that if you remove meaningful computation and stop there, you have done well. You can go further with processor-specific optimizations, but only if your target is a specific processor out of a sea of possibilities, or if you want to write processor-specific libraries because you are desperate for performance at any price.
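
For reference, the kind of run-time adaptation being described - a minimal tbb::parallel_reduce sketch (illustrative only, assuming the standard TBB headers are available); how the range is split across cores is decided when the program runs, not when it is written:

    #include <vector>
    #include <functional>
    #include <tbb/parallel_reduce.h>
    #include <tbb/blocked_range.h>

    // Sum a vector using however many cores the machine happens to have;
    // the work splitting is decided at run time, not wired into the code.
    double parallel_sum(const std::vector<double>& a) {
        return tbb::parallel_reduce(
            tbb::blocked_range<size_t>(0, a.size()),   // range to split
            0.0,                                       // identity value
            [&](const tbb::blocked_range<size_t>& r, double partial) {
                for (size_t i = r.begin(); i != r.end(); ++i) partial += a[i];
                return partial;                        // per-chunk partial sum
            },
            std::plus<double>());                      // combine partial sums
    }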

In reply to SeattleC (#35):

David1987 (#36) replied:

                              Even if you have to support "everything", why would it have to be optimal on everything? That surely isn't worth the time it would cost..

                              Member 2941392 wrote:

                              removing meaningful computation is always a win.

                              Well no. Almost always, sure. Not always though. You may cause a register read stall that way, for example.

In reply to SeattleC (#35):

David1987 (#37) replied:

                                In any case, the point was that "do less" is a gross oversimplification that doesn't even come close to actual optimization.

In reply to David1987's post quoted above:

Zot Williams (#38) replied:

Your arguments don't really prove the statement to be false. Sure, we can finish a task earlier ("faster") by splitting the work to be done over many ALUs or cores. But this actually means each pathway/core is doing less work. It's simple division of labour. (Of course, making use of parallel pathways usually incurs overheads, so strictly speaking the code is actually running "slower", but completing earlier due to parallelism.) In the case of caches, what can you do to reduce misses? 1) Organise your data access patterns to make more efficient use of cache coherence. In other words, optimise your code so the cache/bus/memory systems have less work to do. 2) Reduce the size of the data being cached so more fits into a line of cache, thus making the cache more effective. So: doing/using less to achieve the same task. As for "used to be true", none of what you have proposed is new! Parallel processing has been used for years (a top example is Henry Ford using it to revolutionise the car manufacturing process, but of course 'many hands make light work' is an ancient concept). Or if you want to be a bit more geeky, optimising cache/memory access has many similarities to optimising programs on drum memory (see Wikipedia). Different technology, same underlying problem.
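
Point 2 in miniature - a hypothetical C++ sketch of shrinking each element (where its value range allows it) so that more of them fit in one cache line:

    #include <cstdint>
    #include <vector>

    // 16 bytes per sample: a 64-byte cache line holds only 4 of them.
    struct SampleWide   { int64_t id; int32_t value; int32_t flags; };

    // 8 bytes per sample (assuming the value ranges permit it): 8 samples per
    // cache line, so a scan touches half as many lines for the same data.
    struct SamplePacked { int32_t id; int16_t value; int16_t flags; };

    int64_t total(const std::vector<SamplePacked>& s) {
        int64_t t = 0;
        for (const auto& x : s) t += x.value;   // streams twice as densely
        return t;
    }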

In reply to Zot Williams (#38):

David1987 (#39) replied:

                                    All this "having less to do" is the result of having more code and executing more instructions, so it's just a simple as "doing less" - it's an example of "doing smarter". Using the cache better almost always requires more instructions.

                                    Zot Williams wrote:

                                    As for "used to be true", none of what you have proposed is new!

Maybe not. It was true on the z80 and the 80286, for example. The time a piece of code would take was just the sum of the times of all instructions executed - doing less would directly result in doing it faster. So it used to be true. Fancy things such as out-of-order superscalar processing were something for mainframes and supercomputers until the P5 made it mainstream.

David1987 wrote:

                                      There were processors on which SMC was required in order to do certain normal things, such as returning from a subroutine. More often, I've used SMC because things such as bit-tests would only take a constant, and having 8 copies would have been an unacceptable waste of space. Somewhat less often, I've had to resort to SMC like this:

                                      ld hl,AddressOfLock
                                      call _jphl ; jp (hl) at that label
                                      ..

                                      AddressOfLock:
                                      or a
                                      ld a,$C9 ; C9 = opcode for ret
                                      inc hl
                                      inc hl
                                      ld (hl),a
                                      scf
                                      ret

Which is, as far as I know, the only way to do that atomically on a z80.

BrainiacV (#40) replied:

                                      My Z-80 is a little rusty (last program I wrote in Z-80 code was the Bally Astrocade Biorhythm cartridge back in '80). But I had a friend look over your example and he came back with this comment/suggestion...


Odd code. Z80 has bit tests, and yes, they only take a constant for the bit number, plus a second operand which is a register, the address pointed to by HL, or the address pointed to by one of the 2 index registers + d (same as set & reset {res}). The Z (zero) flag indicates the state (1 = zero, else 0 - backwards, I always thought). An easier way to test any bit dynamically would be:

                                      LD b, value // Value with bit to be tested
                                      LD a, mask // b-value of bit(s) to be tested.
                                      CALL TEST
                                      :
                                      :
                                      TEST
                                      AND b // a will be zero if the mask bit was not set and 1 if the bit was set,
                                      // z=1 if a=0 else 0 p=1 if parity even else 0
                                      //c=0, h=1, n=0, s=0
                                      RET

                                      Psychosis at 10 Film at 11

In reply to BrainiacV (#40):

David1987 (#41) replied:

Sure, except that inlining the AND is actually shorter (and faster). Of course this doesn't always quite work out, since you're destroying A - or, if you're testing a bit of A, you're destroying some other register to hold the bitmask, and there aren't that many registers. Doing it the "easy" way could especially hurt in a loop, where ld a,mask / and b is actually slower than a bit {self-modified-stuff}; just and b alone would be 1 cc faster, but being able to spare A is not very common, and extra loads and/or pops and pushes would ruin the performance even more. As for the oddness of the code, I can't disagree. It is decidedly odd, and doesn't work on any newer processors. I know of no other way to atomically take a lock in the presence of NMIs, though (if there are only normal interrupts, di/ei would work - maybe with some extra logic to only ei if interrupts were enabled).
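
For contrast with the z80 trick above, the modern non-self-modifying way to take a lock atomically is a single atomic read-modify-write instruction - a minimal C++ sketch, not anything from the thread:

    #include <atomic>

    std::atomic_flag lock_taken = ATOMIC_FLAG_INIT;

    // Returns true if we acquired the lock, false if someone else (another
    // thread, or an interrupt handler) already holds it - one atomic
    // instruction, no code patching required.
    bool try_take_lock() {
        return !lock_taken.test_and_set(std::memory_order_acquire);
    }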
