3 way byte merge

  • J jkohler

    A union.... Apparently I've forgotten the basics. Thanks for the reminder - even if it doesn't shave any cycles it'll definitely have a higher cool factor. As to reversing the loop - your point is taken and appreciated.

    Subvert The Dominant Paradigm

    JonHarrison
    #22

    Write the loop as suggested by Michael. The processor might even have its own instructions for copying a byte from one register into a different byte of another (or even for combining multiple byte copies into a single instruction). Look at the assembler for the shift-and-or solution and for the union, and see which is more efficient. I'm interested to hear how much faster it becomes.
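
    For reference, the "union" idea maps onto C# as an explicit-layout struct. This is a minimal sketch, not code from the thread; the struct and field names are made up, and it assumes a little-endian layout where byte 0 is blue:

        using System.Runtime.InteropServices;

        // Hypothetical C#-style "union": four bytes overlapping one packed 32bpp pixel.
        [StructLayout(LayoutKind.Explicit)]
        struct PixelUnion
        {
            [FieldOffset(0)] public uint Packed; // whole pixel
            [FieldOffset(0)] public byte B;
            [FieldOffset(1)] public byte G;
            [FieldOffset(2)] public byte R;
            [FieldOffset(3)] public byte A;
        }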

    • J jkohler

      Hi all, I'm working with a machine vision package which uses a proprietary internal image format. Their package will convert to many common formats (jpg/bmp/png/etc), but the conversion is only done to disk, and I want an in-memory conversion because the to-disk conversion is just plain too slow (>200ms). Their format for 8-bit color images is 3 separate planes (R, G, and B) which combine easily into a 24bpp Bitmap like this:

          private static unsafe void CopyColorPlanes( BitmapData bmp, IntPtr _b, IntPtr _g, IntPtr _r )
          {
              byte* imgPtr = (byte*)bmp.Scan0;

              //
              // single tasked version - one row at a time.....

              int h = bmp.Height;
              int w = bmp.Width;
              int s = bmp.Stride;
              byte* b = (byte*)_b;
              byte* g = (byte*)_g;
              byte* r = (byte*)_r;

              for ( int row = 0; row < h; row++ ) {
                  for ( int col = 0; col < w; col++ ) {
                      *imgPtr++ = *b++;   // blue
                      *imgPtr++ = *g++;   // green
                      *imgPtr++ = *r++;   // red
                  }
                  imgPtr += ( ( s / 3 ) - w ) * 3;  // skip the row padding so the next row starts on the stride boundary
              }
          }


      This works well and is "reasonably" fast - a 1600x1200 color image conversion takes roughly 42ms on a 1.8GHz VIA C7 (the target system). Two questions: 1) Does anyone see anything in the above method that could be tweaked (staying within "pure" C#) to make it faster? I've already tried partitioning the source planes into halves and quarters to make them fit better in the CPU cache, and while this has a small impact it's not significant; interestingly, doing columns in the outer loop runs about 10% faster on an AMD DualCore 4200 - go figger. 2) Is there some native Windows API that will do this job? I know I'll probably end up crafting this in assembly, but I view that as a last resort... and deadlines loom...

      Subvert The Dominant Paradigm

      jkohler
      #23

      To all, thank you one and all for your responses. Of all the suggestions, the one with the largest impact on speed is going to the 32bpp format. This reduced the conversion time from ~50ms to ~42ms, but the extra 1.9MB required for each image is not (in my particular case) a good trade-off. The other suggestions resulted in 1ms or maybe 2ms improvements, with no single technique showing a clear win. This project involves inspecting components in trays - there may be up to 4 images per component with (so far) a max of 52 components per tray, and all these images need to be available to the operator at "a touch of the screen". With this many images (each is 1600x1200) I really just need to dust off the ol' C/ASM skills and convert to a 16bpp format - gaining 2MB per image in the process. Again, thanks for the suggestions.

      Subvert The Dominant Paradigm
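
      As a point of reference for the 16bpp route mentioned above, here is a minimal sketch of packing the three planes straight into RGB565. It is an illustration only: the method name is made up, and it assumes a Format16bppRgb565 destination whose stride is exactly width * 2 (true for a 1600-pixel-wide image).

          // Hedged sketch, not code from the thread: pack the 8-bit R/G/B planes into 16bpp RGB565.
          private static unsafe void CopyColorPlanes565( BitmapData bmp, IntPtr _b, IntPtr _g, IntPtr _r )
          {
              ushort* dst = (ushort*)bmp.Scan0;
              byte* b = (byte*)_b;
              byte* g = (byte*)_g;
              byte* r = (byte*)_r;

              int pixels = bmp.Width * bmp.Height;   // assumes no per-row padding (stride == width * 2)
              for ( int i = 0; i < pixels; i++ ) {
                  // 5 bits red, 6 bits green, 5 bits blue
                  *dst++ = (ushort)( ((*r++ >> 3) << 11) | ((*g++ >> 2) << 5) | (*b++ >> 3) );
              }
          }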

      • J jkohler


        Michael Kingsford Gray
        #24

        Buy a fast memory adapter that pretends to be a disk. Problem solved without any coding.

        • J jkohler


          patbob
          #25

          Loop unrolling in the form of parallel loop execution would help if it were C/C++. You'd need to use array indexing into the rows rather than incrementing pointers. Reading the data in large, data-cache-line-sized chunks would also be a win in C/C++. The instructions to mask and shift can fit in the instruction cache, and they will absolutely scream through the data once they're no longer RAM-limited. Maybe the MMX registers can be applied here? Writing the output as uncached, data-cache-line-sized chunks also helps in C/C++ on some CPU architectures; not writing back through the CPU data cache helps keep the input data in it (= fewer misses).

          Unfortunately, to tune the code to this level you need to be generating native instructions, where you have some chance of being able to predict what the CPU and its caches will be chewing on at any given moment. Running on the C# runtime you're probably SOL, since you have neither any idea nor control of what native code is getting executed. BTW, that operation on a 2MP image takes no more than a couple of milliseconds in our C++ code... and that's without applying all the fancy tricks I mentioned above. Your best bet is probably to just bite the bullet and thunk through to a native language for the high-performance work to get the ~20X speed improvement -- that's what I'd do, unless I was in a mood to learn something about what kind of performance C# can be made to deliver.

          patbob
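
          To make the "thunk through to a native language" route concrete, here is a minimal sketch of the managed side of such a split. Everything here is hypothetical - "PlanarConvert.dll" and its export are invented names, not an existing library:

              using System;
              using System.Runtime.InteropServices;

              // Hypothetical P/Invoke declaration: hand the hot loop to a small native DLL
              // that does the planar-to-packed copy (and any MMX/SSE tricks) in C/C++.
              internal static class NativeBlit
              {
                  [DllImport( "PlanarConvert.dll", CallingConvention = CallingConvention.Cdecl )]
                  internal static extern void CopyColorPlanes( IntPtr dst, IntPtr b, IntPtr g, IntPtr r,
                                                               int width, int height, int stride );
              }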

          • J jkohler


            englebart
            #26

             Don't use byte pointers for reading the source colors. Treat the source planes as uint* (or stretch to ulong*) instead of byte*, precalculate a single loop count of width*height/4 (or /8 if you use ulong), and in the loop produce 4 (or 8) pixels at a time:

             begin loop
                 uint tempRed = *_r++;   // ditto for the other colors
                 // I am pretty sure that .NET has strict endian rules, so this should be safe.
                 *imgPtr++ = alpha | ((tempRed & 0x000000FF) << 16) | ((tempGreen & 0x000000FF) << 8)  | ((tempBlue & 0x000000FF));
                 *imgPtr++ = alpha | ((tempRed & 0x0000FF00) << 8)  | ((tempGreen & 0x0000FF00))       | ((tempBlue & 0x0000FF00) >> 8);
                 *imgPtr++ = alpha | ((tempRed & 0x00FF0000))       | ((tempGreen & 0x00FF0000) >> 8)  | ((tempBlue & 0x00FF0000) >> 16);
                 *imgPtr++ = alpha | ((tempRed & 0xFF000000) >> 8)  | ((tempGreen & 0xFF000000) >> 16) | ((tempBlue & 0xFF000000) >> 24);
             end loop

             You will need some after-loop checks (modulus 4 or 8) that perform the same logic for the stragglers:

             switch (leftover_modulus) {
             case 3:
                 *imgPtr++ = alpha | ((tempRed & 0x00FF0000)) | ((tempGreen & 0x00FF0000) >> 8) | ((tempBlue & 0x00FF0000) >> 16);
                 // fall through
             case 2:
                 ...
                 // fall through
             case 1:
                 ...
                 break;
             default:
                 // nothing to do if 0 - it was handled in the loop
             }
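
             Spelled out as compilable C#, the idea might look like the sketch below. This is an illustration under stated assumptions, not code from the thread: the method name is invented, the destination is assumed to be Format32bppArgb (so the stride is exactly width * 4), and the source planes are assumed to be safe to read 4 bytes at a time.

                 // Hedged sketch of the 4-pixels-per-iteration idea: read each plane a uint at a
                 // time and emit four packed 32bpp ARGB pixels per pass.
                 private static unsafe void CopyColorPlanes32( BitmapData bmp, IntPtr _b, IntPtr _g, IntPtr _r )
                 {
                     uint* imgPtr = (uint*)bmp.Scan0;
                     const uint alpha = 0xFF000000u;

                     uint* b = (uint*)_b;
                     uint* g = (uint*)_g;
                     uint* r = (uint*)_r;

                     int pixels = bmp.Width * bmp.Height;
                     int blocks = pixels / 4;
                     for ( int i = 0; i < blocks; i++ ) {
                         uint tb = *b++, tg = *g++, tr = *r++;   // four source bytes per plane

                         // little-endian: byte 0 of each plane belongs to pixel 0, byte 1 to pixel 1, ...
                         *imgPtr++ = alpha | ((tr & 0x000000FFu) << 16) | ((tg & 0x000000FFu) << 8)  | (tb & 0x000000FFu);
                         *imgPtr++ = alpha | ((tr & 0x0000FF00u) << 8)  | (tg & 0x0000FF00u)         | ((tb & 0x0000FF00u) >> 8);
                         *imgPtr++ = alpha | (tr & 0x00FF0000u)         | ((tg & 0x00FF0000u) >> 8)  | ((tb & 0x00FF0000u) >> 16);
                         *imgPtr++ = alpha | ((tr & 0xFF000000u) >> 8)  | ((tg & 0xFF000000u) >> 16) | ((tb & 0xFF000000u) >> 24);
                     }

                     // stragglers: finish the remaining 0..3 pixels one byte at a time
                     byte* bb = (byte*)b, gb = (byte*)g, rb = (byte*)r;
                     for ( int i = blocks * 4; i < pixels; i++ )
                         *imgPtr++ = alpha | ((uint)(*rb++) << 16) | ((uint)(*gb++) << 8) | (uint)(*bb++);
                 }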

            • P patbob


              jkohler
              #27

              Thanks for the thoughts - definitely on the mark. I've already decided to do the C/C++ approach (maybe MMX if time allows) - it's been a few years since I've done any bit bangin' like this. It's actually fun.

              Subvert The Dominant Paradigm

              • M Michael B Hansen

                Although it probably won't give you much, you could try to optimize your for-loop in the following ways:
                - Instead of nesting the 2 for-loops, have just one
                - Count backwards
                - Use != instead of <
                - Use ++n instead of n++

                for( int n = (h*w) - 1; n != -1; --n )
                for( int n = h*w; n != 0; --n )

                Although the compiler/JIT should already have done some of the above optimizations, I have seen measurable improvements from doing this in certain areas. /Michael Edit: Modified the for-loop as you aren't using the 'n' for indexing.

                Need a 0 false positive SPAM filter? Try the free, industry leading spam filters from SPAMfighter

                ely_bob
                #28

                Michael B. Hansen wrote:

                Although the compiler/JIT should already have done some of the above optimizations, I have seen measurable improvements from doing this in certain areas.

                As have I... And although it is ugly, you may want to look at the target environment's architecture and optimize the loop (unroll it manually) so that each write matches the available register size - this should boost speed a lot! I have a related post... not to toot my own horn... but TOOT. :laugh:

                I'd blame it on the Brain farts.. But let's be honest, it really is more like a Methane factory between my ears some days than it is anything else... -"The conversations he was having with himself were becoming ominous."-.. On the radio...
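
                For what it's worth, the single backward-counting loop quoted above would look something like the sketch below when applied to the original method. It is only an illustration, and it assumes there is no per-row padding (stride == width * 3, which holds for a 1600-pixel-wide 24bpp row); otherwise the row loop has to stay:

                    // Hedged sketch: the two nested loops collapsed into one backward-counting loop.
                    byte* imgPtr = (byte*)bmp.Scan0;
                    for ( int n = bmp.Width * bmp.Height; n != 0; --n ) {
                        *imgPtr++ = *b++;
                        *imgPtr++ = *g++;
                        *imgPtr++ = *r++;
                    }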

                • J jkohler


                  JesperMadsen123
                  #29

                  Some ideas off the top of my head... Can you interleave the input data for the function (_b, _g, _r) so they fit the caches better? Maybe you can unroll some of the loop: iterate 12 bytes at a time (4 * 3) and store them as 3 uint* writes... The last w % 4 columns (I think) must be handled by your current loop (between 0 and 3 operations).
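
                  A sketch of what that 12-bytes-at-a-time step could look like, purely as an illustration: it assumes a little-endian target, a width divisible by 4, and 24bpp BGR output with no row padding; the variable names are invented.

                      // Hedged sketch: consume 4 bytes of each plane and emit them as 3 packed uints
                      // (12 bytes = 4 BGR pixels) per iteration.
                      uint* dst = (uint*)bmp.Scan0;
                      uint* bp = (uint*)_b;
                      uint* gp = (uint*)_g;
                      uint* rp = (uint*)_r;

                      int blocks = ( bmp.Width * bmp.Height ) / 4;
                      for ( int i = 0; i < blocks; i++ ) {
                          uint tb = *bp++, tg = *gp++, tr = *rp++;

                          // memory order per pixel is B,G,R - pack pixels 0..3 into three 32-bit writes
                          *dst++ = (tb & 0x000000FFu) | ((tg & 0x000000FFu) << 8) | ((tr & 0x000000FFu) << 16) | ((tb & 0x0000FF00u) << 16);
                          *dst++ = ((tg & 0x0000FF00u) >> 8) | (tr & 0x0000FF00u) | (tb & 0x00FF0000u) | ((tg & 0x00FF0000u) << 8);
                          *dst++ = ((tr & 0x00FF0000u) >> 16) | ((tb & 0xFF000000u) >> 16) | ((tg & 0xFF000000u) >> 8) | (tr & 0xFF000000u);
                      }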

                  • J jkohler


                    starmerak
                    #30

                    The first thing I see is that ( ( s / 3 ) - w ) * 3 is the same as s - w * 3 (at least when the stride is a multiple of 3), although the suggestion to move this calculation outside the loop will save you more...

                    • J jkohler


                      oggenok64
                      #31

                      Looks like an ideal case for some loop unrolling. If, say, bmp.Width is always even, the inner col loop could be rewritten as

                          for ( int col = 0; col < w; col += 2 ) {
                              *imgPtr++ = *b++;
                              *imgPtr++ = *g++;
                              *imgPtr++ = *r++;
                              *imgPtr++ = *b++;
                              *imgPtr++ = *g++;
                              *imgPtr++ = *r++;
                          }

                      saving half the overhead of loop management. C# doesn't allow fall-through in switch statements, otherwise Duff's device would be perfect here. - turin

                      • J jkohler


                        englebart
                        #32

                        Another approach. Some of the other posts got this idea kicking... Have one outer loop based on a block size that tries to optimize read and write cache sizes. By working with a single color at a time you should optimize read hits. A simple starting point of blocksize = 1 should be close to what you have now. You are trading read cache hits for loop overhead; the correct block size might swing this in your favor.

                            loop totalsize/blocksize
                                output pointer = (set based on block size)
                                loop blocksize (red)
                                    read one byte from red pointer   (pointer += 1)
                                    write one byte to output pointer (pointer += 3)
                                endloop blocksize (red)

                                output pointer = (set based on block size) + 1 offset to skip red
                                loop blocksize (green)
                                    read one byte from green pointer (pointer += 1)
                                    write one byte to output pointer (pointer += 3)
                                endloop blocksize (green)

                                output pointer = (set based on block size) + 2 offset to skip red and green
                                loop blocksize (blue)
                                    read one byte from blue pointer  (pointer += 1)
                                    write one byte to output pointer (pointer += 3)
                                endloop blocksize (blue)
                            endloop totalsize/blocksize

                        Variations: have the input pointers start at different offsets (thirds) within the block size and wrap at the end - this will smooth write conflicts on the output pointer. Or apply one thread per color; you would want those threads to be preallocated and dedicated to the blit engine.
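
                        In C#, a minimal sketch of that blocked, one-color-at-a-time copy might look like this. It is only an illustration: the method name and blockSize parameter are invented, and it assumes no per-row padding (stride == width * 3) so the image can be treated as one flat run of pixels.

                            // Hedged sketch: copy the planes one color at a time within cache-friendly blocks.
                            // Destination byte order per pixel is B,G,R, matching the original method.
                            private static unsafe void CopyColorPlanesBlocked( byte* dst, byte* b, byte* g, byte* r,
                                                                                int totalPixels, int blockSize )
                            {
                                for ( int start = 0; start < totalPixels; start += blockSize ) {
                                    int count = Math.Min( blockSize, totalPixels - start );

                                    byte* outB = dst + start * 3;          // blue lane
                                    for ( int i = 0; i < count; i++ ) { *outB = b[start + i]; outB += 3; }

                                    byte* outG = dst + start * 3 + 1;      // green lane
                                    for ( int i = 0; i < count; i++ ) { *outG = g[start + i]; outG += 3; }

                                    byte* outR = dst + start * 3 + 2;      // red lane
                                    for ( int i = 0; i < count; i++ ) { *outR = r[start + i]; outR += 3; }
                                }
                            }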

                        • J jkohler


                          Tim Yen
                          #33

                          Why don't you do a BlockCopy using the Buffer class? [^] [^]
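
                          For reference, Buffer.BlockCopy has this shape - it copies a contiguous run of bytes between arrays of primitives. The arrays below are hypothetical, just to show the call:

                              // Hedged usage sketch of Buffer.BlockCopy(src, srcOffset, dst, dstOffset, count).
                              byte[] redPlane = new byte[1600 * 1200];
                              byte[] scratch  = new byte[1600 * 1200];
                              Buffer.BlockCopy( redPlane, 0, scratch, 0, redPlane.Length );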
