3 way byte merge

    jkohler wrote (#1):

    Hi all, I'm working with a machine vision package that uses a proprietary internal image format. The package will convert to many common formats (JPG/BMP/PNG/etc.), but only to disk, and I want an in-memory conversion because the to-disk conversion is just plain too slow (>200 ms). Its format for 8-bit color images is three separate planes (R, G and B), which combine easily into a 24bpp Bitmap like this:

        // requires: using System; using System.Drawing.Imaging; and compiling with /unsafe
        private static unsafe void CopyColorPlanes( BitmapData bmp, IntPtr _b, IntPtr _g, IntPtr _r )
        {
            byte* imgPtr = (byte*)bmp.Scan0;
    
            //
            // single-tasked version - one row at a time.....
    
            int h = bmp.Height;
            int w = bmp.Width;
            int s = bmp.Stride;
            byte* b = (byte*)_b;
            byte* g = (byte*)_g;
            byte* r = (byte*)_r;
    
            for ( int row = 0; row < h; row++ ) {
                for ( int col = 0; col < w; col++ ) {
                    *imgPtr++ = *b++;   // Format24bppRgb stores each pixel as B, G, R
                    *imgPtr++ = *g++;
                    *imgPtr++ = *r++;
                }
                imgPtr += ( ( s / 3 ) - w ) * 3;  // ensures we're starting the row properly aligned
            }
        }
    

    This works well and is "reasonably" fast: a 1600x1200 color image conversion takes roughly 42 ms on a 1.8 GHz VIA C7 (the target system). Two questions:

    1) Does anyone see anything in the above method that could be tweaked (staying within "pure" C#) to make it faster? (I've already tried partitioning the source planes into halves and quarters to make them fit better in the CPU cache; this has a small impact, but it's not significant. Interestingly, doing columns in the outer loop runs about 10% faster on an AMD dual-core 4200. Go figure.)

    2) Is there some native Windows API that will do this job?

    I know I'll probably end up crafting this in assembly, but I view that as a last resort... and deadlines loom...
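For comparison, here is a minimal, safe C# sketch of the same 24bpp interleave. It assumes the three planes and the destination are plain `byte[]` buffers instead of `IntPtr`/`BitmapData`, and the helper name `InterleaveBgr24` is hypothetical. The row padding is computed once, outside the loops, as `stride - width * 3`. Note that the posted `( ( s / 3 ) - w ) * 3` only equals the real padding when that padding is 0 or 3 bytes. For a 1601-pixel row (4803 bytes of pixels, stride 4804) it evaluates to 0 instead of 1, so each following row would start progressively earlier. At 1600 pixels wide the padding is 0, which is why the 1600x1200 test image converts correctly.

```csharp
using System;

// Hypothetical helper: interleave three planar 8-bit channels into a buffer laid out
// like Format24bppRgb (B, G, R per pixel, each row padded out to `stride` bytes).
static void InterleaveBgr24(byte[] blue, byte[] green, byte[] red,
                            byte[] dst, int width, int height, int stride)
{
    int pad = stride - width * 3;   // row padding, computed once outside the loops
    int src = 0;                    // index into the tightly packed source planes
    int d = 0;                      // index into the destination buffer

    for (int row = 0; row < height; row++)
    {
        for (int col = 0; col < width; col++, src++)
        {
            dst[d++] = blue[src];
            dst[d++] = green[src];
            dst[d++] = red[src];
        }
        d += pad;                   // skip the padding so the next row starts at (row + 1) * stride
    }
}

// Usage: a 5x2 image has 15 bytes of pixels per row; the stride is rounded up to a multiple of 4 (16).
int w = 5, h = 2, rowStride = (w * 3 + 3) & ~3;
var b = new byte[w * h];
var g = new byte[w * h];
var r = new byte[w * h];
for (int i = 0; i < w * h; i++) { b[i] = (byte)i; g[i] = (byte)(100 + i); r[i] = (byte)(200 + i); }

var image = new byte[rowStride * h];
InterleaveBgr24(b, g, r, image, w, h, rowStride);
Console.WriteLine($"padding = {rowStride - w * 3}, posted formula = {((rowStride / 3) - w) * 3}");
// prints: padding = 1, posted formula = 0
```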

    Subvert The Dominant Paradigm

      OriginalGriff wrote (#2):

      Give the compiler a clue and precalculate the alignment adjustment; it may or may not optimise it out of the loop. It would be worth experimenting with a 32bpp bitmap (ARGB) instead of a 24bpp (RGB) one: that way you can assemble each pixel in an Int32 and store it only once. Depending on your cache etc., one 32-bit write may be faster than three 8-bit ones. Rather than assembler, I would go to native C/C++ code first. That will probably be fast enough and a lot more maintainable.
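A small sketch of the "assemble it in an Int32 and store it once" idea, assuming a little-endian CPU (x86, as on both machines in this thread). For `PixelFormat.Format32bppArgb`, each pixel is the 32-bit value 0xAARRGGBB, which lands in memory as the bytes B, G, R, A. `BinaryPrimitives.WriteUInt32LittleEndian` is used here only to make that byte order explicit; the helper name `PackArgb` is hypothetical.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical helper: build one 0xAARRGGBB pixel from the three plane bytes.
static uint PackArgb(byte b, byte g, byte r, byte a = 0xFF) =>
    ((uint)a << 24) | ((uint)r << 16) | ((uint)g << 8) | b;

uint pixel = PackArgb(0x11, 0x22, 0x33);
var bytes = new byte[4];
BinaryPrimitives.WriteUInt32LittleEndian(bytes, pixel);   // one 32-bit store instead of three byte stores

Console.WriteLine($"0x{pixel:X8} -> {BitConverter.ToString(bytes)}");
// prints: 0xFF332211 -> 11-22-33-FF
```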

      Real men don't use instructions. They are only the manufacturer's opinion on how to put the thing together.

      "I have no idea what I did, but I'm taking full credit for it." - ThisOldTony
      "Common sense is so rare these days, it should be classified as a super power" - Random T-shirt

        jkohler wrote (#3):

        Ahhhh... assemble into a single uint and then store. I'll try that. Thanks.

          Ennis Ray Lynch Jr wrote (#4):

          Almost the same code I used: Painless yet unsafe grayscale conversion in C#. That's about as fast as C# will get; C++ is much faster.

          Need custom software developed? I do custom programming based primarily on MS tools with an emphasis on C# development and consulting. I also do Android Programming as I find it a refreshing break from the MS. "And they, since they Were not the one dead, turned to their affairs" -- Robert Frost

            jkohler wrote (#5):

            Ennis Ray Lynch, Jr. wrote:

            About as fast as C# will get. C++ is much faster.

            There's nothing like horsepower to cover for lazy programmers. For giggles, I tried this:

                    // (inside an unsafe method; requires: using System.Threading.Tasks;)
                    Parallel.For( 0, bmp.Height, row => {
                        //
                        // establish pointers into the three source color planes, each based on the row this
                        // task is to process
                        byte* b = (byte*)_b + ( row * bmp.Width );
                        byte* g = (byte*)_g + ( row * bmp.Width );
                        byte* r = (byte*)_r + ( row * bmp.Width );
                        //
                        // calc the starting position in the destination bitmap for the first byte of this row:
                        // every destination row starts Stride bytes after the previous one, padding included
                        byte* dst = (byte*)bmp.Scan0 + ( row * bmp.Stride );
                        //
                        // copy the bytes from the three sources into the bitmap row
                        for ( int col = 0; col < bmp.Width; col++ ) {
                            *dst++ = *b++;
                            *dst++ = *g++;
                            *dst++ = *r++;
                        }
                    } );
            

            On my dual I9 it's pretty fast (well, reallllyyyy fast). Unfortunately that's not the production target.
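A safe sketch of the same row-parallel version, assuming `byte[]` buffers rather than `BitmapData` (the helper name `InterleaveBgr24Parallel` is hypothetical). Each iteration derives both offsets from `row` alone: the source offset is `row * width` and the destination offset is `row * stride`, so no pointer state is shared between iterations and the row padding is always respected. If the production target has only one core, `Parallel.For` would mostly add scheduling overhead there.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper: 24bpp interleave with one row per Parallel.For iteration.
static void InterleaveBgr24Parallel(byte[] blue, byte[] green, byte[] red,
                                    byte[] dst, int width, int height, int stride)
{
    Parallel.For(0, height, row =>
    {
        int src = row * width;    // source planes are tightly packed
        int d = row * stride;     // destination rows start every `stride` bytes
        for (int col = 0; col < width; col++, src++)
        {
            dst[d++] = blue[src];
            dst[d++] = green[src];
            dst[d++] = red[src];
        }
    });
}

int w = 5, h = 4, rowStride = (w * 3 + 3) & ~3;   // 15 bytes of pixels per row, stride 16
var b = new byte[w * h];
var g = new byte[w * h];
var r = new byte[w * h];
for (int i = 0; i < w * h; i++) { b[i] = (byte)i; g[i] = (byte)(100 + i); r[i] = (byte)(200 + i); }

var image = new byte[rowStride * h];
InterleaveBgr24Parallel(b, g, r, image, w, h, rowStride);
Console.WriteLine(BitConverter.ToString(image, 3 * rowStride, 3));   // first pixel of row 3
// prints: 0F-73-D7
```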

              Lost User wrote (#6):

              jkohler wrote:

              Unfortunately that's not the production target.

              What are you targeting, Core2?

                jkohler wrote (#7):

                VIA C7 @ 1.8 GHz. It does pretty well until images get large and no longer fit in the onboard caches...

                  Lost User wrote (#8):

                  I don't know much about the architecture they use..

                    Chris Trelawny Ross wrote (#9):

                    One of the other replies suggested using a 32-bit destination, which means you've probably changed the 'next destination pixel' calculation. But if you're still working with 24-bit destination pixels, moving the ( ( s / 3 ) - w ) * 3 expression outside the loops may make a difference (although the C# compiler or the JIT may already have done this as an optimization). Division is time-consuming, and this whole expression is constant for all loop iterations.

                      jkohler wrote (#10):

                      I just tried the 32bpp image format, which on my development system gives a ~5% improvement. Unfortunately, on the target system (VIA C7) it makes little measurable difference. I expect any gain in execution speed is consumed by the increase in bitmap size (adding a byte per pixel to a 1600x1200 image is a significant increase in terms of CPU cache, etc.). Moving the loop invariant outside the loop does make a small difference when running in the debugger, but in release code? No difference at all. Interestingly, the 32bpp format doesn't require that little calculation at all, since its rows have no padding. In case you're interested, here is the 32bpp version:

                          // requires: using System; using System.Drawing.Imaging; and compiling with /unsafe
                          private static unsafe void CopyColorPlanes32( BitmapData bmp, IntPtr _b, IntPtr _g, IntPtr _r )
                          {
                              int* imgPtr = (int*)bmp.Scan0;
                      
                              int h = bmp.Height;                  // not const: these are runtime values
                              int w = bmp.Width;
                              const int alphaValue = 0xff << 24;   // opaque
                      
                              byte* b = (byte*)_b;
                              byte* g = (byte*)_g;
                              byte* r = (byte*)_r;
                      
                              // 32bpp rows have no padding (Stride == Width * 4), so pixels are written back to back
                              for ( int row = 0; row < h; row++ )
                                  for ( int col = 0; col < w; col++ )
                                      *imgPtr++ = alphaValue | *b++ | ( *g++ << 8 ) | ( *r++ << 16 );
                          }
                      

                        Lost User wrote (#11):

                        It isn't really a division. It's a division by a constant k, which any sane compiler (even the .NET JIT compiler, though its sanity is debatable) turns into a multiplication by approximately 0x100000000 / k (possibly with a few extra instructions for correct signed rounding) or some other constant, depending on the data size.

                          Chris Trelawny Ross wrote (#12):

                          I hope it also does some scaling: 0x100000000 is a mighty big factor to introduce without the matching 32 bit right shift. :doh:

                            Lost User wrote (#13):

                            You don't need to; you can just take the upper half (edx).
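A sketch of the multiply-instead-of-divide trick discussed in these replies, for unsigned 32-bit division by 3. A 32-bit constant near 0x100000000 / 3 is not quite precise enough for every unsigned input, so the constant used here is ceil(2^33 / 3) = 0xAAAAAAAB. Keeping the upper 32 bits of the 64-bit product (the EDX half of MUL's EDX:EAX result) supplies 32 bits of the shift, and one extra `>> 1` supplies the rest. This is the standard magic-number technique; the exact instruction sequence a given JIT emits for `x / 3` may differ, and the helper name `DivideBy3` is hypothetical.

```csharp
using System;

// Unsigned 32-bit division by the constant 3 without a DIV instruction:
// multiply by ceil(2^33 / 3) = 0xAAAAAAAB, keep the upper 32 bits of the
// 64-bit product (what x86 MUL leaves in EDX), then shift right once more.
static uint DivideBy3(uint x)
{
    ulong product = (ulong)x * 0xAAAAAAABUL;  // full 64-bit product, i.e. EDX:EAX
    uint high = (uint)(product >> 32);        // upper half, i.e. EDX
    return high >> 1;                         // 32 + 1 = 33 bits of shift in total
}

uint[] samples = { 0, 1, 2, 3, 4800, 4804, 1_000_000_007, uint.MaxValue };
foreach (uint x in samples)
    Console.WriteLine($"{x} / 3 = {DivideBy3(x)} (DIV gives {x / 3})");
```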

                              Chris Trelawny Ross wrote (#14):

                              You got me there. :sigh: The last time I really looked at CPUs, registers and assembly language was a long time ago, in a galaxy far, far away (the Motorola 68k family, to be exact). And I never did get to know the Intel CPUs. By now I've been working at the C/C++/C# level for far too long, and I've clearly gone soft in the head: I didn't put 2 & 2 together ("64-bit CPUs" => 64-bit registers! :doh: )

                                Lost User wrote (#15):

                                There is that funny 64-bit thing... but x86 has always had a double-width mul :) (although it used to give almost no advantage compared to div). You can even multiply two 64-bit numbers and get a 128-bit result (in rdx:rax).
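The 64-bit counterpart, as a sketch assuming .NET 5 or later: `Math.BigMul(ulong, ulong, out ulong)` returns the upper 64 bits of the 128-bit product (the RDX half of RDX:RAX) and passes the lower 64 bits out through the `out` parameter. The magic constant becomes ceil(2^65 / 3) = 0xAAAAAAAAAAAAAAAB, again followed by one extra shift. As before, `DivideBy3` is a hypothetical name.

```csharp
using System;

// 64-bit version of the same trick: Math.BigMul gives the high half of the
// 128-bit product (RDX); the low half (RAX) is discarded because the quotient doesn't need it.
static ulong DivideBy3(ulong x)
{
    ulong high = Math.BigMul(x, 0xAAAAAAAAAAAAAAABUL, out _);  // RDX of RDX:RAX
    return high >> 1;                                          // 64 + 1 = 65 bits of shift in total
}

ulong[] samples = { 0, 2, 3, 1UL << 40, ulong.MaxValue };
foreach (ulong x in samples)
    Console.WriteLine($"{x} / 3 = {DivideBy3(x)} (DIV gives {x / 3})");
```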

                                  Chris Trelawny Ross wrote (#16):

                                  That pesky Intel chip. Always doing things by halves! ;P

                                    JonHarrison wrote (#17):

                                    Have you thought of declaring a union of an int and four chars for imgPtr, and letting the compiler decide how to do the bit shifts when you assign r, g and b? Also, decrementing loops are faster, since they only require a compare with 0 and a loop instruction, rather than loading two values, comparing and then looping.

                                      Michael B Hansen wrote (#18):

                                      Although it probably won't give you much, you could try to optimize your for-loop in the following ways:
                                      - Instead of nesting the two for-loops, try having just one
                                      - Count backwards
                                      - Use != instead of <
                                      - Use ++n instead of n++

                                          for( int n=(h*w)-1; n!=-1; --n)
                                          for( int n=h*w; n!=0; --n)

                                      Although the compiler/JIT should already have done some of the above optimizations, I have seen measurable improvements from doing them in certain areas.
                                      /Michael

                                      Edit: Modified the for-loop, as you aren't using 'n' for indexing.
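A safe sketch of the collapsed, count-down loop applied to the 32bpp case. Because 32bpp rows have no padding (stride = 4 × width), the whole image is one contiguous run of width × height pixels, so a single loop suffices; it uses the `n != -1` form of the first loop above, since here `n` does index the arrays. It assumes byte-array planes and a `uint[]` destination, and `InterleaveArgb32` is a hypothetical name. Whether this beats the nested loop once the JIT has optimized both is something to measure on the C7 rather than assume.

```csharp
using System;

// Hypothetical helper: interleave planar B, G, R bytes into 0xAARRGGBB pixels.
// With no row padding the image is one contiguous run, so one loop covers it.
static void InterleaveArgb32(byte[] blue, byte[] green, byte[] red, uint[] dst)
{
    const uint opaque = 0xFF000000;                 // alpha = 255
    for (int n = dst.Length - 1; n != -1; --n)      // single loop, counting down, compared with !=
        dst[n] = opaque | ((uint)red[n] << 16) | ((uint)green[n] << 8) | blue[n];
}

int pixels = 4 * 3;   // e.g. a 4x3 image
var b = new byte[pixels];
var g = new byte[pixels];
var r = new byte[pixels];
for (int i = 0; i < pixels; i++) { b[i] = (byte)i; g[i] = (byte)(2 * i); r[i] = (byte)(3 * i); }

var image = new uint[pixels];
InterleaveArgb32(b, g, r, image);
Console.WriteLine($"0x{image[5]:X8}");
// prints: 0xFF0F0A05
```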

                                      Need a 0 false positive SPAM filter? Try the free, industry leading spam filters from SPAMfighter

                                      E 1 Reply Last reply
                                      0
                                        philpalk
                                        wrote on last edited by
                                        #19

                                        I faced the exact same issue a little while ago (targeting a different processor, though). IIRC, the big surprise for me was a significant performance gain from swapping the for loop you have for the while loop below (it must be friendlier to the C# optimizer). Something else you might try is to manually unroll your loop to do 4 pixels at a time in the inner loop and read your source image channels 32 bits at a time. If the performance of this step is that critical, this is definitely something where you'd benefit from dropping down to native code (and if the p/invoke overhead proves significant, you can implement it in a mixed-mode assembly). Usually in machine vision, though, converting to a packed byte format is only done as a last step for displaying/storing the results; processing is usually done in planar formats for better performance (which I really wouldn't call 'proprietary' either, btw).

                                            private static unsafe void PlanarToPackedByteRgb32(
                                                int width, int height,
                                                IntPtr rSrc, IntPtr gSrc, IntPtr bSrc,
                                                IntPtr dest, int stride)
                                            {
                                                var rSrcPtr = (byte*)rSrc.ToPointer();
                                                var gSrcPtr = (byte*)gSrc.ToPointer();
                                                var bSrcPtr = (byte*)bSrc.ToPointer();
                                                var destPtr = (byte*)dest.ToPointer();
                                                var destEndPtr = destPtr + stride * height;  // one past the last row
                                                var rowStep = 4 * width;                     // pixel bytes per row (4 bytes per pixel)

                                                while (destPtr != destEndPtr)
                                                {
                                                    var it = (uint*)destPtr;
                                                    var end = (uint*)(destPtr + rowStep);
                                                    destPtr += stride;                       // start of next row, skipping any padding

                                                    while (it != end)
                                                    {
                                                        // One pixel as 0x00RRGGBB (B, G, R, unused byte in memory on little-endian)
                                                        *it++ =
                                                            ((uint)(*rSrcPtr++) << 16) |
                                                            ((uint)(*gSrcPtr++) << 8) |
                                                            ((uint)(*bSrcPtr++) << 0);
                                                    }
                                                }
                                            }
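The unrolling idea from this post can be sketched for the 24bpp layout the original question targets: each pass reads one 32-bit word from each plane (4 pixels' worth of one channel) and writes three 32-bit words, i.e. 12 bytes of B,G,R data. This is an illustrative sketch, not philpalk's code. The name `PlanarToPacked24` is hypothetical; it assumes a little-endian CPU (x86/x64), that unaligned 32-bit reads are acceptable, and that the source planes have no row padding (as in the original CopyColorPlanes). It must be compiled with /unsafe. Whether it actually beats the simple byte loop on a given CPU has to be measured.

```csharp
using System;

static class PlanarMergeUnrolled
{
    // Hypothetical sketch: 4 pixels per iteration, one 32-bit read per plane,
    // producing 24bpp B,G,R output (three uints per 4 pixels).
    // Assumes little-endian, so byte 0 of a plane lands in the low bits of the uint,
    // and tightly packed source planes (width bytes per row).
    public static unsafe void PlanarToPacked24(
        int width, int height,
        IntPtr rSrc, IntPtr gSrc, IntPtr bSrc,
        IntPtr dest, int stride)
    {
        if (!BitConverter.IsLittleEndian)
            throw new PlatformNotSupportedException("The bit packing below assumes little-endian.");

        byte* r = (byte*)rSrc;
        byte* g = (byte*)gSrc;
        byte* b = (byte*)bSrc;
        int quads = width / 4;   // groups of 4 pixels handled by the fast path
        int tail = width % 4;    // leftover pixels handled one at a time

        for (int row = 0; row < height; row++)
        {
            // Recompute the row start from stride so any row padding is skipped.
            uint* d = (uint*)((byte*)dest + row * stride);

            for (int q = 0; q < quads; q++)
            {
                // One (possibly unaligned) 32-bit read per plane = 4 pixels of one channel.
                uint bv = *(uint*)b, gv = *(uint*)g, rv = *(uint*)r;
                b += 4; g += 4; r += 4;

                // Output bytes: B0 G0 R0 B1 | G1 R1 B2 G2 | R2 B3 G3 R3
                d[0] = (bv & 0xFFu) | ((gv & 0xFFu) << 8) | ((rv & 0xFFu) << 16) | ((bv & 0xFF00u) << 16);
                d[1] = ((gv >> 8) & 0xFFu) | (rv & 0xFF00u) | (bv & 0xFF0000u) | ((gv & 0xFF0000u) << 8);
                d[2] = ((rv >> 16) & 0xFFu) | ((bv >> 16) & 0xFF00u) | ((gv >> 8) & 0xFF0000u) | (rv & 0xFF000000u);
                d += 3;
            }

            // Remaining 0-3 pixels of the row, byte by byte.
            byte* p = (byte*)d;
            for (int t = 0; t < tail; t++)
            {
                *p++ = *b++;
                *p++ = *g++;
                *p++ = *r++;
            }
        }
    }
}
```

Because the leftover pixels go through a plain byte loop, any width works; only the bulk of each row uses the packed 32-bit writes.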

                                          NL PUR
                                          wrote on last edited by
                                          #20

                                          I wonder how fast your function is when you don't do anything within your 'for loops'
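A minimal way to act on this suggestion is to time a loop of the same shape with a trivial body next to the real copy, using System.Diagnostics.Stopwatch. The names `LoopTiming`, `EmptyLoops` and `BestOfMs` are hypothetical. The JIT is free to simplify a loop whose body does little, so the baseline returns a counter, and its timing should be read as a rough lower bound on loop overhead rather than an exact figure.

```csharp
using System;
using System.Diagnostics;

static class LoopTiming
{
    // Baseline with the same nested-loop shape as CopyColorPlanes but a trivial body.
    // Returning the counter gives the loop an observable result; the JIT may still
    // simplify it, so treat the measured time as a rough lower bound.
    public static long EmptyLoops(int width, int height)
    {
        long count = 0;
        for (int row = 0; row < height; row++)
            for (int col = 0; col < width; col++)
                count++;
        return count;
    }

    // Runs 'action' several times and returns the fastest run in milliseconds,
    // which filters out one-off costs such as JIT compilation on the first call.
    public static double BestOfMs(int runs, Action action)
    {
        double best = double.MaxValue;
        for (int i = 0; i < runs; i++)
        {
            var sw = Stopwatch.StartNew();
            action();
            sw.Stop();
            best = Math.Min(best, sw.Elapsed.TotalMilliseconds);
        }
        return best;
    }
}
```

For example, `LoopTiming.BestOfMs(10, () => LoopTiming.EmptyLoops(1600, 1200))` gives the baseline; comparing it with the same measurement of the real copy (in a Release build run outside the debugger) gives a rough idea of how much of the 42 ms is memory traffic rather than loop bookkeeping.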
