3 way byte merge
-
VIA C7 @ 1.8 GHz. It does pretty well until images get large and no longer fit in the onboard caches.
Subvert The Dominant Paradigm
-
Hi all, I'm working with a machine vision package which uses a proprietary internal image format. Their package will convert to many common formats (jpg/bmp/png/etc.), but this conversion is only done to disk, and I want to do an in-memory conversion because the to-disk conversion is just plain too slow (>200ms). Their format for 8-bit color images is 3 separate planes (R, G and B), which easily combine into a 24bpp Bitmap like this:

    private static unsafe void CopyColorPlanes( BitmapData bmp, IntPtr _b, IntPtr _g, IntPtr _r )
    {
        byte* imgPtr = (byte*)bmp.Scan0;

        // single tasked version - one row at a time.....
        int h = bmp.Height;
        int w = bmp.Width;
        int s = bmp.Stride;
        byte* b = (byte*)_b;
        byte* g = (byte*)_g;
        byte* r = (byte*)_r;

        for ( int row = 0; row < h; row++ )
        {
            for ( int col = 0; col < w; col++ )
            {
                *imgPtr++ = *b++;
                *imgPtr++ = *g++;
                *imgPtr++ = *r++;
            }
            // ensures we're starting the row properly aligned
            // (exact only when the stride is a multiple of 3; s - 3 * w is the general form)
            imgPtr += ( ( s / 3 ) - w ) * 3;
        }
    }

This works well and is "reasonably" fast - a 1600x1200 color image conversion takes roughly 42ms on a 1.8GHz VIA C7 (the target system). Two questions:
1) Does anyone see anything in the above method that could be tweaked (staying within "pure" C#) to make it faster? (I've already tried partitioning the source planes into halves and quarters to make them fit better in the CPU cache, and while this has a small impact it's not significant. Interestingly, doing columns in the outer loop runs about 10% faster on an AMD DualCore 4200 - go figger.)
2) Is there some native Windows API that will do this job? I know I'll probably end up crafting this in assembly, but I view that as a last resort... and deadlines loom...
Subvert The Dominant Paradigm
One of the other replies suggested using a 32 bit destination - which means you've probably changed the 'next destination pixel' calculation - but if you're still working with 24 bit destination pixels, moving the ( ( s / 3 ) - w ) * 3 expression outside the loops may make a difference (although the C# compiler may have already done this in an optimization step). Division is time-consuming - and this whole expression is constant for all loop iterations.
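As a worked check of that expression: GDI+ rounds each bitmap row up to a multiple of 4 bytes, so the padding to skip after a 24bpp row is `stride - 3 * width`. The sketch below compares that with `( ( s / 3 ) - w ) * 3`. The helper names `Stride24`, `PaddingExact` and `PaddingFromPost` are made up for this illustration and are not part of the thread's code.

```csharp
using System;

static class StridePadding
{
    // Bytes per 24bpp row, rounded up to a 4-byte boundary (assumed GDI+ behaviour).
    public static int Stride24(int width) => (width * 3 + 3) & ~3;

    // Exact number of padding bytes at the end of each row.
    public static int PaddingExact(int width) => Stride24(width) - 3 * width;

    // The expression from the original post; the integer division truncates.
    public static int PaddingFromPost(int width)
    {
        int s = Stride24(width);
        return ((s / 3) - width) * 3;
    }

    static void Main()
    {
        foreach (int w in new[] { 1600, 1601, 1602, 1603 })
            Console.WriteLine($"w={w} stride={Stride24(w)} exact={PaddingExact(w)} post={PaddingFromPost(w)}");
    }
}
```

For the 1600-pixel-wide images in this thread both forms give 0, which is why the posted code works. For a width such as 1601 the two differ, so hoisting `s - 3 * w` out of the loop is probably the safer form.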
-
I just tried the 32bpp image format, which on my development system gives a ~5% improvement. Unfortunately, on the target system (VIA C7) it makes little measurable difference. I expect any gain in execution speed is consumed by the increase in bitmap size (adding an additional byte per pixel to a 1600x1200 image is a significant increase in terms of CPU cache, etc.). Moving the loop invariant outside the loop does indeed make a small difference when running in the debugger, but in release code? No difference at all. And interestingly, the 32bpp format does not require that little calculation to be done at all. In case you're interested, the 32bpp image version:

    private static unsafe void CopyColorPlanes32( BitmapData bmp, IntPtr _b, IntPtr _g, IntPtr _r )
    {
        int* imgPtr = (int*)bmp.Scan0;
        int h = bmp.Height; // not const: only known at run time
        int w = bmp.Width;
        const int alphaValue = 0xff << 24; // opaque
        byte* b = (byte*)_b;
        byte* g = (byte*)_g;
        byte* r = (byte*)_r;

        for ( int row = 0; row < h; row++ )
            for ( int col = 0; col < w; col++ )
                *imgPtr++ = alphaValue | *b++ | ( *g++ << 8 ) | ( *r++ << 16 );
    }
Subvert The Dominant Paradigm
-
It isn't a real division. It's a division by a constant k, which any sane compiler (even the .NET JIT compiler, though its sanity is debatable) turns into a multiplication by approximately 0x100000000 / k (and possibly a few extra instructions for correct signed rounding) or some other constant, depending on the data size.
-
I hope it also does some scaling: 0x100000000 is a mighty big factor to introduce without the matching 32 bit right shift. :doh:
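The multiply-then-shift idea from the last two posts can be sketched for unsigned division by 3. This is only an illustration of the technique, not a claim about what the .NET JIT emits. The class name `ReciprocalDivide` and the choice of a 33-bit shift (one extra bit of precision over 0x100000000 / k) are assumptions made for this example.

```csharp
using System;

static class ReciprocalDivide
{
    // ceil(2^33 / 3): multiplying by this and shifting right by 33 divides by 3.
    const ulong Magic = 0xAAAAAAABUL;

    // Unsigned 32-bit division by the constant 3 without a divide instruction.
    public static uint DivideBy3(uint x) => (uint)((x * Magic) >> 33);

    static void Main()
    {
        uint[] samples = { 0, 1, 2, 3, 4799, 4800, 4804, uint.MaxValue };
        foreach (uint x in samples)
            Console.WriteLine($"{x} / 3 = {DivideBy3(x)} (expected {x / 3})");
    }
}
```

The product is computed in 64 bits, and the right shift is the "scaling" step the previous post asks about.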
-
You got me there. :sigh: The last time I really looked at CPUs, registers and assembly language was a long time ago, in a galaxy far, far away (the Motorola 68k family, to be exact). And I never did get to know the Intel CPUs. By now I've been working at the C/C++/C# level for far too long and I've clearly got soft in the head - and didn't put 2 & 2 together ("64 bit CPUs" => 64 bit registers! :doh: )
-
That pesky Intel chip. Always doing things by halves! ;P
-
Although it probably won't give you much, you could try to optimize your for-loop in the following ways:
- Instead of nesting the two for-loops, use just one
- Count backwards
- Use != instead of <
- Use ++n instead of n++

for( int n=(h*w)-1; n!=-1; --n)
for( int n=h*w; n!=0; --n)

Although the compiler/JIT should already have done some of the above optimizations, I have seen measurable improvements from doing them by hand in certain areas.
/Michael
Edit: Modified the for-loop as you aren't using the 'n' for indexing.
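A minimal sketch of the single countdown loop applied to the 32bpp case. It uses managed arrays rather than the thread's unsafe pointers so that it compiles without the unsafe switch. The class name `FlatLoop` and the array-based signature are assumptions for illustration. Flattening the two loops is only valid when the destination rows have no padding, as with 32bpp.

```csharp
using System;

static class FlatLoop
{
    // Packs three 8-bit planes into 32bpp pixels (0xAARRGGBB) with one
    // countdown loop. Valid only when the destination has no row padding.
    public static void Pack(byte[] r, byte[] g, byte[] b, uint[] dest)
    {
        const uint alpha = 0xFF000000; // opaque
        for (int n = dest.Length - 1; n != -1; --n)
            dest[n] = alpha | ((uint)r[n] << 16) | ((uint)g[n] << 8) | b[n];
    }

    static void Main()
    {
        byte[] r = { 1, 2 }, g = { 3, 4 }, b = { 5, 6 };
        var dest = new uint[2];
        Pack(r, g, b, dest);
        Console.WriteLine($"{dest[0]:X8} {dest[1]:X8}"); // prints "FF010305 FF020406"
    }
}
```

With managed arrays the JIT may add bounds checks that a pointer loop avoids, so whether counting down helps is something to measure on the target machine rather than assume.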
-
Have you thought of declaring a union of an int and four chars for imgPtr and allowing the compiler to effectively decide how to do the bit shifts when you assign r, g and b? Also, decrementing loops are faster since they simply require a compare with 0 and a loop instruction, rather than loading two values, comparing and then looping.
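C# has no `union` keyword, but an explicit struct layout gives the same overlapping storage. The sketch below is one way the suggestion could look. The names `Pixel32` and `UnionDemo` are invented for this example, and it uses `byte` fields because a C# `char` is 16 bits wide. The byte order shown assumes a little-endian CPU such as x86.

```csharp
using System;
using System.Runtime.InteropServices;

// One 32-bit pixel whose bytes can also be addressed individually.
[StructLayout(LayoutKind.Explicit)]
struct Pixel32
{
    [FieldOffset(0)] public uint Value; // whole pixel
    [FieldOffset(0)] public byte B;     // lowest address
    [FieldOffset(1)] public byte G;
    [FieldOffset(2)] public byte R;
    [FieldOffset(3)] public byte A;
}

static class UnionDemo
{
    // Builds an opaque pixel by writing the four bytes, with no shifts in source code.
    public static uint Pack(byte r, byte g, byte b)
    {
        var p = new Pixel32();
        p.B = b; p.G = g; p.R = r; p.A = 0xFF;
        return p.Value;
    }

    static void Main()
    {
        Console.WriteLine($"{Pack(1, 3, 5):X8} (little-endian: {BitConverter.IsLittleEndian})");
    }
}
```

Whether four byte stores beat one shifted 32-bit store depends on the CPU and the JIT, so this is a variant to time against the shift-and-OR version, not a guaranteed win.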
-
I faced the exact same issue a little while ago (targeting a different processor though). IIRC the big surprise for me was that I gained a significant performance increase by swapping from the for loop you have to the while loop below (it must be friendlier to the C# optimizer). Something else you might try is to manually unroll your loop to do 4 pixels at a time in the inner loop and read your source image channels 32 bits at a time. This is definitely something where you'd benefit from dropping down to native code if the performance of this step is that critical (and if the P/Invoke overhead proves to be significant, you can implement it using a mixed-mode assembly). Usually with machine vision, though, converting to a packed byte format is only done as a last step for displaying/storing the results; processing is usually done in planar formats (which I really wouldn't call 'proprietary' either, btw) for better performance.
    private static unsafe void PlanarToPackedByteRgb32(
        int width, int height,
        IntPtr rSrc, IntPtr gSrc, IntPtr bSrc,
        IntPtr dest, int stride)
    {
        var rSrcPtr = (byte*)rSrc.ToPointer();
        var gSrcPtr = (byte*)gSrc.ToPointer();
        var bSrcPtr = (byte*)bSrc.ToPointer();
        var destPtr = (byte*)dest.ToPointer();
        var destEndPtr = destPtr + stride * height;
        var rowStep = 4 * width;

        while (destPtr != destEndPtr)
        {
            var it = (uint*)destPtr;
            var end = (uint*)(destPtr + rowStep);
            destPtr += stride;
            while (it != end)
            {
                *it++ = ((uint)(*rSrcPtr++) << 16)
                      | ((uint)(*gSrcPtr++) << 8)
                      | ((uint)(*bSrcPtr++) << 0);
            }
        }
    }
-
A union.... Apparently I've forgotten the basics. Thanks for the reminder - even if it doesn't shave any cycles it'll definitely have a higher cool factor. As to reversing the loop - your point is taken and appreciated.
Subvert The Dominant Paradigm
Write the loop as suggested by Michael. The processor might even have its own instructions for copying a byte from one register into a different byte in another (or even for combining multiple byte copies into a single instruction). Look at the assembler for the shift-and-OR solution and for the union, and see which is more efficient. I'm interested to hear how much faster it becomes.
-
To all, thank you one and all for your responses. Of all the suggestions, the one with the largest impact on speed is going to the 32bpp format. This reduced the conversion time from ~50ms to ~42ms, but the extra 1.9MB required for each image is not (in my particular case) a good trade-off. The other suggestions resulted in 1ms or maybe 2ms improvements, with no single technique showing a clear improvement. This project involves inspecting components in trays - there may be up to 4 images per component with (so far) a max of 52 components per tray. All these images need to be available to the operator at "a touch of the screen". With this many images (each is 1600x1200) I really just need to dust off the ol' C/ASM skills and convert to a 16bpp format - gaining 2MB per image in the process. Again, thanks for the suggestions.
Subvert The Dominant Paradigm
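The memory arithmetic behind that decision can be sketched as follows, together with one possible 16bpp packing. The post does not say which 16bpp layout is intended, so the 5-6-5 bit split (RGB565) is an assumption here, as are the names `Footprint`, `ImageBytes` and `Pack565`. Row padding is ignored.

```csharp
using System;

static class Footprint
{
    // Bytes for one uncompressed image at the given bits per pixel (no row padding).
    public static long ImageBytes(int width, int height, int bitsPerPixel)
        => (long)width * height * bitsPerPixel / 8;

    // Packs 8-bit channels into 16 bits as 5 red, 6 green, 5 blue bits (assumed RGB565).
    public static ushort Pack565(byte r, byte g, byte b)
        => (ushort)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));

    static void Main()
    {
        const int images = 4 * 52; // images per component * components per tray
        foreach (int bpp in new[] { 16, 24, 32 })
        {
            long one = ImageBytes(1600, 1200, bpp);
            Console.WriteLine($"{bpp}bpp: {one:N0} bytes per image, {one * images:N0} bytes per tray");
        }
        Console.WriteLine($"white -> 0x{Pack565(255, 255, 255):X4}");
    }
}
```

Each step of 8 bits per pixel is 1,920,000 bytes for a 1600x1200 image, which matches the "1.9MB" and "2MB" figures in the post. Across a full tray of 208 images the difference between 32bpp and 16bpp is roughly 800 MB.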
-
Buy a fast memory adapter that pretends to be a disk. Problem solved without any coding.
-
Loop unrolling in the form of parallel loop execution would help if it were C/C++. You'd need to use array indexing into the rows rather than incrementing pointers. Reading the data in large, data-cache-line-sized chunks would also be a win in C/C++. The instructions to mask and shift can fit in the instruction cache, and they will absolutely scream through the data once it is no longer RAM-limited. Maybe the MMX registers can be applied here? Writing the output data as uncached, data-cache-line-sized chunks also helps in C/C++ on some CPU architectures. Not writing back through the CPU data cache helps keep the input data in it (= fewer misses).

Unfortunately, to get to this level of tuning the code, you need to be generating native instructions where you have some chance of being able to predict what the CPU and its caches will be chewing on at any given moment. Running under the .NET runtime you're probably SOL, since you have neither any idea nor control of what native code is getting executed. BTW, that operation on a 2MP image takes no more than a couple of milliseconds in our C++ code... and that's without applying all the fancy tricks I mentioned above. Your best bet is probably to just bite the bullet and thunk through to a native language for the high performance work to get the ~20X speed improvement -- that's what I'd do unless I was in a mood to learn something about what kinds of performance C# can be made to do.
patbob
-
Don't use byte pointers for reading the source colors. Use uint* (or stretch to ulong*) instead of byte*. Precalculate the single loop count: width*height/4, or /8 if you use ulong. In the loop, handle 4 or 8 pixels at a time:

    // _r, _g, _b are uint* into the source planes; imgPtr is uint* into the
    // 32bpp destination; alpha is 0xFF000000u.
    // This byte order assumes a little-endian CPU (true for x86);
    // .NET uses the platform's native endianness.
    for ( int n = ( width * height ) / 4; n != 0; --n )
    {
        uint tempRed = *_r++;
        uint tempGreen = *_g++;
        uint tempBlue = *_b++;
        *imgPtr++ = alpha | ((tempRed & 0x000000FF) << 16) | ((tempGreen & 0x000000FF) << 8) | (tempBlue & 0x000000FF);
        *imgPtr++ = alpha | ((tempRed & 0x0000FF00) << 8) | (tempGreen & 0x0000FF00) | ((tempBlue & 0x0000FF00) >> 8);
        *imgPtr++ = alpha | (tempRed & 0x00FF0000) | ((tempGreen & 0x00FF0000) >> 8) | ((tempBlue & 0x00FF0000) >> 16);
        *imgPtr++ = alpha | ((tempRed & 0xFF000000) >> 8) | ((tempGreen & 0xFF000000) >> 16) | ((tempBlue & 0xFF000000) >> 24);
    }

You will need an after-loop pass that applies the same logic to the stragglers (pixel count modulo 4 or 8). C# does not allow switch cases to fall through, so a short byte-wise loop is the simplest way to do it:

    // nothing to do if the remainder is 0, since the main loop handled everything
    byte* rb = (byte*)_r, gb = (byte*)_g, bb = (byte*)_b;
    for ( int n = ( width * height ) % 4; n != 0; --n )
        *imgPtr++ = alpha | ((uint)*rb++ << 16) | ((uint)*gb++ << 8) | *bb++;
-