Talk about demoralizing
-
I knew nothing of dithering, so I just read about it. Interesting. Eventually you'll be a numerical analysis weenie. :-D It also means that what I was going to ask, namely whether you could cache frequently used results, seems to make no sense.
Robust Services Core | Software Techniques for Lemmings | Articles
The fox knows many things, but the hedgehog knows one big thing.
Yeah. My current plan is to stick with nearest matching color only for color displays. I'm going to try dithering once again on black and white since those algorithms run much faster. Edit: Black and white dithering is fast fast fast, so I've added support for it to b&w e-paper displays. I don't do it for monochrome displays yet, and I'm not sure I will since they can update in real time and dithering interferes with that, but since you can turn it off maybe I'll add support for it. I only do nearest color matching for color e-ink displays. The color dithering, as I said, was just too expensive.
Real programmers use butterflies
-
I spent all day and night working on dithering routines. I finally got it all compiling, and it worked the first time it ran. And it only took 5 minutes to render a frame. :(( :(( :(( So I replaced it with an algorithm that's supposed to be much faster. I got it down to 3 minutes. I can't use this code that I ran myself ragged developing. *headdesk*
Real programmers use butterflies
Well.. look at it this way, the code you took 1 day to write is almost as good as this PhD code the guy spent a few years perfecting! :O :)
A new .NET Serializer All in one Menu-Ribbon Bar Taking over the world since 1371!
-
I spent all day and night working on dithering routines. I finally got it all compiling, and it worked the first time it ran. And it only took 5 minutes to render a frame. :(( :(( :(( So I replaced it with an algorithm that's supposed to be much faster. I got it down to 3 minutes. I can't use this code that I ran myself ragged developing. *headdesk*
Real programmers use butterflies
I guess you do a bit of rounding then? If you do, it may be possible to improve that with a lot of bit-fiddling … On x64 I've seen significant performance improvements, between 20% and 300%, for the following: isnan, isinf, signbit, frexp, min, max, trunc, round, clamp and lerp. I have no idea how well this will work out for an ARM CPU, but here is the core of my implementation (sorry about the formatting, paste and encode as HTML doesn't work well for C++ code anymore :(():
template <typename T>
struct FractionWidth;
template <>
struct FractionWidth<float>
{
static constexpr UInt32 value = 23;
};
template <>
struct FractionWidth<double>
{
static constexpr UInt32 value = 52;
};
template <typename T>
struct ExponentWidth;
template <>
struct ExponentWidth<float>
{
static constexpr UInt32 value = 8;
};
template <>
struct ExponentWidth<double>
{
static constexpr UInt32 value = 11;
};
template <typename T>
struct ExponenBias;
template <>
struct ExponenBias<float>
{
static constexpr UInt32 value = _FBIAS;
};
template <>
struct ExponenBias<double>
{
static constexpr UInt32 value = _DBIAS;
};
template <typename T>
struct InfinityUnsignedValue;
template <>
struct InfinityUnsignedValue<float>
{
static constexpr UInt32 value = 0X7F800000UL;
};
template <>
struct InfinityUnsignedValue<double>
{
static constexpr UInt64 value = 0x7FF0000000000000ULL;
};
template <typename T>
struct NegativeInfinityUnsignedValue;
template <>
struct NegativeInfinityUnsignedValue<float>
{
static constexpr UInt32 value = 0xFF800000UL;
};
template <>
struct NegativeInfinityUnsignedValue<double>
{
static constexpr UInt64 value = 0xFFF0000000000000ULL;
};
template <typename T>
struct QuietNaNUnsignedValue;
template <>
struct QuietNaNUnsignedValue<float>
{
static constexpr UInt32 value = 0XFFC00001UL;
};
template <>
struct QuietNaNUnsignedValue<double>
{
static constexpr UInt64 value = 0x7FF0000000000001ULL;
};
#pragma pack(push,1)
template <typename T>
struct FloatingPoint
{
using ValueType = std::remove_cvref_t<T>;
using UIntType = MakeUnsigned<ValueType>;
static constexpr Int32 FractionWidth = static_cast<Int32>( Internal::FractionWidth<ValueType>::value );
static constexpr Int32 ExponentWidth = static_cast<Int32>( Internal::ExponentWidth<ValueType>::value );
static constexpr Int32 ExponentBias = ( 1 << ( ExponentWidth - 1 ) ) - 1; st
-
I guess you do a bit of rounding then? If you do, it may be possible to improve that with a lot of bit-fiddling … On x64 I've seen significant performance improvements, between 20% and 300%, for the following: isnan, isinf, signbit, frexp, min, max, trunc, round, clamp and lerp. I have no idea how well this will work out for an ARM CPU, but here is the core of my implementation (sorry about the formatting, paste and encode as HTML doesn't work well for C++ code anymore :(():
If it were closer to performing like I need, I'd twiddle with optimizations like this, but this sort of improvement isn't going to take the code from minutes down to seconds, and that's what I need. I've basically abandoned color dithering for this project.
Real programmers use butterflies
-
I spent all day and night working on dithering routines. I finally got it all compiling, and it worked the first time it ran. And it only took 5 minutes to render a frame. :(( :(( :(( So I replaced it with an algorithm that's supposed to be much faster. I got it down to 3 minutes. I can't use this code that I ran myself ragged developing. *headdesk*
Real programmers use butterflies
I am surprised it is that bad. I thought something like the ESP32 at 240MHz would do Floyd-Steinberg reasonably well, at least keeping up with the 680x0s and 80x86s I used to run Floyd-Steinberg on 25-odd years ago. I know it is a different architecture, but it also has a 200MHz speed advantage. But maybe it was just a lot slower than I recall - back then we were amazed it displayed anything at all. I am quite sure it never took a minute though - but if it was ½ or 20 seconds, who knows. I wonder if it is memory access or something slowing it down. Takes ages to develop these kinds of things though. As soon as it displays something, you lose the next couple of hours looking at it before moving on. :)
-
I am surprised it is that bad. I thought something like the ESP32 at 240MHz would do Floyd-Steinberg reasonably well, at least keeping up with the 680x0s and 80x86s I used to run Floyd-Steinberg on 25-odd years ago. I know it is a different architecture, but it also has a 200MHz speed advantage. But maybe it was just a lot slower than I recall - back then we were amazed it displayed anything at all. I am quite sure it never took a minute though - but if it was ½ or 20 seconds, who knows. I wonder if it is memory access or something slowing it down. Takes ages to develop these kinds of things though. As soon as it displays something, you lose the next couple of hours looking at it before moving on. :)
I can't do Floyd-Steinberg because of the memory requirements. I use a style similar to Thomas Knoll's Adobe Photoshop grid dithering method for my "slow" dithering, and an optimized Yliluoma algorithm for my "fast" dithering. Both are far too slow.
Real programmers use butterflies
-
I am surprised it is that bad. I thought something like the ESP32 at 240MHz would do Floyd-Steinberg reasonably well, at least keeping up with the 680x0s and 80x86s I used to run Floyd-Steinberg on 25-odd years ago. I know it is a different architecture, but it also has a 200MHz speed advantage. But maybe it was just a lot slower than I recall - back then we were amazed it displayed anything at all. I am quite sure it never took a minute though - but if it was ½ or 20 seconds, who knows. I wonder if it is memory access or something slowing it down. Takes ages to develop these kinds of things though. As soon as it displays something, you lose the next couple of hours looking at it before moving on. :)
Adding: were you doing color dithering? My black and white Bayer dithering is quite fast. Also, if you were doing color dithering, there are much faster algos you can use when simply simulating a higher bit depth, but I actually have to do color matching to a palette. Here's how I have to choose two colors to blend:
template <typename PaletteType>
gfx_result dither_mixing_plan_fast(const PaletteType* palette, typename PaletteType::mapped_pixel_type color, dither_mixing_plan_data_fast* plan) {
gfx_result rr;
if(nullptr==plan || nullptr==palette) {
return gfx_result::invalid_argument;
}
rgb_pixel<24> rgb888;
rr = convert(color,&rgb888);
if(gfx_result::success!=rr) {
return rr;
}
const unsigned r = rgb888.template channel<channel_name::R>(),
g = rgb888.template channel<channel_name::G>(),
b = rgb888.template channel<channel_name::B>();
*plan = { {0,0}, 0.5 };
double least_penalty = 1e99;
for(unsigned index1 = 0; index1 < 16; ++index1)
for(unsigned index2 = index1; index2 < 16; ++index2) {
// Determine the two component colors
typename PaletteType::mapped_pixel_type mpx1;
rr=palette->map(typename PaletteType::pixel_type(index1),&mpx1);
if(gfx_result::success!=rr) {
return rr;
}
typename PaletteType::mapped_pixel_type mpx2;
rr=palette->map(typename PaletteType::pixel_type(index2),&mpx2);
if(gfx_result::success!=rr) {
return rr;
}
rr = convert(mpx1,&rgb888);
if(gfx_result::success!=rr) {
return rr;
}
unsigned r1 = rgb888.template channel<channel_name::R>(),
g1 = rgb888.template channel<channel_name::G>(),
b1 = rgb888.template channel<channel_name::B>();
rr = convert(mpx2,&rgb888);
if(gfx_result::success!=rr) {
return rr;
}
unsigned r2 = rgb888.template channel<channel_name::R>(),
g2 = rgb888.template channel<channel_name::G>(),
b2 = rgb888.template channel<channel_name::B>();
int ratio = 32;
if(mpx1.native_value != mpx2.native_value) {
// Determine the ratio of mixing for each channel.
//   solve(r1 + ratio*(r2-r1)/64 = r, ratio)
// Take a weighed average of these three ratios according to the
// perceived luminosity of each channel (accordin
-
I guess you do a bit of rounding then? If you do, it may be possible to improve that with a lot of bit-fiddling … On x64 I've seen significant performance improvements, between 20% and 300%, for the following: isnan, isinf, signbit, frexp, min, max, trunc, round, clamp and lerp. I have no idea how well this will work out for an ARM CPU, but here is the core of my implementation (sorry about the formatting, paste and encode as HTML doesn't work well for C++ code anymore :(():
You made me curious. It looks like this worked. There were a couple of quirky bits of text in your paste I had to eliminate, but the main thing was using Notepad++ to convert it to ANSI before pasting here. Weird - it took a bit of work, like 5 mins of puttering... For my own code, I do a two-step process: first paste as HTML and encode, then copy everything, delete it, and repaste as C++.
template <typename T>
struct FractionWidth;
template <>
struct FractionWidth<float>
{
static constexpr UInt32 value = 23;
};
template <>
struct FractionWidth<double>
{
static constexpr UInt32 value = 52;
};
template <typename T>
struct ExponentWidth;
template <>
struct ExponentWidth<float>
{
static constexpr UInt32 value = 8;
};
template <>
struct ExponentWidth<double>
{
static constexpr UInt32 value = 11;
};
template <typename T>
struct ExponenBias;
template <>
struct ExponenBias<float>
{
static constexpr UInt32 value = _FBIAS;
};
template <>
struct ExponenBias<double>
{
static constexpr UInt32 value = _DBIAS;
};
template <typename T>
struct InfinityUnsignedValue;
template <>
struct InfinityUnsignedValue<float>
{
static constexpr UInt32 value = 0X7F800000UL;
};
template <>
struct InfinityUnsignedValue<double>
{
static constexpr UInt64 value = 0x7FF0000000000000ULL;
};
template <typename T>
struct NegativeInfinityUnsignedValue;
template <>
struct NegativeInfinityUnsignedValue<float>
{
static constexpr UInt32 value = 0xFF800000UL;
};
template <>
struct NegativeInfinityUnsignedValue<double>
{
static constexpr UInt64 value = 0xFFF0000000000000ULL;
};
template <typename T>
struct QuietNaNUnsignedValue;
template <>
struct QuietNaNUnsignedValue<float>
{
static constexpr UInt32 value = 0XFFC00001UL;
};
template <>
struct QuietNaNUnsignedValue<double>
{
static constexpr UInt64 value = 0x7FF0000000000001ULL;
};
#pragma pack(push,1)
template<typename T>
struct FloatingPoint
{
using ValueType = std::remove_cvref_t<T>;
using UIntType = MakeUnsigned<ValueType>;
static constexpr Int32 FractionWidth = static_cast<Int32>( Internal::FractionWidth<ValueType>::value );
static constexpr Int32 ExponentWidth = static_cast<Int32>( Internal::ExponentWidth<ValueTyp
-
I spent all day and night working on dithering routines. I finally got it all compiling, and it worked the first time it ran. And it only took 5 minutes to render a frame. :(( :(( :(( So I replaced it with an algorithm that's supposed to be much faster. I got it down to 3 minutes. I can't use this code that I ran myself ragged developing. *headdesk*
Real programmers use butterflies
:confused: An N^2-level grey scale dithering for a 2-colour screen (B&W) may be performed with an NxN integer matrix. If N is a power of 2, the most expensive operations would be masking the X, Y coordinates to the range 0 .. N-1 and reading the dither threshold value. Unless you are dealing with giant pictures, how would the dithering take that long? (You could even speed up the rendering by dithering a partial block of the picture in memory, and then bitblt-ing the block to the screen. The size of the block would depend on the memory available.) As for colour dithering, assuming that you have an 8-colour screen (on/off for each of R, G, B), perhaps you could use a B&W dithering algorithm on each level - R, G, B - and then combine them. I do not know about the quality of the colours, though...
Freedom is the freedom to say that two plus two make four. If that is granted, all else follows. -- 6079 Smith W.
-
:confused: An N^2-level grey scale dithering for a 2-colour screen (B&W) may be performed with an NxN integer matrix. If N is a power of 2, the most expensive operations would be masking the X, Y coordinates to the range 0 .. N-1 and reading the dither threshold value. Unless you are dealing with giant pictures, how would the dithering take that long? (You could even speed up the rendering by dithering a partial block of the picture in memory, and then bitblt-ing the block to the screen. The size of the block would depend on the memory available.) As for colour dithering, assuming that you have an 8-colour screen (on/off for each of R, G, B), perhaps you could use a B&W dithering algorithm on each level - R, G, B - and then combine them. I do not know about the quality of the colours, though...
Freedom is the freedom to say that two plus two make four. If that is granted, all else follows. -- 6079 Smith W.
Grayscale dithering is not color dithering. Also, it's much faster to dither when you have a palette where the colors are evenly distributed/graduated. When you're doing color matching to dither between, say, 7 available colors while loading a 24-bit color JPEG, that's a different story. You can't color dither like you suggest. You have to mix two colors, not just one color and a fixed color like black. This is what finding the two colors to mix looks like - or at least one of the ways I use:
template <typename PaletteType>
gfx_result dither_mixing_plan_fast(const PaletteType* palette, typename PaletteType::mapped_pixel_type color, dither_mixing_plan_data_fast* plan) {
gfx_result rr;
if(nullptr==plan || nullptr==palette) {
return gfx_result::invalid_argument;
}
rgb_pixel<24> rgb888;
rr = convert(color,&rgb888);
if(gfx_result::success!=rr) {
return rr;
}
const unsigned r = rgb888.template channel<channel_name::R>(),
g = rgb888.template channel<channel_name::G>(),
b = rgb888.template channel<channel_name::B>();
*plan = { {0,0}, 0.5 };
double least_penalty = 1e99;
for(unsigned index1 = 0; index1 < PaletteType::size; ++index1)
for(unsigned index2 = index1; index2 < PaletteType::size; ++index2) {
// Determine the two component colors
typename PaletteType::mapped_pixel_type mpx1;
rr=palette->map(typename PaletteType::pixel_type(index1),&mpx1);
if(gfx_result::success!=rr) {
return rr;
}
typename PaletteType::mapped_pixel_type mpx2;
rr=palette->map(typename PaletteType::pixel_type(index2),&mpx2);
if(gfx_result::success!=rr) {
return rr;
}
rr = convert(mpx1,&rgb888);
if(gfx_result::success!=rr) {
return rr;
}
unsigned r1 = rgb888.template channel<channel_name::R>(),
g1 = rgb888.template channel<channel_name::G>(),
b1 = rgb888.template channel<channel_name::B>();
rr = convert(mpx2,&rgb888);
if(gfx_result::success!=rr) {
return rr;
}
unsigned r2 = rgb888.template channel<channel_name::R>(),
g2 = rgb888.template channel<channel_name::G>(),
b2 = rgb888.template channel<channel_name::B>();
int ratio = 32;
if(mpx1.native_value != mpx2.native_value) {
// Determine the ratio of mixing for
-
Adding: were you doing color dithering? My black and white Bayer dithering is quite fast. Also, if you were doing color dithering, there are much faster algos you can use when simply simulating a higher bit depth, but I actually have to do color matching to a palette. Here's how I have to choose two colors to blend:
It was color. As far as I recall, we had a 16 color fixed palette, and anything available above that would be "allocated" as images were displayed (I think the Amiga supported 32, 64, 128, 256 - while our Windows support would go directly to 256). We also supported 16/24 bit, but I believe we just did nearest color on 16 bit. Palette entries were allocated by one or another algorithm based on the distance to the nearest "existing" color, the number of pixels using the color, and the number of free palette entries. But most likely our images were just small enough that we did not encounter memory issues. I recall we did support the Amiga 500, but we might have had restrictions on features there. Most systems would have had at least 1MB, and then it is no problem keeping an extra scanline or two in memory for dithering.
-
It was color. As far as I recall, we had a 16 color fixed palette, and anything available above that would be "allocated" as images were displayed (I think the Amiga supported 32, 64, 128, 256 - while our Windows support would go directly to 256). We also supported 16/24 bit, but I believe we just did nearest color on 16 bit. Palette entries were allocated by one or another algorithm based on the distance to the nearest "existing" color, the number of pixels using the color, and the number of free palette entries. But most likely our images were just small enough that we did not encounter memory issues. I recall we did support the Amiga 500, but we might have had restrictions on features there. Most systems would have had at least 1MB, and then it is no problem keeping an extra scanline or two in memory for dithering.
lmoelleb wrote:
Palette entries were allocated by one or another algorithm based on the distance to the nearest "existing" color, the number of pixels using the color, and the number of free palette entries.
I'm not sure what you mean by nearest existing color, as my algo has to find *two* colors in order to determine what to blend with what. I have a KD-tree implementation waiting in the wings for larger palettes, since it sorts in such a way as to speed up distance-based matching, but it's not helpful for, say, 16 colors. I may "pre-expand" the palette, mixing colors beforehand, so a 16 color palette becomes (16*15)/2 colors, and then try throwing that into a kd_tree and see what happens. But that's my biggest issue: finding the two colors to blend. The rest is fast.
Real programmers use butterflies
-
lmoelleb wrote:
Palette entries were allocated by one or another algorithm based on the distance to the nearest "existing" color, the number of pixels using the color, and the number of free palette entries.
I'm not sure what you mean by nearest existing color, as my algo has to find *two* colors in order to determine what to blend with what. I have a KD-tree implementation waiting in the wings for larger palettes, since it sorts in such a way as to speed up distance-based matching, but it's not helpful for, say, 16 colors. I may "pre-expand" the palette, mixing colors beforehand, so a 16 color palette becomes (16*15)/2 colors, and then try throwing that into a kd_tree and see what happens. But that's my biggest issue: finding the two colors to blend. The rest is fast.
Real programmers use butterflies
We were mainly (maybe only) loading images with a palette. So we were dithering one palette to another, meaning there was a max of 256 colors in the source image (and often fewer than that). This allowed us to quickly calculate the total number of pixels of a given color, and for each color calculate how close it was to existing colors in our palette. I guess it is pretty useless these days as no one works with palettes (I think we had a jpg decoder for fun, but we were probably sticking to gif and various other formats for the images we really needed). Once the palette was locked in for an image, we could just run the Floyd-Steinberg. It only requires finding the nearest color per pixel, as "blending" is done by pushing the error ahead of the calculations (at the cost of one extra scanline of memory - though I guess you could stamp it into the bitmap data directly). No problem on a 500KB system with relatively small images. We had the advantage that low CPU spec systems were typically also running fewer colors (so we only had to search for the nearest color among 16 or 32 target colors), while systems running 256 colors typically also had more CPU power.
-
We were mainly (maybe only) loading images with a palette. So we were dithering one palette to another, meaning there was a max of 256 colors in the source image (and often fewer than that). This allowed us to quickly calculate the total number of pixels of a given color, and for each color calculate how close it was to existing colors in our palette. I guess it is pretty useless these days as no one works with palettes (I think we had a jpg decoder for fun, but we were probably sticking to gif and various other formats for the images we really needed). Once the palette was locked in for an image, we could just run the Floyd-Steinberg. It only requires finding the nearest color per pixel, as "blending" is done by pushing the error ahead of the calculations (at the cost of one extra scanline of memory - though I guess you could stamp it into the bitmap data directly). No problem on a 500KB system with relatively small images. We had the advantage that low CPU spec systems were typically also running fewer colors (so we only had to search for the nearest color among 16 or 32 target colors), while systems running 256 colors typically also had more CPU power.
Actually, palettes are very useful these days for e-paper displays, which are either monochrome or have a *fixed palette* of a handful of colors. That's primarily why my GFX library supports them.
Real programmers use butterflies