float vs double - tricky performance issue [modified]

Chris Losinger

(this is a long one, i apologize) i'm working on an image filter. the basic operation is: 1. run across a row of pixels (one BYTE per component); perform a calculation and store the result in an array of floats. the output array is the same width as a pixel row. 2. run across the same row, but in the opposite direction; perform a similar (but not identical) calculation. store the results in a different array of floats. 3. sum each pair from the two float arrays and store the result in a temporary float image of the same dimensions as the input image, but rotate the output 90 degrees. in other words: output rows as columns in the temp image. 4. using the same basic method as 1 and 2, process the floating point temp image. sum the results and write them to an output 8-bit image. so, again: there are two calculation loops and a summation loop per row. after all input rows are done, the process is repeated using the temp as input.

Loop over 8-bit Rows
Loop over Column[y] left to right => tempRow1
Loop over Column[y] right to left => tempRow2
tempRow1 + tempRow2 => temp f.p. image row[y] // 90 deg rotation

Loop over temp f.p. Rows
Loop over Column[y] left to right => tempRow1
Loop over Column[y] right to left => tempRow2
tempRow1 + tempRow2 => output 8-bit image row[y] // -90 deg rotation

just for reference, here's the first calculation loop:

// first four pixels are handled outside the loop.
// pCurPix is a pointer to the source image (BYTEs, variable number of channels).
// pTempPos is an array of floating point values
// the two factor arrays are floating points, too.
for (x=4;x < width;x++)
{

=

numerator_factors_positive[0] * pCurPix[0]

numerator_factors_positive[1] * pCurPix[-iChannels]
numerator_factors_positive[2] * pCurPix[-2 * iChannels]
numerator_factors_positive[3] * pCurPix[-3 * iChannels]

denominator_factors_positive[1] * pTempPos[x - 1]
denominator_factors_positive[2] * pTempPos[x - 2]
denominator_factors_positive[3] * pTempPos[x - 3]
denominator_factors_positive[4] * pTempPos[x - 4];

pCurPix+=uChannels;
}

that's the first loop. the third loop looks identical, except that pCurPix is a pointer to a floating point value in the third loop - it is a BYTE pointer here. the 2nd and 4th loops are very similar to that and are also identical to each other - again, except for the pCurPix data type. also, i wrote this code as a template so i can switch

Lost User

Odd, according to the manual the speeds of fmul fadd and fsub are the same for all normal inputs, which is everything except denormals infinities and NANs. Have you checked for denormals? edit: it changes a little if you're using SSE, but not much, I would still check for denormals just in case

Chris Losinger

yeah, i've been looking for that. it was blowing up at sigma = 93.5 because of a div by zero in the factors calculations (which put a bunch of #INFs and #INDs in the calcs). but everything looks reasonable in the sigma=3 range.

image processing toolkits | batch image processing

TimothyPMoore

Any chance you can show your code that includes your use of the sigma function parameter ? Does your code produce the correct values ?

Chris Losinger

sure. here's most of it:

void CalcIIRFactors (FloatType sigma)
{
// constants from eq 38

  const FloatType b0 = (FloatType)-1.783;
  const FloatType b1 = (FloatType)-1.723;
  const FloatType w0 =  (FloatType)0.6318;
  const FloatType w1 =  (FloatType)1.997;

  const FloatType b0OverSigma = b0 / sigma;
  const FloatType b1OverSigma = b1 / sigma;
  const FloatType w0OverSigma = w0 / sigma;
  const FloatType w1OverSigma = w1 / sigma;

  const FloatType pi = (FloatType)acos(-1.0);
  const FloatType scale = sqrt (2 \* pi) \* sigma;
  const FloatType a0 = (FloatType)1.680 / scale;
  const FloatType a1 = (FloatType)3.735 / scale;
  const FloatType c0 = (FloatType)-0.6803 / scale;
  const FloatType c1 = (FloatType)-0.2598 / scale;

  // eq 21
  numerator\_factors\_positive \[0\] = a0 + c0;
  numerator\_factors\_positive \[1\] = 
(exp(b1OverSigma)\*(c1\*sin(w1OverSigma)-
(c0+2\*a0)\*cos(w1OverSigma)) + 
exp(b0OverSigma)\*(a1\*sin(w0OverSigma) - 
(2\*c0+a0)\*cos (w0OverSigma)));

  numerator\_factors\_positive \[2\] = 
(2 \* exp(b0OverSigma+b1OverSigma) \* 
((a0+c0)\*cos(w1OverSigma)\*cos(w0OverSigma) - 
a1\*cos(w1OverSigma)\*sin(w0OverSigma) - 
c1\*cos(w0OverSigma)\*sin(w1OverSigma)) + 
c0\*exp(2\*b0OverSigma) + a0\*exp(2\*b1OverSigma));

  numerator\_factors\_positive \[3\] = 
(exp(b1OverSigma+2\*b0OverSigma) \* (c1\*sin(w1OverSigma) - 
c0\*cos(w1OverSigma)) + exp(b0OverSigma+2\*b1OverSigma) \* 
(a1\*sin(w0OverSigma) - a0\*cos(w0OverSigma)));

  numerator\_factors\_positive \[4\] = 0.0;

  // eq 22
  denominator\_factors\_positive \[0\] = 0.0;
  denominator\_factors\_positive \[1\] = 
-2 \* exp(b1OverSigma) \* cos(w1OverSigma) -  
2 \* exp(b0OverSigma) \* cos (w0OverSigma);

  denominator\_factors\_positive \[2\] = 
4 \* cos(w1OverSigma) \* cos(w0OverSigma) \* 
exp(b0OverSigma + b1OverSigma) +  
exp(2 \* b1OverSigma) + exp(2 \* b0OverSigma);

  denominator\_factors\_positive \[3\] = 
-2 \* cos(w0OverSigma) \* exp(b0OverSigma + 
2\*b1OverSigma) -  2\*cos(w1OverSigma) \* 
exp(b1OverSigma + 2\*b0OverSigma);

  denominator\_factors\_positive \[4\] = exp(2\*b0OverSigma + 2\*b1OverSigma);

  int i;

  // FACTORS = 5
  for (i = 0; i < FACTORS; i++)
  {
     denominator\_factors\_negative\[i\] = denominator\_factors\_positive\[i\];
  }

  numerator\_factors\_negative\[0\] = 0.0;  

  for (i = 1; i < FACTORS; i++)
  {

TimothyPMoore

I understand about the proprietary requirements. Your code produces normal values for the numerator/demonimator values over the range of sigma that I you mentioned in my experiments. The only suggestion I have at this point is can you instrument your runtime statistics at a finer levels to see what portions of the your calculations could be responsible to efficiency reduction. If you find the solution it would be nice if you could post a follow-up to the thread. Good luck.

Rick York

Since performance is an issue you could probably achieve a speed improvement if you compute some more temporary variables like sin, cos, and exp of various expressions because several of those are recomputed more than once. For example, cos(w1OverSigma) is computed seven times in that function.

Chris Losinger

the thing is, those are calculated ahead of the actual processing, and don't change. the actual work is pretty much 4x the loop i posted.

image processing toolkits | batch image processing