fast fast fast

__erfan__

Hi. I am writing a video processing application. At first I tried to use OpenCV, and it was great, as it had very high performance. But I need to implement some functions myself, and the main problem is performance. For example, I used the cvSub(...) function to subtract two 2D arrays, and then I tried to do the same with a simple "for" loop. The calculation times are shown below:

OpenCV : 0.135496 ms
my code : 0.376614 ms

I ran the program several times and the result was the same. So I am here to ask: do you have any idea how to perform the task as fast as OpenCV? Part of my code:

double t2;
printf("-----cvSub on cvMat------\n");
for(int i=0; i<5; i++)
{
    t2 = (double)cvGetTickCount();
    cvSub(mat1, mat2, mat3, 0);
    t2 = (double)cvGetTickCount() - t2;
    printf( "detection time = %gms\n", t2/((double)cvGetTickFrequency()*1000.) );
}
printf("-----sub on 2D array------\n");
for(int i=0; i<5; i++)
{
    t2 = (double)cvGetTickCount();
    for(int j=0; j<240; j++)
    {
        for(int k=0; k<320; k++)
        {
            m3[j][k] = m1[j][k]-m2[j][k];
        }
    }
    t2 = (double)cvGetTickCount() - t2;
    printf( "detection time = %gms\n", t2/((double)cvGetTickFrequency()*1000.) );
}

each method is performed 5 times to get the mean time

CPallini

Since the OpenCV source code is available, why don't you have a look at it? :)

If the Lord God Almighty had consulted me before embarking upon the Creation, I would have recommended something simpler. -- Alfonso the Wise, 13th Century King of Castile.
This is going on my arrogant assumptions. You may have a superb reason why I'm completely wrong. -- Iain Clarke
[My articles]

Rajesh R Subramanian

Have you tried disabling the printf command while measuring the performance? I reckon it takes a bit of time. :) You could store all those values to be printed into a vector or something and print it all together at the end (of course don't count that as execution time).

“Follow your bliss.” – Joseph Campbell
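For what it's worth, the "store the values and print them at the end" idea could look roughly like the sketch below. It is only an illustration: the names time_runs and print_timings are made up here, the arrays are assumed to hold int (the original post does not say), and the standard clock() stands in for cvGetTickCount(), so the resolution is much coarser than in the original measurement.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

#define RUNS 5
#define ROWS 240
#define COLS 320

/* Assumption: int elements; the original post does not state the type. */
static int m1[ROWS][COLS], m2[ROWS][COLS], m3[ROWS][COLS];

/* Element-wise subtraction, same shape as the loop in the question. */
void subtract_arrays(void)
{
    for (int j = 0; j < ROWS; j++)
        for (int k = 0; k < COLS; k++)
            m3[j][k] = m1[j][k] - m2[j][k];
}

/* Time each run and store the elapsed milliseconds; nothing is printed here. */
void time_runs(double *out_ms, int runs)
{
    assert(out_ms != NULL && runs > 0);
    for (int i = 0; i < runs; i++) {
        clock_t start = clock();
        subtract_arrays();
        clock_t end = clock();
        out_ms[i] = 1000.0 * (double)(end - start) / CLOCKS_PER_SEC;
    }
}

/* Print all stored timings after the measurements are finished. */
void print_timings(const double *ms, int runs)
{
    for (int i = 0; i < runs; i++)
        printf("run %d: %g ms\n", i, ms[i]);
}
```

A caller would declare double ms[RUNS], call time_runs(ms, RUNS) and only then print_timings(ms, RUNS). With optimization turned on, a compiler may also drop work whose result is never read, so it is worth using m3 afterwards.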

__erfan__

I've searched for it and ... nothing. :doh: if you find that please let me know

__erfan__

Thanks for your help, but printf is outside the timing block and is present in both methods.

CPallini

__erfan__ wrote:

if you find that please let me know

I found it (you have to go a bit deep inside the source code): the library uses SSE2 for integer arithmetic operations (see [^]), in particular _mm_subs_epu8 for cvSub. :)
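To give an idea of what that intrinsic buys you, here is a minimal sketch of a subtraction loop built on _mm_subs_epu8. This is not OpenCV's actual code. It assumes unsigned 8-bit data (the type of the arrays in the question is not stated) and an x86 target with SSE2, and the function name sub_u8_sse2 is made up for the example. Note that _mm_subs_epu8 saturates, so a negative difference becomes 0 instead of wrapping around.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* dst[i] = a[i] - b[i], clamped at 0, for n unsigned 8-bit values. */
void sub_u8_sse2(const uint8_t *a, const uint8_t *b, uint8_t *dst, size_t n)
{
    assert(a != NULL && b != NULL && dst != NULL);
    size_t i = 0;

    /* Main loop: 16 bytes per iteration, unaligned loads and stores. */
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_subs_epu8(va, vb));
    }

    /* Scalar tail for the remaining n % 16 elements, same saturation rule. */
    for (; i < n; i++)
        dst[i] = (uint8_t)(a[i] > b[i] ? a[i] - b[i] : 0);
}
```

For a 320x240 8-bit image stored contiguously, one call with n = 320 * 240 covers the whole frame and handles 16 pixels per instruction instead of one.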


Rolf Kristensen

You could consider doing 4 double operations in each loop iteration. This would allow modern CPUs to perform the operations in parallel. Taken from Software Optimization Guide for AMD64 Processors[^]:

Rationale and Examples

This is especially important to break long dependency chains into smaller executing units in floating-point code, whether it is mapped to x87, SSE, or SSE2 instructions, because of the longer latency of floating-point operations. Because most languages (including ANSI C) guarantee that floating-point expressions are not reordered, compilers cannot usually perform such optimizations unless they offer a switch to allow noncompliant reordering of floating-point expressions according to algebraic rules. Reordered code that is algebraically identical to the original code does not necessarily produce identical computational results due to the lack of associativity of floating-point operations. There are well-known numerical considerations in applying these optimizations (consult a book on numerical analysis). In some cases, these optimizations may lead to unexpected results. In the vast majority of cases, the final result differs only in the least-significant bits.

Listing 10. Avoid

double a[100], sum;
int i;
sum = 0.0f;
for (i = 0; i < 100; i++) {
    sum += a[i];
}

Listing 11. Preferred

double a[100], sum1, sum2, sum3, sum4, sum;
int i;
sum1 = 0.0;
sum2 = 0.0;
sum3 = 0.0;
sum4 = 0.0;
for (i = 0; i < 100; i + 4) {
    sum1 += a[i];
    sum2 += a[i+1];
    sum3 += a[i+2];
    sum4 += a[i+3];
}
sum = (sum4 + sum3) + (sum1 + sum2);

Notice that the four-way unrolling is chosen to exploit the four-stage fully pipelined floating-point adder. Each stage of the floating-point adder is occupied on every clock cycle, ensuring maximum sustained utilization.
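As a compilable take on the same idea, here is a small sketch with four independent accumulators. The function name sum4 and the remainder loop are additions for illustration and are not part of the guide, and the loop counter is advanced with i += 4. Because the additions are regrouped, the result can differ from a plain left-to-right sum in the last bits, as the quoted text warns.

```c
#include <assert.h>
#include <stddef.h>

/* Sum n doubles using four independent accumulators. */
double sum4(const double *a, size_t n)
{
    assert(a != NULL);
    double s1 = 0.0, s2 = 0.0, s3 = 0.0, s4 = 0.0;
    size_t i = 0;

    /* Four additions per iteration with no dependency between them. */
    for (; i + 4 <= n; i += 4) {
        s1 += a[i];
        s2 += a[i + 1];
        s3 += a[i + 2];
        s4 += a[i + 3];
    }

    /* Leftover elements when n is not a multiple of 4. */
    for (; i < n; i++)
        s1 += a[i];

    return (s4 + s3) + (s1 + s2);
}
```

The same pattern would apply to the subtraction loop in the original question, although there the iterations are already independent of each other, so the gain (if any) depends on what the compiler does on its own.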

CPallini

Snakefoot wrote:

Notice that the four-way unrolling is chosen to exploit the four-stage fully pipelined floating-point adder.

Notice that the four-way unrolling, as posted, will eat up all the CPU time (and block the application) just to perform the first four operations: the loop increment is written i + 4, which never changes i (it should be i += 4) ;P :)


__erfan__

Thank you, guys. My brain is telling me to go to sleep; I have been on my computer for 24 hours and ... :zzz: