fast fast fast
-
Hi. I am writing a video processing application. At first I used OpenCV, and it was great because of its very high performance. However, I need to implement some functions myself, and the main problem is performance. For example, I used the cvSub(...) function to subtract two 2D arrays, and then did the same thing with a simple for loop. The calculation times were:
OpenCV: 0.135496 ms
my code: 0.376614 ms
I ran the program several times and the result was the same. So my question is: do you have any idea how to perform the task as fast as OpenCV? Part of my code:
double t2;
printf("-----cvSub on cvMat------\n");
for(int i=0; i<5; i++)
{
    t2 = (double)cvGetTickCount();
    cvSub(mat1, mat2, mat3, 0);
    t2 = (double)cvGetTickCount() - t2;
    printf( "detection time = %gms\n", t2/((double)cvGetTickFrequency()*1000.) );
}
printf("-----sub on 2D array------\n");
for(int i=0; i<5; i++)
{
    t2 = (double)cvGetTickCount();
    for(int j=0; j<240; j++)
    {
        for(int k=0; k<320; k++)
        {
            m3[j][k] = m1[j][k]-m2[j][k];
        }
    }
    t2 = (double)cvGetTickCount() - t2;
    printf( "detection time = %gms\n", t2/((double)cvGetTickFrequency()*1000.) );
}
Each method is run 5 times to get the mean time.
-
Since the OpenCV source code is available, why don't you have a look at it? :)
If the Lord God Almighty had consulted me before embarking upon the Creation, I would have recommended something simpler. -- Alfonso the Wise, 13th Century King of Castile.
This is going on my arrogant assumptions. You may have a superb reason why I'm completely wrong. -- Iain Clarke
[My articles]
-
Have you tried disabling the printf command while measuring the performance? I reckon it takes a bit of time. :) You could store all those values to be printed into a vector or something and print it all together at the end (of course don't count that as execution time).
“Follow your bliss.” – Joseph Campbell
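A minimal sketch of that idea, for what it's worth: each run's time goes into an array and nothing is printed until all runs are finished. The function names (subtract_arrays, time_subtraction, print_timings) are made up for illustration, the element type is assumed to be double (the question does not say), and the standard clock() stands in for cvGetTickCount() so the sketch builds without OpenCV.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

#define ROWS 240
#define COLS 320
#define RUNS 5

/* Assumption: double elements; the original post does not state the type. */
double m1[ROWS][COLS], m2[ROWS][COLS], m3[ROWS][COLS];

/* Element-wise m3 = m1 - m2, the same loop as in the question. */
void subtract_arrays(void)
{
    for (int j = 0; j < ROWS; j++)
        for (int k = 0; k < COLS; k++)
            m3[j][k] = m1[j][k] - m2[j][k];
}

/* Times `runs` repetitions, stores each elapsed time (in ms) in out_ms,
   and returns the mean. No output happens inside the timed region. */
double time_subtraction(double *out_ms, int runs)
{
    assert(out_ms != NULL && runs > 0);
    double total = 0.0;
    for (int i = 0; i < runs; i++) {
        clock_t start = clock();
        subtract_arrays();
        clock_t end = clock();
        out_ms[i] = 1000.0 * (double)(end - start) / CLOCKS_PER_SEC;
        total += out_ms[i];
    }
    return total / runs;
}

/* Prints the stored timings after all measurements are done. */
void print_timings(const double *ms, int runs)
{
    for (int i = 0; i < runs; i++)
        printf("run %d: %g ms\n", i, ms[i]);
}
```

Typical use would be to call time_subtraction(ms, RUNS) with a local double ms[RUNS] and then print_timings(ms, RUNS). Note that clock() can be fairly coarse for sub-millisecond work, so timing many repetitions per measurement may give steadier numbers.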
-
__erfan__ wrote:
if you find that please let me know
I found it (you've to go a bit deep inside the source code): the library uses SSE2 for integer arithmetic operations (see [^]), in particular _mm_subs_epu8 for cvSub. :)
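To give an idea of what using that intrinsic looks like, here is a rough sketch. It is not OpenCV's actual code; it assumes the data is 8-bit unsigned (the question does not say which type the arrays hold) and an x86 compiler with SSE2 enabled. The function name sub_u8_sse2 is made up. Note that _mm_subs_epu8 saturates, so a result below 0 becomes 0 rather than wrapping around.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* dst[i] = max(a[i] - b[i], 0) for n 8-bit unsigned elements.
   Hypothetical helper, not part of OpenCV. */
void sub_u8_sse2(const uint8_t *a, const uint8_t *b, uint8_t *dst, size_t n)
{
    assert(a != NULL && b != NULL && dst != NULL);
    size_t i = 0;

    /* Main loop: 16 bytes per iteration, unaligned loads and stores. */
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_subs_epu8(va, vb));
    }

    /* Scalar tail for the remaining n % 16 elements, same saturation. */
    for (; i < n; i++)
        dst[i] = (a[i] > b[i]) ? (uint8_t)(a[i] - b[i]) : 0;
}
```

For a 240 x 320 image stored contiguously, one call with n = 240 * 320 would cover the whole frame, which is where the gain over the element-by-element loop would be expected to come from.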
-
You could consider doing 4 double operations in each loop iteration. This would allow modern CPUs to perform the operations in parallel. Taken from Software Optimization Guide for AMD64 Processors[^]:

Rationale and Examples
This is especially important to break long dependency chains into smaller executing units in floating-point code, whether it is mapped to x87, SSE, or SSE2 instructions, because of the longer latency of floating-point operations. Because most languages (including ANSI C) guarantee that floating-point expressions are not reordered, compilers cannot usually perform such optimizations unless they offer a switch to allow noncompliant reordering of floating-point expressions according to algebraic rules.
Reordered code that is algebraically identical to the original code does not necessarily produce identical computational results due to the lack of associativity of floating-point operations. There are well-known numerical considerations in applying these optimizations (consult a book on numerical analysis). In some cases, these optimizations may lead to unexpected results. In the vast majority of cases, the final result differs only in the least-significant bits.

Listing 10. Avoid
double a[100], sum;
int i;
sum = 0.0f;
for (i = 0; i < 100; i++) {
    sum += a[i];
}

Listing 11. Preferred
double a[100], sum1, sum2, sum3, sum4, sum;
int i;
sum1 = 0.0;
sum2 = 0.0;
sum3 = 0.0;
sum4 = 0.0;
for (i = 0; i < 100; i + 4) {
    sum1 += a[i];
    sum2 += a[i+1];
    sum3 += a[i+2];
    sum4 += a[i+3];
}
sum = (sum4 + sum3) + (sum1 + sum2);

Notice that the four-way unrolling is chosen to exploit the four-stage fully pipelined floating-point adder. Each stage of the floating-point adder is occupied on every clock cycle, ensuring maximum sustained utilization.
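A runnable sketch of the same four-accumulator idea, with the increment written as i += 4. As quoted, the loop header in Listing 11 has i + 4, which computes a value but never changes i, so that loop would not terminate. The function name sum_unrolled4 and the scalar tail for lengths that are not a multiple of 4 are my additions, not part of the AMD guide.

```c
#include <assert.h>
#include <stddef.h>

/* Sums n doubles using four independent accumulators so that
   consecutive additions do not depend on each other. */
double sum_unrolled4(const double *a, size_t n)
{
    assert(a != NULL || n == 0);
    double sum1 = 0.0, sum2 = 0.0, sum3 = 0.0, sum4 = 0.0;
    size_t i = 0;

    /* Four elements per iteration; note i += 4, not i + 4. */
    for (; i + 4 <= n; i += 4) {
        sum1 += a[i];
        sum2 += a[i + 1];
        sum3 += a[i + 2];
        sum4 += a[i + 3];
    }

    double sum = (sum4 + sum3) + (sum1 + sum2);

    /* Leftover elements when n is not a multiple of 4. */
    for (; i < n; i++)
        sum += a[i];

    return sum;
}
```

Because the additions are regrouped, the result can differ from the plain loop in the last bits, as the quoted text points out.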
-
Snakefoot wrote:
Notice that the four-way unrolling is chosen to exploit the four-stage fully pipelined floating-point adder.
Notice the four-way unrolling will eat up all the CPU time (and block the application) just to perform the first four operations ;P :)