Code Project
fast fast fast

C / C++ / MFC
Tags: data-structures, performance, tutorial, question
9 posts, 4 posters
    __erfan__ wrote (#1):

    Hi. I am writing a video processing application. At first I used OpenCV, and it was great because its performance is very high. However, I need to implement some functions myself, and the main problem is performance. For example, I used the cvSub(...) function to subtract two 2D arrays, and then did the same thing with a simple for loop. The measured times are:

    OpenCV:  0.135496 ms
    my code: 0.376614 ms

    I ran the program several times and the results were the same. So my question is: do you have any idea how to perform this task as fast as OpenCV? Part of my code:

    double t2;
    printf("-----cvSub on cvMat------\n");
    for (int i = 0; i < 5; i++)
    {
        t2 = (double)cvGetTickCount();
        cvSub(mat1, mat2, mat3, 0);
        t2 = (double)cvGetTickCount() - t2;
        printf("detection time = %gms\n", t2 / ((double)cvGetTickFrequency() * 1000.));
    }
    printf("-----sub on 2D array------\n");
    for (int i = 0; i < 5; i++)
    {
        t2 = (double)cvGetTickCount();
        for (int j = 0; j < 240; j++)
        {
            for (int k = 0; k < 320; k++)
            {
                m3[j][k] = m1[j][k] - m2[j][k];
            }
        }
        t2 = (double)cvGetTickCount() - t2;
        printf("detection time = %gms\n", t2 / ((double)cvGetTickFrequency() * 1000.));
    }

    Each method is run 5 times so that the mean time can be taken.

      CPallini wrote (#2, in reply to #1):

      Since the OpenCV source code is available, why don't you have a look at it? :)

      If the Lord God Almighty had consulted me before embarking upon the Creation, I would have recommended something simpler. -- Alfonso the Wise, 13th Century King of Castile.
      This is going on my arrogant assumptions. You may have a superb reason why I'm completely wrong. -- Iain Clarke
      [My articles]

      In testa che avete, signor di Ceprano? ("What is that on your head, Signor di Ceprano?")

        Rajesh R Subramanian wrote (#3, in reply to #1):

        Have you tried disabling the printf command while measuring the performance? I reckon it takes a bit of time. :) You could store all those values to be printed into a vector or something and print it all together at the end (of course don't count that as execution time).
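The suggestion above (collect the timings first, print them afterwards) could be sketched roughly as follows. This is only an illustration: it uses std::chrono in place of the cvGetTickCount calls from the original post, and the names time_runs, mean_ms and report are hypothetical helpers, not part of OpenCV.

```cpp
#include <chrono>
#include <cstdio>
#include <numeric>
#include <vector>

// Runs `work` `runs` times and returns each run's duration in milliseconds.
// Nothing is printed inside the timed region.
template <typename Work>
std::vector<double> time_runs(Work work, int runs)
{
    std::vector<double> ms;
    ms.reserve(runs);
    for (int i = 0; i < runs; ++i) {
        auto start = std::chrono::steady_clock::now();
        work();
        auto stop = std::chrono::steady_clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return ms;
}

// Mean of the collected timings (0 for an empty vector).
double mean_ms(const std::vector<double>& ms)
{
    if (ms.empty()) return 0.0;
    return std::accumulate(ms.begin(), ms.end(), 0.0) / static_cast<double>(ms.size());
}

// Prints every timing and the mean, after all measurements have finished.
void report(const char* label, const std::vector<double>& ms)
{
    for (double t : ms) std::printf("%s: %g ms\n", label, t);
    std::printf("%s mean: %g ms\n", label, mean_ms(ms));
}
```

A call might look like report("cvSub", time_runs([&] { cvSub(mat1, mat2, mat3, 0); }, 5)); the same wrapper can then time the hand-written loop, so both methods are measured in exactly the same way.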

        “Follow your bliss.” – Joseph Campbell

          __erfan__ wrote (#4, in reply to #2):

          I've searched for it and ... nothing. :doh: If you find it, please let me know.

            __erfan__ wrote (#5, in reply to #3):

            Thanks for your help, but the printf call is outside the timed block and is present in both methods.

              CPallini wrote (#6, in reply to #4):

              __erfan__ wrote:

              if you find that please let me know

              I found it (you have to go a bit deep inside the source code): the library uses SSE2 for integer arithmetic operations, in particular _mm_subs_epu8 for cvSub. :)
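To give an idea of what using _mm_subs_epu8 looks like, here is a minimal sketch, not OpenCV's actual code. It assumes the data are unsigned 8-bit values (the element type of m1, m2 and m3 is not stated in the thread), and sub_u8_saturate and HAVE_SSE2 are hypothetical names. The intrinsic subtracts sixteen bytes at once and clamps negative results to 0.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  // SSE2 intrinsics
#define HAVE_SSE2 1     // hypothetical macro used only in this sketch
#endif

// Saturating subtraction of two 8-bit buffers: dst[i] = max(a[i] - b[i], 0).
void sub_u8_saturate(const std::uint8_t* a, const std::uint8_t* b,
                     std::uint8_t* dst, std::size_t n)
{
    std::size_t i = 0;
#ifdef HAVE_SSE2
    // Process 16 bytes per iteration; unaligned loads and stores are used,
    // so the buffers need no special alignment.
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), _mm_subs_epu8(va, vb));
    }
#endif
    // Scalar tail (and full fallback when SSE2 is unavailable).
    for (; i < n; ++i) {
        dst[i] = a[i] > b[i] ? static_cast<std::uint8_t>(a[i] - b[i]) : 0;
    }
}
```

For a 240 x 320 image stored row by row, each row could be passed as sub_u8_saturate(m1[j], m2[j], m3[j], 320). Whether this matches the cvSub timing depends on the compiler, its flags and the data type, so it is worth measuring.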

                Rolf Kristensen wrote (#7, in reply to #1):

                You could consider doing 4 double operations in each loop iteration. This would allow modern CPUs to perform the operations in parallel. Taken from the Software Optimization Guide for AMD64 Processors:

                Rationale and Examples

                This is especially important to break long dependency chains into smaller executing units in floating-point code, whether it is mapped to x87, SSE, or SSE2 instructions, because of the longer latency of floating-point operations. Because most languages (including ANSI C) guarantee that floating-point expressions are not reordered, compilers cannot usually perform such optimizations unless they offer a switch to allow noncompliant reordering of floating-point expressions according to algebraic rules. Reordered code that is algebraically identical to the original code does not necessarily produce identical computational results due to the lack of associativity of floating-point operations. There are well-known numerical considerations in applying these optimizations (consult a book on numerical analysis). In some cases, these optimizations may lead to unexpected results. In the vast majority of cases, the final result differs only in the least-significant bits.

                Listing 10. Avoid

                double a[100], sum;
                int i;
                sum = 0.0f;
                for (i = 0; i < 100; i++) {
                    sum += a[i];
                }

                Listing 11. Preferred

                double a[100], sum1, sum2, sum3, sum4, sum;
                int i;
                sum1 = 0.0;
                sum2 = 0.0;
                sum3 = 0.0;
                sum4 = 0.0;
                for (i = 0; i < 100; i + 4) {
                    sum1 += a[i];
                    sum2 += a[i+1];
                    sum3 += a[i+2];
                    sum4 += a[i+3];
                }
                sum = (sum4 + sum3) + (sum1 + sum2);

                Notice that the four-way unrolling is chosen to exploit the four-stage fully pipelined floating-point adder. Each stage of the floating-point adder is occupied on every clock cycle, ensuring maximum sustained utilization.
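Applied to the subtraction loop from the original question, the same four-operations-per-iteration idea might look like the sketch below. It assumes double elements (the thread does not state the array type), and sub_unrolled4 is a hypothetical name. Unlike the summation in the quoted listings, an element-wise subtraction has no dependency chain between iterations, so any gain here would come mainly from reduced loop overhead, and an optimizing compiler may already unroll or vectorize the simple loop by itself.

```cpp
#include <cstddef>

// dst[i] = a[i] - b[i], computing four elements per loop iteration.
void sub_unrolled4(const double* a, const double* b, double* dst, std::size_t n)
{
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i]     = a[i]     - b[i];
        dst[i + 1] = a[i + 1] - b[i + 1];
        dst[i + 2] = a[i + 2] - b[i + 2];
        dst[i + 3] = a[i + 3] - b[i + 3];
    }
    // Remainder when n is not a multiple of four.
    for (; i < n; ++i) {
        dst[i] = a[i] - b[i];
    }
}
```

With 240 rows of 320 elements, each row could be handled as sub_unrolled4(m1[j], m2[j], m3[j], 320) inside the outer loop over j.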

                  CPallini wrote (#8, in reply to #7):

                  Snakefoot wrote:

                  Notice that the four-way unrolling is chosen to exploit the four-stage fully pipelined floating-point adder.

                  Notice the four-way unrolling will eat up all the CPU time (and block the application) just to perform the first four operations ;P :)
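The remark appears to refer to the loop header of Listing 11 as posted: the expression i + 4 computes a value but never changes i, so the loop keeps adding the first four elements forever. A sketch of the presumably intended version, using i += 4, follows. The function name sum_unrolled4 and the remainder loop are additions for illustration and are not part of the quoted listing.

```cpp
#include <cstddef>

// Four-way unrolled sum with four independent accumulators.
double sum_unrolled4(const double* a, std::size_t n)
{
    double sum1 = 0.0, sum2 = 0.0, sum3 = 0.0, sum4 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {  // i += 4 advances the index; i + 4 would not
        sum1 += a[i];
        sum2 += a[i + 1];
        sum3 += a[i + 2];
        sum4 += a[i + 3];
    }
    double sum = (sum4 + sum3) + (sum1 + sum2);
    // Remainder when n is not a multiple of four.
    for (; i < n; ++i) {
        sum += a[i];
    }
    return sum;
}
```

As the quoted guide points out, the additions happen in a different order than in the simple loop, so for general floating-point data the result can differ in the least-significant bits.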

                    __erfan__ wrote (#9, in reply to #8):

                    Thank you, guys. My brain is telling me to go to sleep; I have been at my computer for 24 hours and ... :zzz:
