Skip to content
  • Categories
  • Recent
  • Tags
  • Popular
  • World
  • Users
  • Groups
Skins
  • Light
  • Cerulean
  • Cosmo
  • Flatly
  • Journal
  • Litera
  • Lumen
  • Lux
  • Materia
  • Minty
  • Morph
  • Pulse
  • Sandstone
  • Simplex
  • Sketchy
  • Spacelab
  • United
  • Yeti
  • Zephyr
  • Dark
  • Cyborg
  • Darkly
  • Quartz
  • Slate
  • Solar
  • Superhero
  • Vapor

  • Default (No Skin)
  • No Skin
Collapse
Code Project
  1. Home
  2. General Programming
  3. Algorithms
  4. How is BitBlt implemented [modified]

How is BitBlt implemented [modified]

Scheduled Pinned Locked Moved Algorithms
graphicsadobehardwareannouncement
34 Posts 13 Posters 0 Views 1 Watching
  • Oldest to Newest
  • Newest to Oldest
  • Most Votes
Reply
  • Reply as topic
Log in to reply
This topic has been deleted. Only users with topic management privileges can see it.
  • L Luc Pattyn

    Most of the people active on this forum use Windows; when they would need BitBlt, they would call the function, without worrying too much how it is implemented (that was Microsoft's job). If you want to implement your own BitBlt function, you will run into performance issues; the general lines to maximize performance on pixel operations are: avoid testing and branching, avoid function calls, avoid unnecessary moves, and try to handle as many pixels as possible in a single operation. The details of all this are way beyond the scope of a forum such as this, as I already indicated. With the partial description you provided, I would judge your table look-up is unlikely to provide any speed-up unless it replaces two of the three images, which probably it isn't. A Pentium-class CPU is better at reading a couple of pixel streams (which is a very predictable process, and a prefetch can be organized), than at looking up tables; and it is unlikely the table replaces more than a couple of (fast) instructions anyway. :)

    Luc Pattyn [Forum Guidelines] [My Articles]


    this months tips: - use PRE tags to preserve formatting when showing multi-line code snippets - before you ask a question here, search CodeProject, then Google


    F Offline
    F Offline
    Force Code
    wrote on last edited by
    #8

    Luc Pattyn wrote:

    With the partial description you provided, I would judge your table look-up is unlikely to provide any speed-up unless it replaces two of the three images, which probably it isn't.

    Something just occurred to me: BYTE and[256][256] BYTE or[256][256] BYTE not[256]; That's less than 150K. The question is, could and[x][y] be retrieved quicker than x&y is calculated. I suppose some table like this might already be used behind the scenes if it is actually quicker though. ------ Sorry, that was stupid. But here's what I might try: BYTE ROP[256][256][256]; This would store every possible byte value for the Raster Op I'm doing. I think one table look up would be quicker than three bitwise logical ops. But how much quicker, and is it worth a 16MB table. Guess I'll find out. -- modified at 17:21 Thursday 15th November, 2007

    L 1 Reply Last reply
    0
    • F Force Code

      Luc Pattyn wrote:

      With the partial description you provided, I would judge your table look-up is unlikely to provide any speed-up unless it replaces two of the three images, which probably it isn't.

      Something just occurred to me: BYTE and[256][256] BYTE or[256][256] BYTE not[256]; That's less than 150K. The question is, could and[x][y] be retrieved quicker than x&y is calculated. I suppose some table like this might already be used behind the scenes if it is actually quicker though. ------ Sorry, that was stupid. But here's what I might try: BYTE ROP[256][256][256]; This would store every possible byte value for the Raster Op I'm doing. I think one table look up would be quicker than three bitwise logical ops. But how much quicker, and is it worth a 16MB table. Guess I'll find out. -- modified at 17:21 Thursday 15th November, 2007

      L Offline
      L Offline
      Luc Pattyn
      wrote on last edited by
      #9

      Hi, The optimization of image processing code is handled at the cycle level, that is why C/assembly/MMX/SSE are required here. If you want to explain an approach in a higher level, that is fine to express the concept, but for performance you must visualize the basic CPU operations that will correspond. if BYTE ROP[x][y][z] contains a random value, of course you need a table look-up. At best, these are the basic operations that would occur (assume x,y,z in CPU registers): copy x, shift it by 8 positions, add y to it, shift it by 8 positions, add z to it, add the address of ROP[0][0][0] to it, do an indirect load; so that amounts to 6 simple arithmetic operations and a load. as long as the value of BYTE ROP[x][y][z] can be expressed as a function of x,y,z containing less than say 6 logical operators the direct code would be faster, since it does not need the load at all. So what I am saying is this: on modern CPUs (Pentium class, RISC, and many others) u= x&y | x^z | !y&z would be faster than u=ROP[x][y][z] without any doubt, (say by a factor of 1.33 to 2) From what you have shown so far, I can only conclude that you are a novice in optimization stuff. If you want to continue in this field, I recommend you study some code outside the .NET Framework, and look at assemly code generated (a) by compilers, (b) by manual optimization (e.g. look at the disassembled code of the strlen function, you will find over 100 lines for something that most people would do, somewhat slower, in less than 10 lines! The difference is not much, but that's mainly because strlen is hard to parallellize, although there are MMX versions of it too). :)

      Luc Pattyn [Forum Guidelines] [My Articles]


      this months tips: - use PRE tags to preserve formatting when showing multi-line code snippets - before you ask a question here, search CodeProject, then Google


      F 1 Reply Last reply
      0
      • L Luc Pattyn

        Hi, The optimization of image processing code is handled at the cycle level, that is why C/assembly/MMX/SSE are required here. If you want to explain an approach in a higher level, that is fine to express the concept, but for performance you must visualize the basic CPU operations that will correspond. if BYTE ROP[x][y][z] contains a random value, of course you need a table look-up. At best, these are the basic operations that would occur (assume x,y,z in CPU registers): copy x, shift it by 8 positions, add y to it, shift it by 8 positions, add z to it, add the address of ROP[0][0][0] to it, do an indirect load; so that amounts to 6 simple arithmetic operations and a load. as long as the value of BYTE ROP[x][y][z] can be expressed as a function of x,y,z containing less than say 6 logical operators the direct code would be faster, since it does not need the load at all. So what I am saying is this: on modern CPUs (Pentium class, RISC, and many others) u= x&y | x^z | !y&z would be faster than u=ROP[x][y][z] without any doubt, (say by a factor of 1.33 to 2) From what you have shown so far, I can only conclude that you are a novice in optimization stuff. If you want to continue in this field, I recommend you study some code outside the .NET Framework, and look at assemly code generated (a) by compilers, (b) by manual optimization (e.g. look at the disassembled code of the strlen function, you will find over 100 lines for something that most people would do, somewhat slower, in less than 10 lines! The difference is not much, but that's mainly because strlen is hard to parallellize, although there are MMX versions of it too). :)

        Luc Pattyn [Forum Guidelines] [My Articles]


        this months tips: - use PRE tags to preserve formatting when showing multi-line code snippets - before you ask a question here, search CodeProject, then Google


        F Offline
        F Offline
        Force Code
        wrote on last edited by
        #10

        Luc Pattyn wrote:

        From what you have shown so far, I can only conclude that you are a novice in optimization stuff.

        For the record, I've had two courses in digital logic at the graduate level, (pursuant to my master's degree) but that type of optimization isn't really an option for me here, is it? I certainly know assembler as well. Remember my original problem, however: I am trying to emulate BitBlt with in the actionscript/flash environment which doesn't have it. My goal is to eak out as much speed for the operation in that environment, and see what I come up with. In my previous post I was thinking out loud, essentially in an attempt to get some meaningful feedback for my desired goal, which BTW, I have yet to recieve. Essentially what you're implying is that there is NO WAY to achieve a faster emulation of BitBlt in a flash environment other than a pixel by pixel operation. I appreciate your explanation as to why a table lookup would not be faster than the bitwise logical operation, but once again how have you helped me towards solving my original problem at all? Respectfully, Force Code ---------------------------- "if BYTE ROP[x][y][z] contains a random value, of course you need a table look-up. At best, these are the basic operations that would occur (assume x,y,z in CPU registers): copy x, shift it by 8 positions, add y to it, shift it by 8 positions, add z to it, add the address of ROP[0][0][0] to it, do an indirect load; so that amounts to 6 simplearithmetic operations and a load." I am not at all convinced that all that copying and shifting and adding you describe is in fact the operations involved in a table lookup: My thinking first of all that a compiler would convert x,y,z into a single value and store it in one register and use it as an offset from a base address - there go your litany of copies and such. (But you said it with such authority - I almost assumed for a minute you knew what you were talking about.) -- modified at 19:39 Thursday 15th November, 2007

        L C 2 Replies Last reply
        0
        • F Force Code

          Luc Pattyn wrote:

          From what you have shown so far, I can only conclude that you are a novice in optimization stuff.

          For the record, I've had two courses in digital logic at the graduate level, (pursuant to my master's degree) but that type of optimization isn't really an option for me here, is it? I certainly know assembler as well. Remember my original problem, however: I am trying to emulate BitBlt with in the actionscript/flash environment which doesn't have it. My goal is to eak out as much speed for the operation in that environment, and see what I come up with. In my previous post I was thinking out loud, essentially in an attempt to get some meaningful feedback for my desired goal, which BTW, I have yet to recieve. Essentially what you're implying is that there is NO WAY to achieve a faster emulation of BitBlt in a flash environment other than a pixel by pixel operation. I appreciate your explanation as to why a table lookup would not be faster than the bitwise logical operation, but once again how have you helped me towards solving my original problem at all? Respectfully, Force Code ---------------------------- "if BYTE ROP[x][y][z] contains a random value, of course you need a table look-up. At best, these are the basic operations that would occur (assume x,y,z in CPU registers): copy x, shift it by 8 positions, add y to it, shift it by 8 positions, add z to it, add the address of ROP[0][0][0] to it, do an indirect load; so that amounts to 6 simplearithmetic operations and a load." I am not at all convinced that all that copying and shifting and adding you describe is in fact the operations involved in a table lookup: My thinking first of all that a compiler would convert x,y,z into a single value and store it in one register and use it as an offset from a base address - there go your litany of copies and such. (But you said it with such authority - I almost assumed for a minute you knew what you were talking about.) -- modified at 19:39 Thursday 15th November, 2007

          L Offline
          L Offline
          Luc Pattyn
          wrote on last edited by
          #11

          Force Code wrote:

          I am not at all convinced that all that copying and shifting and adding you describe is in fact the operations involved in a table lookup

          well that too proves my point. how do you think a CPU gets to an element of a three- dimensional array?

          Force Code wrote:

          that there is NO WAY to achieve a faster emulation of BitBlt in a flash environment other than a pixel by pixel operation

          that's not what I said; to the contrary, I told you looking for parallellism is a major way to speed up image processing. To what degree it is possible in your case I can't tell. I am not going to spend the time and solve your problem, I don't even plan on investigating it any further. I have told you how complex your goal may be and what is important to image processing performance. I think you are looking at a major undertaking and probably will be disappointed by the results that you will obtain, but hey don't let anyone discourage you...

          Luc Pattyn [Forum Guidelines] [My Articles]


          this months tips: - use PRE tags to preserve formatting when showing multi-line code snippets - before you ask a question here, search CodeProject, then Google


          1 Reply Last reply
          0
          • F Force Code

            Luc Pattyn wrote:

            Windows apps can call on the built-in BitBlt functions[^].

            The Windows BitBlt was the one I was referring to, and I am familiar with the documented interface for it.

            Luc Pattyn wrote:

            Google (BilBlt, rasterop, theory, internals, ...) will lead you to some junk and some code snippets; there must be some C/C++/Java implementations around. But again, performance wise none of those will match specialized products

            You could save yourself some typing by just responding to every post with "Google it". (Incidentally why does this forum even exist when a link to google is all that's needed.) There are several people in this forum with "Microsoft MVP" in their signature. Don't know if that's literal or some code unique to this forum, but maybe I was thinking someone like that might run across this thread. As far as my problem, I have three bitmaps - two of them never change position in relation to the client area, and the third scrolls up and down. With each scroll the BitBlt w/ ROP code has to be done. I'm thinking now since the first two bitmaps don't move, I can precalclulate values for them and store them in another table maybe 10-15 megs in size. So, part of the ROP calculation could be replaced by an indexed table look-up, though given the ROP code, I don't know how to do this yet, and furthermore don't know if there's any time saving with this method.

            D Offline
            D Offline
            DQNOK
            wrote on last edited by
            #12

            This was unnecessarily harsh. Luc was being very professional and trying to help. From reading the rest of the discourse, it is very obvious to me who knows what he is talking about, and who does not.

            David --------- Empirical studies indicate that 20% of the people drink 80% of the beer. With C++ developers, the rule is that 80% of the developers understand at most 20% of the language. It is not the same 20% for different people, so don't count on them to understand each other's code. http://yosefk.com/c++fqa/picture.html#fqa-6.6 ---------

            L 1 Reply Last reply
            0
            • D DQNOK

              This was unnecessarily harsh. Luc was being very professional and trying to help. From reading the rest of the discourse, it is very obvious to me who knows what he is talking about, and who does not.

              David --------- Empirical studies indicate that 20% of the people drink 80% of the beer. With C++ developers, the rule is that 80% of the developers understand at most 20% of the language. It is not the same 20% for different people, so don't count on them to understand each other's code. http://yosefk.com/c++fqa/picture.html#fqa-6.6 ---------

              L Offline
              L Offline
              Luc Pattyn
              wrote on last edited by
              #13

              :rose:

              F 1 Reply Last reply
              0
              • F Force Code

                Luc Pattyn wrote:

                From what you have shown so far, I can only conclude that you are a novice in optimization stuff.

                For the record, I've had two courses in digital logic at the graduate level, (pursuant to my master's degree) but that type of optimization isn't really an option for me here, is it? I certainly know assembler as well. Remember my original problem, however: I am trying to emulate BitBlt with in the actionscript/flash environment which doesn't have it. My goal is to eak out as much speed for the operation in that environment, and see what I come up with. In my previous post I was thinking out loud, essentially in an attempt to get some meaningful feedback for my desired goal, which BTW, I have yet to recieve. Essentially what you're implying is that there is NO WAY to achieve a faster emulation of BitBlt in a flash environment other than a pixel by pixel operation. I appreciate your explanation as to why a table lookup would not be faster than the bitwise logical operation, but once again how have you helped me towards solving my original problem at all? Respectfully, Force Code ---------------------------- "if BYTE ROP[x][y][z] contains a random value, of course you need a table look-up. At best, these are the basic operations that would occur (assume x,y,z in CPU registers): copy x, shift it by 8 positions, add y to it, shift it by 8 positions, add z to it, add the address of ROP[0][0][0] to it, do an indirect load; so that amounts to 6 simplearithmetic operations and a load." I am not at all convinced that all that copying and shifting and adding you describe is in fact the operations involved in a table lookup: My thinking first of all that a compiler would convert x,y,z into a single value and store it in one register and use it as an offset from a base address - there go your litany of copies and such. (But you said it with such authority - I almost assumed for a minute you knew what you were talking about.) -- modified at 19:39 Thursday 15th November, 2007

                C Offline
                C Offline
                CPallini
                wrote on last edited by
                #14

                You are quite rude, man. :mad: And it is clear you aren't mastering the argument.:) On the other hand, Luc was honestly attempting to help you, pointing out (supported by technical competence) as daunting will be your task. So, please, don't bore us with your arrogance and rudeness. :)

                If the Lord God Almighty had consulted me before embarking upon the Creation, I would have recommended something simpler. -- Alfonso the Wise, 13th Century King of Castile.

                L 1 Reply Last reply
                0
                • C CPallini

                  You are quite rude, man. :mad: And it is clear you aren't mastering the argument.:) On the other hand, Luc was honestly attempting to help you, pointing out (supported by technical competence) as daunting will be your task. So, please, don't bore us with your arrogance and rudeness. :)

                  If the Lord God Almighty had consulted me before embarking upon the Creation, I would have recommended something simpler. -- Alfonso the Wise, 13th Century King of Castile.

                  L Offline
                  L Offline
                  Luc Pattyn
                  wrote on last edited by
                  #15

                  :rose:

                  1 Reply Last reply
                  0
                  • L Luc Pattyn

                    :rose:

                    F Offline
                    F Offline
                    Force Code
                    wrote on last edited by
                    #16

                    Luc Pattyn wrote:

                    So what I am saying is this: on modern CPUs (Pentium class, RISC, and many others) u= x&y | x^z | !y&z would be faster than u=ROP[x][y][z] without any doubt, (say by a factor of 1.33 to 2)

                    #include < iostream >
                    #include < windows.h >

                    using namespace std;

                    BYTE ROP[256][256][256];

                    int main(int argc, char *argv[])
                    {
                    DWORD time = GetCurrentTime();
                    for (int x=0; x<256; x++)
                    for (int y=0; y<256; y++)
                    for (int z=0; z<256; z++)
                    ROP[x][y][z] = (x&y) | (x^z) | (!y&z);
                    cout << "test 1: " << GetCurrentTime()-time << endl;

                    BYTE n;
                    time = GetCurrentTime();
                    for (int x=0; x<256; x++)
                    for (int y=0; y<256; y++)
                    for (int z=0; z<256; z++)
                    n = ROP[x][y][z];
                    cout << "test 2: " << GetCurrentTime()-time << endl;
                    }

                    And the results: C:\dev-cpp\examples\test>test test 1: 240 test 2: 60

                    L S F 3 Replies Last reply
                    0
                    • F Force Code

                      Luc Pattyn wrote:

                      So what I am saying is this: on modern CPUs (Pentium class, RISC, and many others) u= x&y | x^z | !y&z would be faster than u=ROP[x][y][z] without any doubt, (say by a factor of 1.33 to 2)

                      #include < iostream >
                      #include < windows.h >

                      using namespace std;

                      BYTE ROP[256][256][256];

                      int main(int argc, char *argv[])
                      {
                      DWORD time = GetCurrentTime();
                      for (int x=0; x<256; x++)
                      for (int y=0; y<256; y++)
                      for (int z=0; z<256; z++)
                      ROP[x][y][z] = (x&y) | (x^z) | (!y&z);
                      cout << "test 1: " << GetCurrentTime()-time << endl;

                      BYTE n;
                      time = GetCurrentTime();
                      for (int x=0; x<256; x++)
                      for (int y=0; y<256; y++)
                      for (int z=0; z<256; z++)
                      n = ROP[x][y][z];
                      cout << "test 2: " << GetCurrentTime()-time << endl;
                      }

                      And the results: C:\dev-cpp\examples\test>test test 1: 240 test 2: 60

                      L Offline
                      L Offline
                      Luc Pattyn
                      wrote on last edited by
                      #17

                      Force Code wrote:

                      test 1: 240 test 2: 60

                      and a release build gives: test 1: 125 test 2: 0 proving the test program is worth zilch. X| My final statement to you is: try to figure what the corresponding assembly code is.

                      Luc Pattyn [Forum Guidelines] [My Articles]


                      this months tips: - before you ask a question here, search CodeProject, then Google - the quality and detail of your question reflects on the effectiveness of the help you are likely to get - use PRE tags to preserve formatting when showing multi-line code snippets


                      1 Reply Last reply
                      0
                      • F Force Code

                        Luc Pattyn wrote:

                        So what I am saying is this: on modern CPUs (Pentium class, RISC, and many others) u= x&y | x^z | !y&z would be faster than u=ROP[x][y][z] without any doubt, (say by a factor of 1.33 to 2)

                        #include < iostream >
                        #include < windows.h >

                        using namespace std;

                        BYTE ROP[256][256][256];

                        int main(int argc, char *argv[])
                        {
                        DWORD time = GetCurrentTime();
                        for (int x=0; x<256; x++)
                        for (int y=0; y<256; y++)
                        for (int z=0; z<256; z++)
                        ROP[x][y][z] = (x&y) | (x^z) | (!y&z);
                        cout << "test 1: " << GetCurrentTime()-time << endl;

                        BYTE n;
                        time = GetCurrentTime();
                        for (int x=0; x<256; x++)
                        for (int y=0; y<256; y++)
                        for (int z=0; z<256; z++)
                        n = ROP[x][y][z];
                        cout << "test 2: " << GetCurrentTime()-time << endl;
                        }

                        And the results: C:\dev-cpp\examples\test>test test 1: 240 test 2: 60

                        S Offline
                        S Offline
                        Skippums
                        wrote on last edited by
                        #18

                        What are you trying to prove with that benchmark? You are completely comparing apples to oranges here. The first is measuring how long it takes to WRITE some COMPUTED value to some COMPUTED OFFSET. The second is measuring how long it takes to READ some STORED value from the same COMPUTED OFFSET. In addition, the compiler may be moving x, y, or z out of the register space to allow for the arithmetic in the first example, so you may have additional memory reading/writing going on (not to mention that reads are far faster than writes). I bet if you store the result of the first loop to a single int variable, the first will be faster. If this was an attempt to prove that table lookups are faster than actual computing, then you must know that I am ABSOLUTELY not convinced. As far as your imaging is concerned, if you are using c#, then you should use either int or long pointers to the data, and perform your operations on the pointer to a linear array. I just did some image processing stuff, and this proved to be the most efficient in c# that I found. Jeff

                        1 Reply Last reply
                        0
                        • F Force Code

                          Luc Pattyn wrote:

                          So what I am saying is this: on modern CPUs (Pentium class, RISC, and many others) u= x&y | x^z | !y&z would be faster than u=ROP[x][y][z] without any doubt, (say by a factor of 1.33 to 2)

                          #include < iostream >
                          #include < windows.h >

                          using namespace std;

                          BYTE ROP[256][256][256];

                          int main(int argc, char *argv[])
                          {
                          DWORD time = GetCurrentTime();
                          for (int x=0; x<256; x++)
                          for (int y=0; y<256; y++)
                          for (int z=0; z<256; z++)
                          ROP[x][y][z] = (x&y) | (x^z) | (!y&z);
                          cout << "test 1: " << GetCurrentTime()-time << endl;

                          BYTE n;
                          time = GetCurrentTime();
                          for (int x=0; x<256; x++)
                          for (int y=0; y<256; y++)
                          for (int z=0; z<256; z++)
                          n = ROP[x][y][z];
                          cout << "test 2: " << GetCurrentTime()-time << endl;
                          }

                          And the results: C:\dev-cpp\examples\test>test test 1: 240 test 2: 60

                          F Offline
                          F Offline
                          Force Code
                          wrote on last edited by
                          #19

                          OK, it looks like maybe I was mistaken:

                          #include < iostream >
                          #include < windows.h >

                          using namespace std;

                          BYTE ROP[256][256][256];
                          BYTE ROP2[256][256][256];

                          int main(int argc, char *argv[])
                          {
                          DWORD time = GetCurrentTime();
                          for (int x=0; x<256; x++)
                          for (int y=0; y<256; y++)
                          for (int z=0; z<256; z++)
                          ROP[x][y][z] = (x&y) | (x^z) | (!y&z);
                          cout << "test 1: " << GetCurrentTime()-time << endl;

                          time = GetCurrentTime();
                          for (int x=0; x<256; x++)
                          for (int y=0; y<256; y++)
                          for (int z=0; z<256; z++)
                          ROP2[x][y][z] = ROP[x][y][z];
                          cout << "test 2: " << GetCurrentTime()-time << endl;

                          //
                          time = GetCurrentTime();
                          memcpy((void*)ROP2,(void*)ROP,16777216);
                          cout << "test 3: " << GetCurrentTime()-time << endl;

                          }

                          and the results: C:\dev-cpp\examples\test>test test 1: 191 test 2: 310 test 3: 250 But the really strange thing is, in the third case, its just a block transfer of data from ROP to ROP2, but even that's slower than iterating through the entire array and calculating each value one at a time. I'll admit that makes no sense to me.

                          J A 2 Replies Last reply
                          0
                          • F Force Code

                            OK, it looks like maybe I was mistaken:

                            #include < iostream >
                            #include < windows.h >

                            using namespace std;

                            BYTE ROP[256][256][256];
                            BYTE ROP2[256][256][256];

                            int main(int argc, char *argv[])
                            {
                            DWORD time = GetCurrentTime();
                            for (int x=0; x<256; x++)
                            for (int y=0; y<256; y++)
                            for (int z=0; z<256; z++)
                            ROP[x][y][z] = (x&y) | (x^z) | (!y&z);
                            cout << "test 1: " << GetCurrentTime()-time << endl;

                            time = GetCurrentTime();
                            for (int x=0; x<256; x++)
                            for (int y=0; y<256; y++)
                            for (int z=0; z<256; z++)
                            ROP2[x][y][z] = ROP[x][y][z];
                            cout << "test 2: " << GetCurrentTime()-time << endl;

                            //
                            time = GetCurrentTime();
                            memcpy((void*)ROP2,(void*)ROP,16777216);
                            cout << "test 3: " << GetCurrentTime()-time << endl;

                            }

                            and the results: C:\dev-cpp\examples\test>test test 1: 191 test 2: 310 test 3: 250 But the really strange thing is, in the third case, its just a block transfer of data from ROP to ROP2, but even that's slower than iterating through the entire array and calculating each value one at a time. I'll admit that makes no sense to me.

                            J Offline
                            J Offline
                            Jheriko
                            wrote on last edited by
                            #20

                            memcpy is not /that/ fast in most implementations (you can almost always use mmx/sse2 to write a faster one if you want) consider for one thing that the details of the copy are passed as parameters, it must at least do some work to use them, this is a waste if you already know what they are (the for loop tells the compiler for you) also when operating on memory some instructions do not need an additional mov, which means they can be as fast, or faster, than a memory copy, others force you to do two movs (mul and div for instance)

                            1 Reply Last reply
                            0
                            • F Force Code

                              I am trying to write some graphics code using the web-development tool Adobe ActionScript (version 3). AS3 has many sophisticated graphics capabilities, but what it does not have apparently, is any sort of BitBlt capability with raster operation (ROP) codes. This may have something to do with it being based on 'Sprites' which from my brief persual of some wikipedia info, is a different philosophy altogether from BitBlt. I was trying to emulate BitBlt from scratch in ActionScript, by merely dumping three bitmaps into arrays, and then going through them pixel by pixel doing the logical ops on each pixel. As could probably be expected, this is way too slow. It was taking a full second to do ROP bitmap combinations that BitBlt would do instantaneously. (This has nothing to do with any inherent limitation in Actionscript, which does plenty of things graphical instantaneously.) So I need to find out the algorithms that BitBlt is employing to do what it does and if its possible for me to emulate that (assuming that BitBlt isn't relying primarily on graphics hardware that can't be accessed from a web application.) Important point: There is only one ROP code I need to use, so possibly it could be optimized on that basis. (Its one of the more complex ROP codes involving XOR and AND of three different bitmaps). -- modified at 14:44 Thursday 15th November, 2007

                              C Offline
                              C Offline
                              Chris Losinger
                              wrote on last edited by
                              #21

                              IIRC, it's implemented by the specific device driver (and the real work probably happens on the device hardware itself). and, there's no way you're going to even approach the speed of something like BitBlt in a scripting language.

                              image processing toolkits | batch image processing

                              1 Reply Last reply
                              0
                              • F Force Code

                                "Hi, image operations in professional packages are implemented using highly optimized code, typically a combination of C and assembly code; where possible SIMD techniques are applied (that's MMX/SSE on Intel Pentium). It is the result of extensive research, so you won'tmatch its performance easily." If it were documented anywhere, of course I would be able to emulate it (w/ exception of hardware capabilities). There's no apparent performance hit merely from using ActionScript either, which does some things much faster (e.g. antialiasing) than can be done with Windows code, although I have no idea how its done.

                                M Offline
                                M Offline
                                Mark Churchill
                                wrote on last edited by
                                #22

                                IIRC ActionScript is basically JavaScript, and like any interpreted language, its very slow. Sure, if you call an inbuilt antialiasing function or drawline routine, they will execute fast - but they are not written in ActionScript, just made avaliable to ActionScript. There would be no hope of matching a native code implementation with pure ActionScript. I would start looking for a workaround - does AS provide say a copy-paste between canvas, for example?

                                Mark Churchill Director Dunn & Churchill

                                F 1 Reply Last reply
                                0
                                • M Mark Churchill

                                  IIRC ActionScript is basically JavaScript, and like any interpreted language, its very slow. Sure, if you call an inbuilt antialiasing function or drawline routine, they will execute fast - but they are not written in ActionScript, just made avaliable to ActionScript. There would be no hope of matching a native code implementation with pure ActionScript. I would start looking for a workaround - does AS provide say a copy-paste between canvas, for example?

                                  Mark Churchill Director Dunn & Churchill

                                  F Offline
                                  F Offline
                                  Force Code
                                  wrote on last edited by
                                  #23

                                  Mark Churchill wrote:

                                  There would be no hope of matching a native code implementation with pure ActionScript. I would start looking for a workaround - does AS provide say a copy-paste between canvas, for example? Mark Churchill Director Dunn & Churchill

                                  They do have a CopyPixels function, which is like BitBlt without ROP codes. AS also offers a number of different things like BlendModes and filters and merges and so on, of bitmaps. What they don't offer is a straightforward bitwise logical combinations, so if you have something very specific you need to do that doesn't match up with their esoteric sounding functions you're evidently out of luck. For example with BlendMode.Add you have "Adds the values of the constituent colors of the display object to the colors of its background, applying a ceiling of 0xFF. This setting is commonly used for animating a lightening dissolve between two objects." In addition you have things like BlendMode.Multiply BlendMode.Screen, etc. There's even a BlendMode.Invert which is a bitwise NOT, but no AND or OR, for some reason - maybe Microsoft holds a copyright on two of the three logical operators, who knows. I'm sure all these operations are plenty fast as well, judging from my experience with other things in AS (Flex actually). With filters, you have things like convolution filters, displacement filters that look somewhat promising if you're able to decipher what they actually do. There's many other things as well - matrices, alpha, etc.

                                  1 Reply Last reply
                                  0
                                  • F Force Code

                                    I am trying to write some graphics code using the web-development tool Adobe ActionScript (version 3). AS3 has many sophisticated graphics capabilities, but what it does not have apparently, is any sort of BitBlt capability with raster operation (ROP) codes. This may have something to do with it being based on 'Sprites' which from my brief persual of some wikipedia info, is a different philosophy altogether from BitBlt. I was trying to emulate BitBlt from scratch in ActionScript, by merely dumping three bitmaps into arrays, and then going through them pixel by pixel doing the logical ops on each pixel. As could probably be expected, this is way too slow. It was taking a full second to do ROP bitmap combinations that BitBlt would do instantaneously. (This has nothing to do with any inherent limitation in Actionscript, which does plenty of things graphical instantaneously.) So I need to find out the algorithms that BitBlt is employing to do what it does and if its possible for me to emulate that (assuming that BitBlt isn't relying primarily on graphics hardware that can't be accessed from a web application.) Important point: There is only one ROP code I need to use, so possibly it could be optimized on that basis. (Its one of the more complex ROP codes involving XOR and AND of three different bitmaps). -- modified at 14:44 Thursday 15th November, 2007

                                    E Offline
                                    E Offline
                                    El Corazon
                                    wrote on last edited by
                                    #24

                                    Force Code wrote:

                                    So I need to find out the algorithms that BitBlt is employing to do what it does and if its possible for me to emulate that (assuming that BitBlt isn't relying primarily on graphics hardware that can't be accessed from a web application.)

                                    did you try googling bitblt.c? Although the current version has gone through some changes over the years, the code is as old as I am. BitBlt operations were done before accelerated graphics cards for memory manipulaton. It was these routines that were used over and over again that became part of hardware operations on the graphics card (2D acceleration), and eventually part of multimedia and extended instruction sets of the CPU itself. When you find it, you will discover why it was done in hardware to speed it up. I am with the others, you are not going to be happy with the performance. Hardware operations of BitBlt will be nice and smooth, you can achieve the same on a fast processor, but you are wasting a LOT of processor cycles that could be more effectively used on something more important than reinventing the wheel.

                                    _________________________ Asu no koto o ieba, tenjo de nezumi ga warau. Talk about things of tomorrow and the mice in the ceiling laugh. (Japanese Proverb)

                                    F 1 Reply Last reply
                                    0
                                    • E El Corazon

                                      Force Code wrote:

                                      So I need to find out the algorithms that BitBlt is employing to do what it does and if its possible for me to emulate that (assuming that BitBlt isn't relying primarily on graphics hardware that can't be accessed from a web application.)

                                      did you try googling bitblt.c? Although the current version has gone through some changes over the years, the code is as old as I am. BitBlt operations were done before accelerated graphics cards for memory manipulaton. It was these routines that were used over and over again that became part of hardware operations on the graphics card (2D acceleration), and eventually part of multimedia and extended instruction sets of the CPU itself. When you find it, you will discover why it was done in hardware to speed it up. I am with the others, you are not going to be happy with the performance. Hardware operations of BitBlt will be nice and smooth, you can achieve the same on a fast processor, but you are wasting a LOT of processor cycles that could be more effectively used on something more important than reinventing the wheel.

                                      _________________________ Asu no koto o ieba, tenjo de nezumi ga warau. Talk about things of tomorrow and the mice in the ceiling laugh. (Japanese Proverb)

                                      F Offline
                                      F Offline
                                      Force Code
                                      wrote on last edited by
                                      #25

                                      El Corazon wrote:

                                      did you try googling bitblt.c? Although the current version has gone through some changes over the years, the code is as old as I am. BitBlt operations were done before accelerated graphics cards for memory manipulaton. It was these routines that were used over and over again that became part of hardware operations on the graphics card (2D acceleration), and eventually part of multimedia and extended instruction sets of the CPU itself.

                                      Thanks - this is useful tip. If BitBlt is encoded in hardware now, then presumably flash could access it, and probably they already are accessing it, because as I mentioned in a a reply yesterday, they have all sorts of "Blendmodes", filters, merges and so on, that are possible with bitmaps. What they don't have is the ability to specify a logical combination of bitmaps using bitwise operators. The more I think of it, its probably because Microsoft has copyrighted the ROP code concept. The raster op I actually need to use is 00AC0744, which corresponds to the operation (((S ^ D) & P) ^ S), where S,P,D are bitmaps. If anyone knows how to accomplish this using any of the various graphical operations that are a part of the actionscript or flex environment, you would get my sincere thanks.

                                      E D 2 Replies Last reply
                                      0
                                      • F Force Code

                                        El Corazon wrote:

                                        did you try googling bitblt.c? Although the current version has gone through some changes over the years, the code is as old as I am. BitBlt operations were done before accelerated graphics cards for memory manipulaton. It was these routines that were used over and over again that became part of hardware operations on the graphics card (2D acceleration), and eventually part of multimedia and extended instruction sets of the CPU itself.

                                        Thanks - this is useful tip. If BitBlt is encoded in hardware now, then presumably flash could access it, and probably they already are accessing it, because as I mentioned in a a reply yesterday, they have all sorts of "Blendmodes", filters, merges and so on, that are possible with bitmaps. What they don't have is the ability to specify a logical combination of bitmaps using bitwise operators. The more I think of it, its probably because Microsoft has copyrighted the ROP code concept. The raster op I actually need to use is 00AC0744, which corresponds to the operation (((S ^ D) & P) ^ S), where S,P,D are bitmaps. If anyone knows how to accomplish this using any of the various graphical operations that are a part of the actionscript or flex environment, you would get my sincere thanks.

                                        E Offline
                                        E Offline
                                        El Corazon
                                        wrote on last edited by
                                        #26

                                        Force Code wrote:

                                        If BitBlt is encoded in hardware now, then presumably flash could access it, and probably they already are accessing it,

                                        It is in hardware, they are accessing it now through hardware. The primary goal in script languages such as this is to raise the script programmer to a higher level. If you program a low-level routine in script and compare it to a compiled language, the result would always favor the compiled language. So you don't even attempt to compete at that level. Where a graphic library in C/C++ might use BitBlt() functions to manipulate graphics, in script you operate on a higher level at the object level not at the bitmap level. This might actually employ several low-level routines, precompiled into a higher order function that is called by the script. As you grow the language to a higher level your speed approaches compiled equivalence because you are calling pre-compiled low-level routines from the high-level script language. So, yes, I although I have not used flash in ages, I am sure they are using BitBlt() functions in their code somewhere. Now flash did undergo an even larger change about 2 years ago, operating in 2D mode as a 3D higher order function. This bypasses the need for BitBlt() functions because of depth buffer access on a 3D graphics card and blend modes. Where a 3D graphics card is not available to use, I am sure the routines are still there to drop back to 2D BitBlt() functions. Or at least I should say, they intended to change. Since I don't do flash anymore I have no way to verify it. But I was there at the Siggraph presentation talking about the transition from 2D to 3D operations for both speed and additional functionality. The presentation was pretty good since they showed the three phases, reproducing exactly what they had in 2D to 3D (no visible change, only faster drawing), and then they added some "special effects" magic that would be unreasonable under 2D. Sorry, can't help you with the script language itself, I can help in theory, but not specifics when it comes to flash.

                                        _________________________ Asu no koto o ieba, tenjo de nezumi ga warau. Talk about things of tomorrow and the mice in the ceiling laugh. (Japanese Proverb)

                                        F 1 Reply Last reply
                                        0
                                        • E El Corazon

                                          Force Code wrote:

                                          If BitBlt is encoded in hardware now, then presumably flash could access it, and probably they already are accessing it,

                                          It is in hardware, they are accessing it now through hardware. The primary goal in script languages such as this is to raise the script programmer to a higher level. If you program a low-level routine in script and compare it to a compiled language, the result would always favor the compiled language. So you don't even attempt to compete at that level. Where a graphic library in C/C++ might use BitBlt() functions to manipulate graphics, in script you operate on a higher level at the object level not at the bitmap level. This might actually employ several low-level routines, precompiled into a higher order function that is called by the script. As you grow the language to a higher level your speed approaches compiled equivalence because you are calling pre-compiled low-level routines from the high-level script language. So, yes, I although I have not used flash in ages, I am sure they are using BitBlt() functions in their code somewhere. Now flash did undergo an even larger change about 2 years ago, operating in 2D mode as a 3D higher order function. This bypasses the need for BitBlt() functions because of depth buffer access on a 3D graphics card and blend modes. Where a 3D graphics card is not available to use, I am sure the routines are still there to drop back to 2D BitBlt() functions. Or at least I should say, they intended to change. Since I don't do flash anymore I have no way to verify it. But I was there at the Siggraph presentation talking about the transition from 2D to 3D operations for both speed and additional functionality. The presentation was pretty good since they showed the three phases, reproducing exactly what they had in 2D to 3D (no visible change, only faster drawing), and then they added some "special effects" magic that would be unreasonable under 2D. Sorry, can't help you with the script language itself, I can help in theory, but not specifics when it comes to flash.

                                          _________________________ Asu no koto o ieba, tenjo de nezumi ga warau. Talk about things of tomorrow and the mice in the ceiling laugh. (Japanese Proverb)

                                          F Offline
                                          F Offline
                                          Force Code
                                          wrote on last edited by
                                          #27

                                          El Corazon wrote:

                                          The primary goal in script languages such as this is to raise the script programmer to a higher level. If you program a low-level routine in script and compare it to a compiled language, the result would always favor the compiled language. So you don't even attempt to compete at that level. Where a graphic library in C/C++ might use BitBlt() functions to manipulate graphics, in script you operate on a higher level at the object level not at the bitmap level.

                                          Actionscript 3.0, which is what I'm using, nearly duplicates C and C++ syntax with very minor modifications. There is a BitmapData class as well, with a method GetPixels which dumps it into a byte array. So, they certainly give the impression that reasonable speeds might be possible on low-level operations.

                                          E 1 Reply Last reply
                                          0
                                          Reply
                                          • Reply as topic
                                          Log in to reply
                                          • Oldest to Newest
                                          • Newest to Oldest
                                          • Most Votes


                                          • Login

                                          • Don't have an account? Register

                                          • Login or register to search.
                                          • First post
                                            Last post
                                          0
                                          • Categories
                                          • Recent
                                          • Tags
                                          • Popular
                                          • World
                                          • Users
                                          • Groups