Who says you can't beat the compiler? [part 1 of 2]

  • L Lost User

    Actually I do, too. Otherwise newbies might try what I did here:

    static void Normalize(Float3[] array)
    {
        for (int i = 0; i < array.Length; i++)
        {
            Float3 f = array[i];
            float invLen = (float)(1.0 / Math.Sqrt(f.x * f.x + f.y * f.y + f.z * f.z));
            array[i] = new Float3(f.x * invLen, f.y * invLen, f.z * invLen);
        }
    }

    OK, not here yet, but in the part with the assembly and all. This is the 'unoptimized' C# code. It's not meant to be pretty, and it already contains one low-level optimization: one division followed by 3 multiplications is faster than 3 divisions. Using 3 divisions would pain me too much to even consider. The Just-In-Time compiler didn't do such a bad job here: Math.Sqrt gets nicely compiled to fsqrt (32-bit mode) or sqrtsd (64-bit mode), and not too much nonsense goes on around it either. It's a shame it has no clue how to fully use SSE, though. This looks like a rather lame use of SSE to me (x, y and z are at rsp+40h, 44h and 48h):

    movss xmm2,dword ptr [rsp+40h]
    movss xmm0,dword ptr [rsp+40h]
    mulss xmm2,xmm0
    movss xmm1,dword ptr [rsp+44h]
    movss xmm0,dword ptr [rsp+44h]
    mulss xmm1,xmm0
    addss xmm2,xmm1
    movss xmm1,dword ptr [rsp+48h]
    movss xmm0,dword ptr [rsp+48h]
    mulss xmm1,xmm0
    addss xmm2,xmm1
    cvtss2sd xmm0,xmm2
    sqrtsd xmm1,xmm0
    movsd xmm2,mmword ptr [00000160h]
    divsd xmm2,xmm1
    cvtsd2ss xmm0,xmm2
    movss xmm1,dword ptr [rsp+40h]
    mulss xmm1,xmm0
    movss xmm2,dword ptr [rsp+44h]
    mulss xmm2,xmm0
    movss xmm3,dword ptr [rsp+48h]
    mulss xmm3,xmm0

    (Why is none of this coloured? I used lang="asm".) This just makes me sad. We can see here how it puts the arguments to the Float3 ctor in xmm1 through xmm3, which is correct according to the specs, because the first argument, passed in rcx, is a pointer to where the new struct will be put (code omitted, it's not very interesting anyway). This post is split because apparently there's a length limit..?
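For anyone who wants to run the scalar version on its own, here is a minimal sketch. The thread never shows the Float3 definition, so the struct below is an assumption: it carries a fourth padding field `w` because the assembly later in the thread steps 16 bytes per element. `NormalizeWithDivisions` is a hypothetical name added only to contrast the three-division variant with the reciprocal-multiply one described above.

```csharp
using System;
using System.Runtime.InteropServices;

// Assumed layout: not shown in the thread. The w field is padding so that
// each element occupies 16 bytes.
[StructLayout(LayoutKind.Sequential)]
public struct Float3
{
    public float x, y, z, w;

    public Float3(float x, float y, float z)
    {
        this.x = x; this.y = y; this.z = z; this.w = 0f;
    }
}

public static class ScalarNormalize
{
    // One division (the reciprocal) followed by three multiplications.
    public static void Normalize(Float3[] array)
    {
        for (int i = 0; i < array.Length; i++)
        {
            Float3 f = array[i];
            float invLen = (float)(1.0 / Math.Sqrt(f.x * f.x + f.y * f.y + f.z * f.z));
            array[i] = new Float3(f.x * invLen, f.y * invLen, f.z * invLen);
        }
    }

    // Hypothetical contrast case: three divisions per vector.
    public static void NormalizeWithDivisions(Float3[] array)
    {
        for (int i = 0; i < array.Length; i++)
        {
            Float3 f = array[i];
            float len = (float)Math.Sqrt(f.x * f.x + f.y * f.y + f.z * f.z);
            array[i] = new Float3(f.x / len, f.y / len, f.z / len);
        }
    }

    public static void Main()
    {
        var data = new[] { new Float3(1, 2, 3), new Float3(0, 3, 4) };
        Normalize(data);
        foreach (Float3 v in data)
            Console.WriteLine($"{v.x}, {v.y}, {v.z}");
    }
}
```

Both variants should give results that agree to within float rounding; the difference is only in how many divisions the CPU has to execute per vector.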

    _Damian S_ wrote (#8):

    Impressive... but I call that someone needs to check around to see if there's a life somewhere they can grab!! :laugh:

    I don't have ADHD, I have ADOS... Attention Deficit oooh SHINY!! If you like cars, check out the Booger Mobile blog | If you feel generous - make a donation to Camp Quality!!

      John M Drescher wrote (#9):

      I would think it would not take a lot of craftiness to beat the .NET or a JVM with ASM.

      John

        Joe Woodbury wrote (#10):

        Of course you can beat the compiler, the question is whether it's worth the time and effort. (Of course if you were really fanatical, you'd write this in native assembly and load either the 32-bit or 64-bit [or ARM or whatever] DLL and use that.)

        • J John M Drescher

          I would think it would not take a lot of craftiness to beat the .NET or a JVM with ASM.

          John

          Lost User wrote (#11):

          That's sort of the point.

          • R Rama Krishna Vavilala

            harold aptroot wrote:

            Who says you can't beat the compiler?

            I have never heard anyone say that.:~ Handcrafting assembly code has always been one way to optimize some critical routines. Though in practice it happens very rarely.

            Henry Minute wrote (#12):

            Rama Krishna Vavilala wrote:

            I have never heard anyone say that

            That's not surprising. I googled the phrase and only got 3.75 pages back.

            Henry Minute Do not read medical books! You could die of a misprint. - Mark Twain Girl: (staring) "Why do you need an icy cucumber?" “I want to report a fraud. The government is lying to us all.”

              puromtec1 wrote (#13):

              Is it just my perception, or is this actually the first legitimate technical article posted in the lounge? I give it a 5, btw.

              • L Lost User

                Surely we can do better than this. As it turns out, we can! I wrote a little class that uses VirtualAlloc (to get executable+writable memory) and a quickly hacked-together assembler so we can do this:

                        Asm asm = new Asm(@"
                        loop:
                            test    edx,    edx
                            jz      end
                            movaps  xmm0,   [rcx]
                            movaps  xmm2,   xmm0
                            mulps   xmm0,   xmm0
                            movaps  xmm1,   xmm0
                            shufps  xmm0,   xmm0,   ( 2, 1, 0, 3 )
                            addps   xmm1,   xmm0
                            movaps  xmm0,   xmm1
                            shufps  xmm1,   xmm1,   ( 1, 0, 3, 2 )
                            addps   xmm0,   xmm1
                            rsqrtps xmm0,   xmm0
                            mulps   xmm0,   xmm2
                            movaps  [rcx],  xmm0
                            add     edx,    -1
                            add     rcx,    16
                            jmp     loop
                        end:
                            ret
                        ");
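To make the shuffle-and-add part easier to follow, here is a rough lane-by-lane emulation in plain C#. It is only a sketch: the helper names are made up, the selector tuples are assumed to list the highest lane first (as the `_MM_SHUFFLE` macro does), the fourth lane is assumed to be zero padding, and an exact `1 / sqrt` stands in for `rsqrtps`, which in hardware only returns an approximation.

```csharp
using System;

public static class LaneSketch
{
    // Emulates shufps with both operands equal. Selectors are given highest
    // lane first (assumed), so result lane 0 comes from v[s0].
    public static float[] Shuffle(float[] v, int s3, int s2, int s1, int s0)
        => new[] { v[s0], v[s1], v[s2], v[s3] };

    public static float[] Add(float[] a, float[] b)
        => new[] { a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3] };

    public static float[] Mul(float[] a, float[] b)
        => new[] { a[0] * b[0], a[1] * b[1], a[2] * b[2], a[3] * b[3] };

    public static float[] NormalizeLanes(float[] v)
    {
        float[] sq = Mul(v, v);                        // mulps: square every lane
        float[] s = Add(sq, Shuffle(sq, 2, 1, 0, 3));  // rotate by one lane, add
        float[] t = Add(s, Shuffle(s, 1, 0, 3, 2));    // swap halves, add:
                                                       // every lane now holds x²+y²+z²+w²
        var inv = new float[4];
        for (int i = 0; i < 4; i++)
            inv[i] = (float)(1.0 / Math.Sqrt(t[i]));   // exact stand-in for rsqrtps
        return Mul(inv, v);                            // mulps with the saved input
    }

    public static void Main()
    {
        float[] r = NormalizeLanes(new float[] { 1f, 2f, 3f, 0f });
        Console.WriteLine(string.Join(", ", r));
    }
}
```

The point of the two shuffles is that after the second add all four lanes contain the full sum of squares, so a single packed reciprocal square root and a single packed multiply scale x, y and z at once.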
                

                And to benchmark it, I used:

                        Float3[] f = new Float3[0x1000];
                        // we don't want to measure JIT overhead later
                        Normalize(f);
                        unsafe { asm.GetDelegate<Method>()((Float3*)0, 0); }

                        for (int j = 0; j < 10; j++)
                        {
                            for (int i = 0; i < f.Length; i++)
                                f[i] = new Float3(1, 2, 3);

                            Stopwatch s2 = Stopwatch.StartNew();
                            Normalize(f);
                            s2.Stop();
                            Console.WriteLine("C#: " + s2.ElapsedTicks);

                            for (int i = 0; i < f.Length; i++)
                                f[i] = new Float3(1, 2, 3);

                            Method m = asm.GetDelegate<Method>();
                            Stopwatch s = Stopwatch.StartNew();
                            unsafe
                            {
                                fixed (Float3* fptr = f)
                                {
                                    m(fptr, f.Length);
                                }
                            }
                            s.Stop();
                            Console.WriteLine("ASM: " + s.ElapsedTicks);
                        }
                

                Finally it's time for the results! Did we beat the compiler? YES! Here's the result of one run:

                C#: 624168
                ASM: 127836 // anyone? what happened here?
                C#: 615807
                ASM: 66465
                C#: 615780
                ASM: 66294
                C#: 615726
                ASM: 66276
                C#: 615717
                ASM: 66285
                C#: 615744
                ASM: 66285
                C#: 615726
                ASM: 66276
                C#: 617112
                ASM: 66285
                C#: 615726
                ASM: 66285
                C#: 615735
                ASM: 66285
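To put a number on the gap, the tick counts above can be averaged and divided. The sketch below does that for runs 2 to 10, leaving out the first pair because of the unexplained slow first ASM call; the class and method names are made up for illustration. With these figures the ratio comes out at roughly nine.

```csharp
using System;
using System.Linq;

public static class SpeedupSketch
{
    // Tick counts copied from the run above, first iteration excluded.
    public static readonly long[] CSharpTicks =
        { 615807, 615780, 615726, 615717, 615744, 615726, 617112, 615726, 615735 };

    public static readonly long[] AsmTicks =
        { 66465, 66294, 66276, 66285, 66285, 66276, 66285, 66285, 66285 };

    // Ratio of the mean baseline time to the mean candidate time.
    public static double MeanRatio(long[] baselineTicks, long[] candidateTicks)
        => baselineTicks.Average() / candidateTicks.Average();

    public static void Main()
    {
        Console.WriteLine($"Speedup: {MeanRatio(CSharpTicks, AsmTicks):F2}x");
    }
}
```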

                So there you have it, you can beat the compiler. The .NET JIT compiler at least :) Some small notes: - using

                leppie wrote (#14):

                harold aptroot wrote:

                ASM: 127836 // anyone? what happened here?

                Did you call Marshal.Prelink after getting the function pointer?

                xacc.ide
                IronScheme - 1.0 RC 1 - out now!
                ((λ (x) `(,x ',x)) '(λ (x) `(,x ',x))) The Scheme Programming Language – Fourth Edition

                • P puromtec1

                  Is it just my perception, or is this actually the first legitimate technical article posted in the lounge? I give it a 5, btw.

                  leppie wrote (#15):

                  puromtec1 wrote:

                  first legitimate technical article posted in the lounge?

                  I suspect he couldn't find his blog :)

                  xacc.ide
                  IronScheme - 1.0 RC 1 - out now!
                  ((λ (x) `(,x ',x)) '(λ (x) `(,x ',x))) The Scheme Programming Language – Fourth Edition

                  • J Joe Woodbury

                    Of course you can beat the compiler, the question is whether it's worth the time and effort. (Of course if you were really fanatical, you'd write this in native assembly and load either the 32-bit or 64-bit [or ARM or whatever] DLL and use that.)

                    Rick York wrote (#16):

                    That was similar to my thought also. It would be interesting to see the relative performance of native C++ and native assembly code on that snippet.

                      0x3c0 wrote (#17):

                      I enjoyed reading that - perhaps you could post it as a Tip and Trick?

                      OSDev :)

                        peterchen wrote (#18):

                        TL;DR, articles go here

                        Agh! Reality! My Archnemesis!
                        | FoldWithUs! | sighist | WhoIncludes - Analyzing C++ include file hierarchy

                          Mladen Jankovic wrote (#19):

                          Yeah, but the real question is - can you beat Intel C++ compiler?

                          [Genetic Algorithm Library] [Wowd]

                          modified on Wednesday, September 8, 2010 2:26 AM

                            Lost User wrote (#20):

                            harold aptroot wrote:

                            ASM: 127836 // anyone? what happened here?

                            Instructions written into memory allocated by VirtualAlloc will most likely cause an L1/L2 cache miss. The extra clock cycles were probably spent utilizing the TLB to find the physical memory offset. You can try using the prefetchnta instruction to move the memory into L1 if you want to avoid the initial cache miss. Keep in mind that prefetchnta is only a hint and will sometimes be ignored under certain conditions.

                            Best Wishes, -David Delaune

                            • L Lost User

                              Actually I do, too. Otherwise newbies might try what I did here:

                              static void Normalize(Float3[] array)
                              {
                              for (int i = 0; i < array.Length; i++)
                              {
                              Float3 f = array[i];
                              float invLen = (float)(1.0 / Math.Sqrt(f.x * f.x + f.y * f.y + f.z * f.z));
                              array[i] = new Float3(f.x * invLen, f.y * invLen, f.z * invLen);
                              }
                              }

                              OK, not here yet, but the part with the assembly and all. This is the 'unoptimized' C# code. It's not meant to be pretty, and it actually has a low-level optimization already: 3 multiplications and a division are faster than 3 divisions. Using 3 divisions would pain me too much to even consider. The Just In Time compiler didn't even do such a bad job here; the Math.Sqrt gets nicely compiled to fsqrt (32-bit mode) or sqrtsd (64-bit mode), and not too much nonsense goes on around it either. It's a shame it has no clue how to fully use SSE, though. This looks like a rather lame use of SSE to me (x, y and z are in rsp+40h, 44h and 48h):

                              movss xmm2,dword ptr [rsp+40h]
                              movss xmm0,dword ptr [rsp+40h]
                              mulss xmm2,xmm0
                              movss xmm1,dword ptr [rsp+44h]
                              movss xmm0,dword ptr [rsp+44h]
                              mulss xmm1,xmm0
                              addss xmm2,xmm1
                              movss xmm1,dword ptr [rsp+48h]
                              movss xmm0,dword ptr [rsp+48h]
                              mulss xmm1,xmm0
                              addss xmm2,xmm1
                              cvtss2sd xmm0,xmm2
                              sqrtsd xmm1,xmm0
                              movsd xmm2,mmword ptr [00000160h]
                              divsd xmm2,xmm1
                              cvtsd2ss xmm0,xmm2
                              movss xmm1,dword ptr [rsp+40h]
                              mulss xmm1,xmm0
                              movss xmm2,dword ptr [rsp+44h]
                              mulss xmm2,xmm0
                              movss xmm3,dword ptr [rsp+48h]
                              mulss xmm3,xmm0

                              (Why is none of this coloured? I used lang="asm".) This just makes me sad. We can see here how it puts the arguments to the Float3 ctor in xmm1:xmm3, which is correct according to the specs, because the first argument will be a pointer to "where the new struct will be put" in rcx (code omitted, it's not very interesting anyway). This post is split because apparently there's a length limit..?

                              Daniel Grunwald wrote:
                              #21

                              I haven't seen a compiler that can properly use SSE2 yet. Auto-vectorization often only works in trivial cases; in other cases packed instructions go unused. However, instead of dropping to assembler, you can write C code using intrinsics. This way you only select the ASM instructions to use, and the compiler will pick the instruction ordering and register allocation for you. And unlike inline ASM code, intrinsics are portable between multiple C compilers and between x86 and x86-64.

                              Also, your optimized code is not equivalent: it has a much lower floating point precision. RSQRTPS is (a lot) faster than SQRTPS, but has much less precision. You need to do an additional Newton-Raphson step to arrive somewhere close to normal float precision. Of course, the loss of precision may be acceptable in your case, but it's the reason compilers cannot do this optimization automatically.

                              Moreover, the way your C# code is written, even the cvtss2sd/cvtsd2ss dance is mandatory for the compiler: you are dividing a double (1.0) by a double (the Math.Sqrt result), so the compiler is not allowed to introduce additional rounding errors by rounding the intermediate result to float. You might have gotten more efficient code by writing

                              float invLen = 1.0f / (float)Math.Sqrt(f.x * f.x + f.y * f.y + f.z * f.z);
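To make the intrinsics suggestion concrete, here is a minimal sketch in C of the same normalization, using the fast reciprocal square root plus one Newton-Raphson refinement step. It is an illustration only, not code from the thread: the names `Float4` and `normalize_sse` are hypothetical, and it assumes each vector occupies 16 bytes (matching the 16-byte stride of the assembly version) with the fourth lane set to zero. Unaligned loads and stores are used so the sketch does not depend on how the array was allocated.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_rsqrt_ps, ... */

/* Hypothetical 16-byte vector: x, y, z plus one padding lane.
   Assumption: 'pad' is 0, otherwise it would contribute to the length. */
typedef struct { float x, y, z, pad; } Float4;

/* Normalizes 'count' vectors in place. */
static void normalize_sse(Float4 *v, int count)
{
    assert(v != NULL || count <= 0);

    const __m128 half  = _mm_set1_ps(0.5f);
    const __m128 three = _mm_set1_ps(3.0f);

    for (int i = 0; i < count; i++) {
        __m128 p  = _mm_loadu_ps(&v[i].x);   /* [x, y, z, pad] */
        __m128 sq = _mm_mul_ps(p, p);        /* squares of each lane */

        /* Horizontal sum: two shuffle+add steps leave the sum of all
           four squares in every lane. */
        __m128 t = _mm_add_ps(sq, _mm_shuffle_ps(sq, sq, _MM_SHUFFLE(2, 3, 0, 1)));
        __m128 s = _mm_add_ps(t,  _mm_shuffle_ps(t,  t,  _MM_SHUFFLE(1, 0, 3, 2)));

        /* Low-precision estimate of 1/sqrt(s) (this is RSQRTPS). */
        __m128 r = _mm_rsqrt_ps(s);

        /* One Newton-Raphson step: r' = 0.5 * r * (3 - s * r * r). */
        __m128 srr = _mm_mul_ps(s, _mm_mul_ps(r, r));
        r = _mm_mul_ps(_mm_mul_ps(half, r), _mm_sub_ps(three, srr));

        _mm_storeu_ps(&v[i].x, _mm_mul_ps(p, r));
    }
}
```

Calling `normalize_sse` on an array holding (3, 0, 4, 0) should give approximately (0.6, 0, 0.8, 0). Dropping the Newton-Raphson line gives the faster but less precise behaviour of the hand-written assembly. As with that version, a zero-length vector is not handled specially.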

                              • L Lost User

                                Surely we can do better than this. As it turns out, we can! I wrote a little class that uses VirtualAlloc (to get executable+writable memory) and a quickly hacked-together assembler so we can do this:

                                        Asm asm = new Asm(@"
                                        loop:
                                            test    edx,    edx
                                            jz      end
                                            movaps  xmm0,   [rcx]
                                            movaps  xmm2,   xmm0
                                            mulps   xmm0,   xmm0
                                            movaps  xmm1,   xmm0
                                            shufps  xmm0,   xmm0,   ( 2, 1, 0, 3 )
                                            addps   xmm1,   xmm0
                                            movaps  xmm0,   xmm1
                                            shufps  xmm1,   xmm1,   ( 1, 0, 3, 2 )
                                            addps   xmm0,   xmm1
                                            rsqrtps xmm0,   xmm0
                                            mulps   xmm0,   xmm2
                                            movaps  [rcx],  xmm0
                                            add     edx,    -1
                                            add     rcx,    16
                                            jmp     loop
                                        end:
                                            ret
                                        ");
                                

                                And to benchmark it, I used:

                                        Float3[] f = new Float3[0x1000];
                                        // we don't want to measure JIT overhead later
                                        Normalize(f);
                                        unsafe { asm.GetDelegate<Method>()((Float3*)0, 0); }

                                        for (int j = 0; j < 10; j++)
                                        {
                                            for (int i = 0; i < f.Length; i++)
                                                f[i] = new Float3(1, 2, 3);

                                            Stopwatch s2 = Stopwatch.StartNew();
                                            Normalize(f);
                                            s2.Stop();
                                            Console.WriteLine("C#: " + s2.ElapsedTicks);

                                            for (int i = 0; i < f.Length; i++)
                                                f[i] = new Float3(1, 2, 3);

                                            Method m = asm.GetDelegate<Method>();
                                            Stopwatch s = Stopwatch.StartNew();
                                            unsafe
                                            {
                                                fixed (Float3* fptr = f)
                                                {
                                                    m(fptr, f.Length);
                                                }
                                            }
                                            s.Stop();
                                            Console.WriteLine("ASM: " + s.ElapsedTicks);
                                        }
                                

                                Finally it's time for the results! Did we beat the compiler? YES! Here's the result of one run:

                                C#: 624168
                                ASM: 127836 // anyone? what happened here?
                                C#: 615807
                                ASM: 66465
                                C#: 615780
                                ASM: 66294
                                C#: 615726
                                ASM: 66276
                                C#: 615717
                                ASM: 66285
                                C#: 615744
                                ASM: 66285
                                C#: 615726
                                ASM: 66276
                                C#: 617112
                                ASM: 66285
                                C#: 615726
                                ASM: 66285
                                C#: 615735
                                ASM: 66285

                                So there you have it, you can beat the compiler. The .NET JIT compiler at least :) Some small notes: - using

                                Giorgi Dalakishvili wrote:
                                #22

                                Sorry about stupid question but where does the Asm class come from?

                                Giorgi Dalakishvili #region signature My Articles Browsing xkcd in a windows 7 way[^] #endregion

                                  Cesar de Souza wrote:
                                  #23

                                  I am also interested in the answer :)

                                  http://crsouza.com

                                  • L leppie

                                    harold aptroot wrote:

                                    ASM: 127836 // anyone? what happened here?

                                    Did you call Marshal.Prelink after getting the function pointer?

                                    xacc.ide
                                    IronScheme - 1.0 RC 1 - out now!
                                    ((λ (x) `(,x ',x)) '(λ (x) `(,x ',x))) The Scheme Programming Language – Fourth Edition

                                    Lost User wrote:
                                    #24

                                    No, thanks, I'll keep that in mind :)

                                      Chris Trelawny Ross wrote:
                                      #25

                                      Umm ...

                                      harold aptroot wrote:

                                      The Just In Time compiler didn't even do such a bad job here...

                                      ... the Just in Time compiler takes as its input the IL code that is generated by the C# compiler or, in the case of the IL assembly language you hand crafted, the output of the IL assembler. So, actually, the JIT compiler hasn't done anything yet. :-\

                                        Lost User wrote:
                                        #26

                                        I wrote it. The source is a bit long to post here, though, since it includes a rudimentary assembler.

                                        • M Mladen Jankovic

                                          Yeah, but the real question is - can you beat Intel C++ compiler?

                                          [Genetic Algorithm Library] [Wowd]

                                          modified on Wednesday, September 8, 2010 2:26 AM

                                          Lost User wrote:
                                          #27

                                          I doubt it, it's pretty smart..
