Who says you can't beat the compiler? [part 1 of 2]

    Lost User wrote:

    Actually I do, too. Otherwise newbies might try what I did here:

    static void Normalize(Float3[] array)
    {
        for (int i = 0; i < array.Length; i++)
        {
            Float3 f = array[i];
            float invLen = (float)(1.0 / Math.Sqrt(f.x * f.x + f.y * f.y + f.z * f.z));
            array[i] = new Float3(f.x * invLen, f.y * invLen, f.z * invLen);
        }
    }

    OK, not here yet, but in the part with the assembly and all. This is the 'unoptimized' C# code. It's not meant to be pretty, and it actually has a low-level optimization already: three multiplications and one division are faster than three divisions. Using three divisions would pain me too much to even consider.

    The Just-In-Time compiler didn't do such a bad job here: Math.Sqrt gets nicely compiled to fsqrt (32-bit mode) or sqrtsd (64-bit mode), and not too much nonsense goes on around it either. It's a shame it has no clue how to fully use SSE, though. This looks like a rather lame use of SSE to me (x, y and z are at rsp+40h, 44h and 48h):

    movss xmm2,dword ptr [rsp+40h]
    movss xmm0,dword ptr [rsp+40h]
    mulss xmm2,xmm0
    movss xmm1,dword ptr [rsp+44h]
    movss xmm0,dword ptr [rsp+44h]
    mulss xmm1,xmm0
    addss xmm2,xmm1
    movss xmm1,dword ptr [rsp+48h]
    movss xmm0,dword ptr [rsp+48h]
    mulss xmm1,xmm0
    addss xmm2,xmm1
    cvtss2sd xmm0,xmm2
    sqrtsd xmm1,xmm0
    movsd xmm2,mmword ptr [00000160h]
    divsd xmm2,xmm1
    cvtsd2ss xmm0,xmm2
    movss xmm1,dword ptr [rsp+40h]
    mulss xmm1,xmm0
    movss xmm2,dword ptr [rsp+44h]
    mulss xmm2,xmm0
    movss xmm3,dword ptr [rsp+48h]
    mulss xmm3,xmm0

    (Why is none of this coloured? I used lang="asm".) This just makes me sad. We can see here how it puts the arguments to the Float3 constructor in xmm1 through xmm3, which is correct according to the specs, because the first argument, passed in rcx, is a pointer to where the new struct will be put (code omitted, it's not very interesting anyway). This post is split because apparently there's a length limit..?

    Mladen Jankovic wrote (#19):

    Yeah, but the real question is: can you beat the Intel C++ compiler?

    [Genetic Algorithm Library] [Wowd]

    modified on Wednesday, September 8, 2010 2:26 AM

      Lost User wrote:

      Surely we can do better than this. As it turns out, we can! I wrote a little class that uses VirtualAlloc (to get executable+writable memory) and a quickly hacked-together assembler so we can do this:

              Asm asm = new Asm(@"
              loop:
                  test    edx,    edx
                  jz      end
                  movaps  xmm0,   [rcx]
                  movaps  xmm2,   xmm0
                  mulps   xmm0,   xmm0
                  movaps  xmm1,   xmm0
                  shufps  xmm0,   xmm0,   ( 2, 1, 0, 3 )
                  addps   xmm1,   xmm0
                  movaps  xmm0,   xmm1
                  shufps  xmm1,   xmm1,   ( 1, 0, 3, 2 )
                  addps   xmm0,   xmm1
                  rsqrtps xmm0,   xmm0
                  mulps   xmm0,   xmm2
                  movaps  [rcx],  xmm0
                  add     edx,    -1
                  add     rcx,    16
                  jmp     loop
              end:
                  ret
              ");
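
Read lane by lane, the loop squares all four floats of a 16-byte element, sums them with two shuffle-and-add steps so that every lane holds the full sum, takes the approximate reciprocal square root, and multiplies the original vector by it. A scalar C# sketch of that data flow follows. It assumes each element is padded to 16 bytes with a fourth float that is zero (the post doesn't show the struct layout, so that is a guess); the `Vec4` type and the `NormalizeLanes` name are made up for illustration, and an exact `1 / Math.Sqrt` stands in for the lower-precision rsqrtps.

```csharp
using System;

// Hypothetical 16-byte stand-in for the padded Float3; W is assumed to be 0.
public struct Vec4
{
    public float X, Y, Z, W;
    public Vec4(float x, float y, float z, float w) { X = x; Y = y; Z = z; W = w; }
}

public static class LaneSketch
{
    // Mirrors the data flow of the SSE loop for one element, in scalar code.
    public static Vec4 NormalizeLanes(Vec4 v)
    {
        // mulps xmm0, xmm0: square every lane
        float[] sq = { v.X * v.X, v.Y * v.Y, v.Z * v.Z, v.W * v.W };

        // first shufps + addps: add a copy rotated by one lane
        float[] pair = new float[4];
        for (int i = 0; i < 4; i++)
            pair[i] = sq[i] + sq[(i + 3) % 4];

        // second shufps + addps: add a copy rotated by two lanes (halves swapped)
        float[] sum = new float[4];
        for (int i = 0; i < 4; i++)
            sum[i] = pair[i] + pair[(i + 2) % 4];

        // rsqrtps only approximates this; the exact value is used here
        float inv = (float)(1.0 / Math.Sqrt(sum[0]));

        // mulps xmm0, xmm2: scale the original vector
        return new Vec4(v.X * inv, v.Y * inv, v.Z * inv, v.W * inv);
    }

    public static void Main()
    {
        Vec4 n = NormalizeLanes(new Vec4(1, 2, 3, 0));
        Console.WriteLine(n.X + " " + n.Y + " " + n.Z + " " + n.W);
    }
}
```

Because the fourth lane is added into the sum as well, this only normalizes correctly if the padding float really is zero.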
      

      And to benchmark it, I used:

              Float3[] f = new Float3[0x1000];
              // we don't want to measure JIT overhead later
              Normalize(f);
              unsafe { asm.GetDelegate<Method>()((Float3*)0, 0); }

              for (int j = 0; j < 10; j++)
              {
                  for (int i = 0; i < f.Length; i++)
                      f[i] = new Float3(1, 2, 3);

                  Stopwatch s2 = Stopwatch.StartNew();
                  Normalize(f);
                  s2.Stop();
                  Console.WriteLine("C#: " + s2.ElapsedTicks);

                  for (int i = 0; i < f.Length; i++)
                      f[i] = new Float3(1, 2, 3);

                  Method m = asm.GetDelegate<Method>();
                  Stopwatch s = Stopwatch.StartNew();
                  unsafe
                  {
                      fixed (Float3* fptr = f)
                      {
                          m(fptr, f.Length);
                      }
                  }
                  s.Stop();
                  Console.WriteLine("ASM: " + s.ElapsedTicks);
              }
      

      Finally it's time for the results! Did we beat the compiler? YES! Here's the result of one run:

      C#: 624168
      ASM: 127836 // anyone? what happened here?
      C#: 615807
      ASM: 66465
      C#: 615780
      ASM: 66294
      C#: 615726
      ASM: 66276
      C#: 615717
      ASM: 66285
      C#: 615744
      ASM: 66285
      C#: 615726
      ASM: 66276
      C#: 617112
      ASM: 66285
      C#: 615726
      ASM: 66285
      C#: 615735
      ASM: 66285

      So there you have it, you can beat the compiler. The .NET JIT compiler at least :) Some small notes: - using

      Lost User wrote (#20):

      harold aptroot wrote:

      ASM: 127836 // anyone? what happened here?

      Instructions written into memory allocated by VirtualAlloc will most likely cause an L1/L2 cache miss. The extra clock cycles were probably spent using the TLB to find the physical memory offset. You can try the prefetchnta instruction to move the memory into L1 if you want to avoid the initial cache miss. Keep in mind that prefetchnta is only a hint and will be ignored under certain conditions.

      Best Wishes, -David Delaune
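
Whatever the cause turns out to be, the slow first ASM timing is easy to isolate in the harness itself. A small sketch, not the original benchmark: time every run separately and report the first run apart from the best of the remaining ones, so one-off costs don't get mixed into the steady-state number. `Timing.Measure` and `Timing.BestFrom` are made-up helper names, and the workload in `Main` is only a placeholder.

```csharp
using System;
using System.Diagnostics;

public static class Timing
{
    // Runs 'work' the given number of times and returns the elapsed ticks of each run.
    public static long[] Measure(Action work, int runs)
    {
        long[] ticks = new long[runs];
        for (int i = 0; i < runs; i++)
        {
            Stopwatch sw = Stopwatch.StartNew();
            work();
            sw.Stop();
            ticks[i] = sw.ElapsedTicks;
        }
        return ticks;
    }

    // Smallest value from index 'start' onward: the steady-state estimate.
    public static long BestFrom(long[] ticks, int start)
    {
        long best = long.MaxValue;
        for (int i = start; i < ticks.Length; i++)
            best = Math.Min(best, ticks[i]);
        return best;
    }

    public static void Main()
    {
        double sink = 0;
        // Placeholder workload; substitute the call that is being measured.
        long[] t = Measure(() => { for (int i = 1; i < 100000; i++) sink += Math.Sqrt(i); }, 10);
        Console.WriteLine("first run: " + t[0] + ", best of the rest: " + BestFrom(t, 1));
        Console.WriteLine("checksum: " + sink);
    }
}
```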

        Daniel Grunwald wrote (#21):

        I haven't seen a compiler that can properly use SSE2 yet. Auto-vectorization often only works in trivial cases; in other cases packed instructions go unused. However, instead of dropping to assembler, you can write C code using intrinsics. This way you only select the ASM instructions to use, and the compiler picks the instruction ordering and register allocation for you. And unlike inline ASM code, intrinsics are portable between multiple C compilers and between x86 and x86-64.

        Also, your optimized code is not equivalent: it has a much lower floating-point precision. RSQRTPS is (a lot) faster than SQRTPS, but has much less precision. You need to do an additional Newton-Raphson step to arrive somewhere close to normal float precision. Of course, the loss of precision may be acceptable in your case, but it's the reason compilers cannot do this optimization automatically.

        Moreover, the way your C# code is written, even the cvtss2sd/cvtsd2ss dance is mandatory for the compiler: you are dividing a double (1.0) by a double (the Math.Sqrt result), so the compiler is not allowed to introduce additional rounding errors by rounding the intermediate result to float. You might have gotten more efficient code by writing:

        float invLen = 1.0f / (float)Math.Sqrt(f.x * f.x + f.y * f.y + f.z * f.z);
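
The Newton-Raphson step mentioned above is short enough to show. For an estimate y of 1/sqrt(x), one iteration is y' = y * (1.5 - 0.5 * x * y * y), which roughly squares the relative error. Plain C# has no direct access to rsqrtps here, so this sketch fakes a low-precision estimate by nudging the exact result by a relative 3e-4, which is about the size of the error bound Intel documents for RSQRTPS (1.5 * 2^-12). The `Coarse` and `Refine` names are made up for illustration.

```csharp
using System;

public static class RsqrtSketch
{
    // Stand-in for rsqrtps: the exact value, deliberately off by a relative 3e-4.
    public static float Coarse(float x)
    {
        return (float)(1.0 / Math.Sqrt(x)) * 1.0003f;
    }

    // One Newton-Raphson step for f(y) = 1/(y*y) - x.
    public static float Refine(float x, float y)
    {
        return y * (1.5f - 0.5f * x * y * y);
    }

    public static void Main()
    {
        float x = 14f; // 1*1 + 2*2 + 3*3
        double exact = 1.0 / Math.Sqrt(x);
        float y0 = Coarse(x);
        float y1 = Refine(x, y0);
        Console.WriteLine("relative error before: " + Math.Abs(y0 - exact) / exact);
        Console.WriteLine("relative error after:  " + Math.Abs(y1 - exact) / exact);
    }
}
```

In the packed version this costs a few extra mulps and one subps per iteration, which is the price of getting back to roughly float precision.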

          Giorgi Dalakishvili wrote (#22):

          Sorry about the stupid question, but where does the Asm class come from?

          Giorgi Dalakishvili #region signature My Articles Browsing xkcd in a windows 7 way[^] #endregion

            Cesar de Souza wrote (#23):

            I am also interested in the answer :)

            http://crsouza.com

              leppie wrote:

              harold aptroot wrote:

              ASM: 127836 // anyone? what happened here?

              Did you call Marshal.Prelink after getting the function pointer?

              xacc.ide
              IronScheme - 1.0 RC 1 - out now!
              ((λ (x) `(,x ',x)) '(λ (x) `(,x ',x))) The Scheme Programming Language – Fourth Edition

              Lost User wrote (#24):

              No, thanks, I'll keep that in mind :)

                Chris Trelawny Ross wrote (#25):

                Umm ...

                harold aptroot wrote:

                The Just In Time compiler didn't even do such a bad job here...

                ... the Just in Time compiler takes as its input the IL code that is generated by the C# compiler or, in the case of the IL assembly language you hand crafted, the output of the IL assembler. So, actually, the JIT compiler hasn't done anything yet. :-\

                  Lost User wrote (#26):

                  I wrote it. The source is a bit long to post here, though, since it includes a rudimentary assembler.

                    Lost User wrote (#27):

                    I doubt it, it's pretty smart..

                      Lost User wrote (#28):

                      I'm not sure I know what you mean, but I didn't write any MSIL, just native assembly

                        Giorgi Dalakishvili wrote (#29):

                        What about writing an article? It would be interesting to see how you inject it in your application.

                          Lost User wrote (#30):

                          Daniel Grunwald wrote:

                          Also, your optimized code is not equivalent - it has a much lower floating point precision.

                          I did make a note to that effect - the non-cheating version was still over 3 times as fast as the C# version.

                          Daniel Grunwald wrote:

                          Moreover, the way your C# code is written, even the cvtss2sd/cvtsd2ss dance is mandatory for the compiler:

                          Fair enough. I made the change you suggested, and the result is:

                          movss xmm2,dword ptr [rsp+40h]
                          movss xmm0,dword ptr [rsp+40h]
                          mulss xmm2,xmm0
                          movss xmm1,dword ptr [rsp+44h]
                          movss xmm0,dword ptr [rsp+44h]
                          mulss xmm1,xmm0
                          addss xmm2,xmm1
                          movss xmm1,dword ptr [rsp+48h]
                          movss xmm0,dword ptr [rsp+48h]
                          mulss xmm1,xmm0
                          addss xmm2,xmm1
                          cvtss2sd xmm0,xmm2
                          sqrtsd xmm1,xmm0
                          cvtsd2ss xmm2,xmm1
                          movss xmm0,dword ptr [00000160h]
                          divss xmm0,xmm2
                          movss xmm1,dword ptr [rsp+40h]
                          mulss xmm1,xmm0
                          movss xmm2,dword ptr [rsp+44h]
                          mulss xmm2,xmm0
                          movss xmm3,dword ptr [rsp+48h]
                          mulss xmm3,xmm0

                          The sqrt is still done in high precision. It got a tiny bit faster because of the lower precision divss instead of divsd. What it does now looks particularly silly to me though..

                            Lost User wrote (#31):

                            Thanks, I'll keep that in mind :)

                              Lost User wrote (#32):

                              Would this be enough for an article? And it also got some bad reactions.. Anyway, I assemble the code in the string into an array of bytes, then I VirtualAlloc a big enough piece of memory (with EXECUTE_READWRITE), then I copy the bytes to that memory, and then I use Marshal.GetDelegateForFunctionPointer. All of that including importing VirtualAlloc and VirtualFree and the enums that go with them is just 104 Lines of Code, but that's still a bit long to just post here..
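
The steps described above (bytes, VirtualAlloc with execute+read+write, copy, Marshal.GetDelegateForFunctionPointer) can be sketched in a few lines. This is not the 104-line class from the post: there is no assembler, the machine code is five hand-assembled bytes for a trivial add function, and the `NativeCodeSketch`, `AddFn` and `TryAdd` names are made up. It only runs the native bytes on Windows x64 and returns null everywhere else.

```csharp
using System;
using System.Runtime.InteropServices;

public static class NativeCodeSketch
{
    // Win32 constants for VirtualAlloc / VirtualFree.
    const uint MEM_COMMIT = 0x1000, MEM_RESERVE = 0x2000, MEM_RELEASE = 0x8000;
    const uint PAGE_EXECUTE_READWRITE = 0x40;

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr VirtualAlloc(IntPtr address, UIntPtr size, uint allocationType, uint protect);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool VirtualFree(IntPtr address, UIntPtr size, uint freeType);

    [UnmanagedFunctionPointer(CallingConvention.Cdecl)]
    delegate int AddFn(int a, int b);

    // Hand-assembled x64 code for the Windows calling convention (a in ecx, b in edx):
    //   mov eax, ecx ; add eax, edx ; ret
    static readonly byte[] Code = { 0x89, 0xC8, 0x01, 0xD0, 0xC3 };

    // Returns a + b computed by the native bytes, or null where they cannot run.
    public static int? TryAdd(int a, int b)
    {
        if (!RuntimeInformation.IsOSPlatform(OSPlatform.Windows) ||
            RuntimeInformation.ProcessArchitecture != Architecture.X64)
            return null;

        // 1. Get memory that is both writable and executable.
        IntPtr mem = VirtualAlloc(IntPtr.Zero, (UIntPtr)(uint)Code.Length,
                                  MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
        if (mem == IntPtr.Zero)
            return null;
        try
        {
            // 2. Copy the machine code into it.
            Marshal.Copy(Code, 0, mem, Code.Length);
            // 3. Wrap the address in a delegate and call it.
            AddFn add = Marshal.GetDelegateForFunctionPointer<AddFn>(mem);
            return add(a, b);
        }
        finally
        {
            // The delegate must not be used after this point.
            VirtualFree(mem, UIntPtr.Zero, MEM_RELEASE);
        }
    }

    public static void Main()
    {
        int? r = TryAdd(2, 3);
        Console.WriteLine(r.HasValue ? r.Value.ToString() : "not Windows x64, skipped");
    }
}
```

A real version would keep the allocation alive for as long as the delegate is in use instead of freeing it right after one call.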

                                ely_bob wrote (#33):

                                This would be interesting to see, and there are not enough articles about working with the assembler. :thumbsup: If not, I think a more detailed article based on this thread would be in order (or a tip/trick) :)

                                I'd blame it on the Brain farts.. But let's be honest, it really is more like a Methane factory between my ears some days then it is anything else... -"The conversations he was having with himself were becoming ominous."-.. On the radio...

                                  Daniel Grunwald wrote (#34):

                                  harold aptroot wrote:

                                  The sqrt is still done in high precision. It got a tiny bit faster because of the lower precision divss instead of divsd. What it does now looks particularly silly to me though..

                                  Yes, that's pretty silly; unless there are some special cases related to NaNs/infinities, it should be possible to optimize cvtss2sd/sqrtsd/cvtsd2ss to sqrtss. And if it's not possible, a float-overload should be added to Math.Sqrt. The double/triple loads also seem crazy, I thought the x64 JIT was optimizing those. At least I heard about redundant loads being eliminated on x64 (this affects multi-threaded semantics in some cases, e.g. unsynchronized bool stop;), so I wonder why it doesn't do that in your case. However no compiler will introduce packed instructions in this case - this is beyond what auto-vectorization can do for you. So in general, you can always beat a good compiler in floating point math. And beating the .NET JIT is even easier.
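
For readers on newer runtimes: a float square root has since appeared as `MathF.Sqrt` (in .NET Core 2.0 and later; it was not available in the .NET Framework of this thread's era). Below is a sketch of the Normalize loop written with it, so that no double conversion appears in the source. Whether a given JIT then emits sqrtss is not something this sketch can promise, and the `Float3` struct is a guess at the original's shape.

```csharp
using System;

// Guess at the original struct: three public float fields and a matching constructor.
public struct Float3
{
    public float x, y, z;
    public Float3(float x, float y, float z) { this.x = x; this.y = y; this.z = z; }
}

public static class FloatOnly
{
    // Same algorithm as the original Normalize, with every step in single precision.
    public static void Normalize(Float3[] array)
    {
        for (int i = 0; i < array.Length; i++)
        {
            Float3 f = array[i];
            float invLen = 1.0f / MathF.Sqrt(f.x * f.x + f.y * f.y + f.z * f.z);
            array[i] = new Float3(f.x * invLen, f.y * invLen, f.z * invLen);
        }
    }

    public static void Main()
    {
        Float3[] data = { new Float3(3, 0, 4) };
        Normalize(data);
        Console.WriteLine(data[0].x + " " + data[0].y + " " + data[0].z);
    }
}
```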

                                    Chris Trelawny Ross wrote (#35):

                                    More fool me for not looking closely enough to see that you were writing native assembly, not IL assembly, and jumping to completely invalid conclusions. My apologies for being so careless presumptuous as to think you didn't know what you were saying.

                                      Lost User wrote (#36):

                                      No problem :)

                                      • D Daniel Grunwald

                                        harold aptroot wrote:

                                        The sqrt is still done in high precision. It got a tiny bit faster because of the lower precision divss instead of divsd. What it does now looks particularly silly to me though..

                                        Yes, that's pretty silly; unless there are some special cases related to NaNs/infinities, it should be possible to optimize cvtss2sd/sqrtsd/cvtsd2ss to sqrtss. And if it's not possible, a float overload should be added to Math.Sqrt. The double/triple loads also seem crazy; I thought the x64 JIT was optimizing those. At least I heard about redundant loads being eliminated on x64 (this affects multi-threaded semantics in some cases, e.g. an unsynchronized bool stop; flag), so I wonder why it doesn't do that in your case. However, no compiler will introduce packed instructions in this case - this is beyond what auto-vectorization can do for you. So in general, you can always beat a good compiler in floating point math. And beating the .NET JIT is even easier.

                                        Lost User
                                        #37

                                        I experimented a bit - the triple loads do not happen when Float3 is a class. The code then becomes:

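                                        movss xmm5,dword ptr [rax+8]        ; xmm5 = x (assuming x, y, z follow the object header)
                                        movaps xmm1,xmm5
                                        mulss xmm1,xmm5                     ; xmm1 = x*x
                                        movss xmm4,dword ptr [rax+0Ch]      ; xmm4 = y
                                        movaps xmm0,xmm4
                                        mulss xmm0,xmm4                     ; xmm0 = y*y
                                        addss xmm1,xmm0                     ; xmm1 = x*x + y*y
                                        movss xmm3,dword ptr [rax+10h]      ; xmm3 = z
                                        movss xmm0,xmm3
                                        mulss xmm0,xmm3                     ; xmm0 = z*z
                                        addss xmm1,xmm0                     ; xmm1 = x*x + y*y + z*z
                                        cvtss2sd xmm0,xmm1                  ; widen the sum to double
                                        sqrtsd xmm1,xmm0                    ; double-precision square root
                                        cvtsd2ss xmm2,xmm1                  ; narrow back to float: xmm2 = length
                                        movss xmm8,dword ptr [00000128h]    ; load a constant (presumably 1.0f)
                                        divss xmm8,xmm2                     ; xmm8 = invLen
                                        movss xmm7,xmm8
                                        mulss xmm7,xmm5                     ; xmm7 = x * invLen
                                        movss xmm6,xmm8
                                        mulss xmm6,xmm4                     ; xmm6 = y * invLen
                                        mulss xmm8,xmm3                     ; xmm8 = z * invLen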

                                        However, in total the code gets a bit slower.
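
For anyone who wants to repeat the struct-versus-class experiment, a minimal sketch could look like the code below. The type names Float3S and Float3C, the Normalizer class, and the exact form of the invLen expression (a float division, matching the divss in the listing) are assumptions for illustration; the thread does not show the actual Float3 definition used.

```csharp
using System;

// Hypothetical value-type version: elements are stored inline in the array.
struct Float3S
{
    public float x, y, z;
    public Float3S(float x, float y, float z) { this.x = x; this.y = y; this.z = z; }
}

// Hypothetical reference-type version: the array holds references to heap objects.
class Float3C
{
    public float x, y, z;
    public Float3C(float x, float y, float z) { this.x = x; this.y = y; this.z = z; }
}

static class Normalizer
{
    public static void Normalize(Float3S[] array)
    {
        for (int i = 0; i < array.Length; i++)
        {
            Float3S f = array[i]; // copies the three floats into a local
            float invLen = 1.0f / (float)Math.Sqrt(f.x * f.x + f.y * f.y + f.z * f.z);
            array[i] = new Float3S(f.x * invLen, f.y * invLen, f.z * invLen);
        }
    }

    public static void Normalize(Float3C[] array)
    {
        for (int i = 0; i < array.Length; i++)
        {
            Float3C f = array[i]; // copies only the reference
            float invLen = 1.0f / (float)Math.Sqrt(f.x * f.x + f.y * f.y + f.z * f.z);
            array[i] = new Float3C(f.x * invLen, f.y * invLen, f.z * invLen); // new heap object per element
        }
    }
}

static class Program
{
    static void Main()
    {
        var s = new[] { new Float3S(3f, 0f, 4f) };
        var c = new[] { new Float3C(3f, 0f, 4f) };
        Normalizer.Normalize(s);
        Normalizer.Normalize(c);
        Console.WriteLine($"{s[0].x} {s[0].y} {s[0].z}");
        Console.WriteLine($"{c[0].x} {c[0].y} {c[0].z}");
    }
}
```

Note that the class version allocates a new object for every element. Whether that accounts for the slowdown reported above is not established here; it would need to be measured.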

                                        • _ _Damian S_

                                          Impressive... but I call that someone needs to check around to see if there's a life somewhere they can grab!! :laugh:

                                          I don't have ADHD, I have ADOS... Attention Deficit oooh SHINY!! If you like cars, check out the Booger Mobile blog | If you feel generous - make a donation to Camp Quality!!

                                          Earl Truss
                                          #38

                                          I, for one, am surprised that anyone still talks about doing anything with assembler these days. Too many programmers I run into don't even know how to do arithmetic in hex or octal without a calculator.
