Who says you can't beat the compiler? [part 1 of 2]
-
Actually I do, too. Otherwise newbies might try what I did here:
static void Normalize(Float3[] array)
{
for (int i = 0; i < array.Length; i++)
{
Float3 f = array[i];
float invLen = (float)(1.0 / Math.Sqrt(f.x * f.x + f.y * f.y + f.z * f.z));
array[i] = new Float3(f.x * invLen, f.y * invLen, f.z * invLen);
}
}

Ok, not in this part yet, but in the part with the assembly and all. This is the 'unoptimized' C# code. It's not meant to be pretty, and it actually has a low-level optimization already: three multiplications and one division is faster than three divisions. Using three divisions would pain me too much to even consider. The Just-In-Time compiler didn't do such a bad job here; Math.Sqrt gets nicely compiled to fsqrt (32-bit mode) or sqrtsd (64-bit mode), and not too much nonsense goes on around it either. It's a shame it has no clue how to fully use SSE, though. This looks like a rather lame use of SSE to me (x, y and z are at rsp+40h, 44h and 48h):
movss xmm2,dword ptr [rsp+40h]
movss xmm0,dword ptr [rsp+40h]
mulss xmm2,xmm0
movss xmm1,dword ptr [rsp+44h]
movss xmm0,dword ptr [rsp+44h]
mulss xmm1,xmm0
addss xmm2,xmm1
movss xmm1,dword ptr [rsp+48h]
movss xmm0,dword ptr [rsp+48h]
mulss xmm1,xmm0
addss xmm2,xmm1
cvtss2sd xmm0,xmm2
sqrtsd xmm1,xmm0
movsd xmm2,mmword ptr [00000160h]
divsd xmm2,xmm1
cvtsd2ss xmm0,xmm2
movss xmm1,dword ptr [rsp+40h]
mulss xmm1,xmm0
movss xmm2,dword ptr [rsp+44h]
mulss xmm2,xmm0
movss xmm3,dword ptr [rsp+48h]
mulss xmm3,xmm0

(Why is none of this coloured? I used lang="asm".) This just makes me sad. We can see here how it puts the arguments to the Float3 ctor in xmm1..xmm3, which is correct according to the specs[^], because the first argument will be a pointer to where the new struct will be put, in rcx (code omitted; it's not very interesting anyway). This post is split because apparently there's a length limit..?
Impressive... but I call that someone needs to check around to see if there's a life somewhere they can grab!! :laugh:
I don't have ADHD, I have ADOS... Attention Deficit oooh SHINY!! If you like cars, check out the Booger Mobile blog | If you feel generous - make a donation to Camp Quality!!
-
harold aptroot wrote:
Actually I do, too. Otherwise newbies might try what I did here: [...]
I would think it would not take a lot of craftiness to beat the .NET or a JVM with ASM.
John
-
harold aptroot wrote:
Actually I do, too. Otherwise newbies might try what I did here: [...]
Of course you can beat the compiler, the question is whether it's worth the time and effort. (Of course if you were really fanatical, you'd write this in native assembly and load either the 32-bit or 64-bit [or ARM or whatever] DLL and use that.)
-
I would think it would not take a lot of craftiness to beat the .NET or a JVM with ASM.
John
-
harold aptroot wrote:
Who says you can't beat the compiler?
I have never heard anyone say that.:~ Handcrafting assembly code has always been one way to optimize some critical routines. Though in practice it happens very rarely.
Rama Krishna Vavilala wrote:
I have never heard anyone say that
That's not surprising. I googled the phrase and only got 3.75 pages back.
Henry Minute Do not read medical books! You could die of a misprint. - Mark Twain Girl: (staring) "Why do you need an icy cucumber?" “I want to report a fraud. The government is lying to us all.”
-
harold aptroot wrote:
Actually I do, too. Otherwise newbies might try what I did here: [...]
-
Surely we can do better than this. As it turns out, we can! I wrote a little class that uses VirtualAlloc (to get executable+writable memory) and a quickly hacked-together assembler, so we can do this:

Asm asm = new Asm(@"
loop:
    test edx, edx
    jz end
    movaps xmm0, [rcx]
    movaps xmm2, xmm0
    mulps xmm0, xmm0
    movaps xmm1, xmm0
    shufps xmm0, xmm0, ( 2, 1, 0, 3 )
    addps xmm1, xmm0
    movaps xmm0, xmm1
    shufps xmm1, xmm1, ( 1, 0, 3, 2 )
    addps xmm0, xmm1
    rsqrtps xmm0, xmm0
    mulps xmm0, xmm2
    movaps [rcx], xmm0
    add edx, -1
    add rcx, 16
    jmp loop
end:
    ret
");
And to benchmark it, I used:
Float3[] f = new Float3[0x1000];
// we don't want to measure JIT overhead later
Normalize(f);
unsafe { asm.GetDelegate<Method>()((Float3*)0, 0); }
for (int j = 0; j < 10; j++)
{
    for (int i = 0; i < f.Length; i++)
        f[i] = new Float3(1, 2, 3);
    Stopwatch s2 = Stopwatch.StartNew();
    Normalize(f);
    s2.Stop();
    Console.WriteLine("C#: " + s2.ElapsedTicks);
    for (int i = 0; i < f.Length; i++)
        f[i] = new Float3(1, 2, 3);
    Method m = asm.GetDelegate<Method>();
    Stopwatch s = Stopwatch.StartNew();
    unsafe
    {
        fixed (Float3* fptr = f)
        {
            m(fptr, f.Length);
        }
    }
    s.Stop();
    Console.WriteLine("ASM: " + s.ElapsedTicks);
}
Finally it's time for the results! Did we beat the compiler? YES! Here's the result of one run:
C#: 624168
ASM: 127836 // anyone? what happened here?
C#: 615807
ASM: 66465
C#: 615780
ASM: 66294
C#: 615726
ASM: 66276
C#: 615717
ASM: 66285
C#: 615744
ASM: 66285
C#: 615726
ASM: 66276
C#: 617112
ASM: 66285
C#: 615726
ASM: 66285
C#: 615735
ASM: 66285

So there you have it: you can beat the compiler. The .NET JIT compiler, at least :) Some small notes: - using
harold aptroot wrote:
ASM: 127836 // anyone? what happened here?
Did you call
Marshal.Prelink
after getting the function pointer?
xacc.ide
IronScheme - 1.0 RC 1 - out now!
((λ (x) `(,x ',x)) '(λ (x) `(,x ',x))) The Scheme Programming Language – Fourth Edition -
Is it just my perception, or is this actually the first legitimate technical article posted in the lounge? I give it a 5, btw.
puromtec1 wrote:
first legitimate technical article posted in the lounge?
I suspect he couldn't find his blog :)
xacc.ide
IronScheme - 1.0 RC 1 - out now!
((λ (x) `(,x ',x)) '(λ (x) `(,x ',x))) The Scheme Programming Language – Fourth Edition -
Of course you can beat the compiler, the question is whether it's worth the time and effort. (Of course if you were really fanatical, you'd write this in native assembly and load either the 32-bit or 64-bit [or ARM or whatever] DLL and use that.)
-
harold aptroot wrote:
Actually I do, too. Otherwise newbies might try what I did here: [...]
-
harold aptroot wrote:
Actually I do, too. Otherwise newbies might try what I did here: [...]
Agh! Reality! My Archnemesis![^]
| FoldWithUs! | sighist | WhoIncludes - Analyzing C++ include file hierarchy -
harold aptroot wrote:
Actually I do, too. Otherwise newbies might try what I did here: [...]
Yeah, but the real question is: can you beat the Intel C++ compiler?
[Genetic Algorithm Library] [Wowd]
modified on Wednesday, September 8, 2010 2:26 AM
-
harold aptroot wrote:
Surely we can do better than this. As it turns out, we can! [...]
harold aptroot wrote:
ASM: 127836 // anyone? what happened here?
Instructions written into memory generated by VirtualAlloc will most likely cause an L1/L2 cache miss. The extra clock cycles were probably spent utilizing the TLB to find the physical memory offset. You can try using the prefetchnta instruction to move the memory into L1 if you want to avoid the initial cache miss. Keep in mind that prefetchnta is only a hint and will sometimes be ignored under certain conditions.
Best Wishes,
-David Delaune
-
harold aptroot wrote:
Actually I do, too. Otherwise newbies might try what I did here: [...]
I haven't seen a compiler that can properly use SSE2 yet. Auto-vectorization often only works in trivial cases; in other cases packed instructions go unused. However, instead of dropping to assembler, you can write C code using intrinsics. That way you only select the ASM instructions to use, and the compiler picks the instruction ordering and register allocation for you. And unlike inline ASM code, intrinsics are portable between multiple C compilers and between x86 and x86-64.
Also, your optimized code is not equivalent: it has much lower floating point precision. RSQRTPS is (a lot) faster than SQRTPS, but has much less precision. You need an additional Newton-Raphson step to arrive somewhere close to normal float precision. Of course, the loss of precision may be acceptable in your case, but it's the reason compilers cannot do this optimization automatically. Moreover, the way your C# code is written, even the cvtss2sd/cvtsd2ss dance is mandatory for the compiler: you are dividing a double (1.0) by a double (the Math.Sqrt result), so the compiler is not allowed to introduce additional rounding errors by rounding the intermediate result to float. You might have gotten more efficient code by writing:
float invLen = 1.0f / (float)Math.Sqrt(f.x * f.x + f.y * f.y + f.z * f.z);
-
harold aptroot wrote:
Surely we can do better than this. As it turns out, we can! [...]
Sorry about stupid question but where does the Asm class come from?
Giorgi Dalakishvili #region signature My Articles Browsing xkcd in a windows 7 way[^] #endregion
-
Sorry about stupid question but where does the Asm class come from?
Giorgi Dalakishvili #region signature My Articles Browsing xkcd in a windows 7 way[^] #endregion
I am also interested in the answer :)
-
harold aptroot wrote:
ASM: 127836 // anyone? what happened here?
Did you call
Marshal.Prelink
after getting the function pointer?
xacc.ide
IronScheme - 1.0 RC 1 - out now!
((λ (x) `(,x ',x)) '(λ (x) `(,x ',x))) The Scheme Programming Language – Fourth Edition -
harold aptroot wrote:
Actually I do, too. Otherwise newbies might try what I did here: [...]
Umm ...
harold aptroot wrote:
The Just In Time compiler didn't even do such a bad job here...
... the Just in Time compiler takes as its input the IL code that is generated by the C# compiler or, in the case of the IL assembly language you hand crafted, the output of the IL assembler. So, actually, the JIT compiler hasn't done anything yet. :-\
-
Sorry about stupid question but where does the Asm class come from?
Giorgi Dalakishvili #region signature My Articles Browsing xkcd in a windows 7 way[^] #endregion
-
Yeah, but the real question is - can you beat Intel C++ compiler?
[Genetic Algorithm Library] [Wowd]
modified on Wednesday, September 8, 2010 2:26 AM