Who says you can't beat the compiler? [part 1 of 2]

Giorgi Dalakishvili

What about writing an article? It would be interesting to see how you inject it in your application.

Lost User

Daniel Grunwald wrote:

Also, your optimized code is not equivalent - it has a much lower floating point precision.

I did make a note to that effect - the non-cheating version was still over 3 times as fast as the C# version.

Daniel Grunwald wrote:

Moreover, the way your C# code is written, even the cvtss2sd/cvtsd2ss dance is mandatory for the compiler:

Fair enough. I made the change you suggested, and the result is:

movss xmm2,dword ptr [rsp+40h]
movss xmm0,dword ptr [rsp+40h]
mulss xmm2,xmm0
movss xmm1,dword ptr [rsp+44h]
movss xmm0,dword ptr [rsp+44h]
mulss xmm1,xmm0
addss xmm2,xmm1
movss xmm1,dword ptr [rsp+48h]
movss xmm0,dword ptr [rsp+48h]
mulss xmm1,xmm0
addss xmm2,xmm1
cvtss2sd xmm0,xmm2
sqrtsd xmm1,xmm0
cvtsd2ss xmm2,xmm1
movss xmm0,dword ptr [00000160h]
divss xmm0,xmm2
movss xmm1,dword ptr [rsp+40h]
mulss xmm1,xmm0
movss xmm2,dword ptr [rsp+44h]
mulss xmm2,xmm0
movss xmm3,dword ptr [rsp+48h]
mulss xmm3,xmm0

The sqrt is still done in high precision. It got a tiny bit faster because of the lower precision divss instead of divsd. What it does now looks particularly silly to me though..
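For readers who don't read SSE listings fluently, here is a rough C# sketch of the kind of normalization the disassembly above appears to implement: three squares summed in single precision, a square root that goes through double (cvtss2sd/sqrtsd/cvtsd2ss), a single-precision reciprocal (divss), and three multiplies. The original C# source is not shown in this thread, so the `Float3` layout and the `Normalize` name are assumptions made for illustration only.

```csharp
using System;

// Hypothetical vector type; field and method names are illustrative guesses,
// not the code that produced the listing above.
struct Float3
{
    public float X, Y, Z;

    public Float3(float x, float y, float z) { X = x; Y = y; Z = z; }

    public Float3 Normalize()
    {
        // Sum of squares stays in single precision (mulss/addss).
        float lengthSquared = X * X + Y * Y + Z * Z;

        // Math.Sqrt only takes a double, so the value is widened,
        // rooted in double precision, and narrowed again.
        float length = (float)Math.Sqrt(lengthSquared);

        // Single-precision reciprocal (divss), then three multiplies (mulss).
        float inv = 1.0f / length;
        return new Float3(X * inv, Y * inv, Z * inv);
    }
}
```

Calling `new Float3(3f, 0f, 4f).Normalize()` should give a vector close to (0.6, 0, 0.8). Whether the JIT emits exactly the instructions shown above for this sketch depends on the runtime version and is not something this example guarantees.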

Lost User

Thanks, I'll keep that in mind :)

Lost User

Would this be enough for an article? It also got some bad reactions. Anyway, I assemble the code in the string into an array of bytes, then I VirtualAlloc a big enough piece of memory (with EXECUTE_READWRITE), then I copy the bytes to that memory, and then I use Marshal.GetDelegateForFunctionPointer. All of that, including importing VirtualAlloc and VirtualFree and the enums that go with them, is just 104 lines of code, but that's still a bit long to just post here.
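The steps described above can be sketched in a few lines. This is not the 104-line Asm class from the post, which is not shown in the thread; it is a minimal illustration that skips the assembler entirely and uses four hand-encoded bytes instead. The class, delegate, and method names are hypothetical, and the sketch assumes a 64-bit x64 Windows process.

```csharp
using System;
using System.Runtime.InteropServices;

static class NativeCodeSketch
{
    // Hypothetical signature for the generated function: returns a + b.
    delegate int AddDelegate(int a, int b);

    const uint MEM_COMMIT = 0x1000, MEM_RESERVE = 0x2000, MEM_RELEASE = 0x8000;
    const uint PAGE_EXECUTE_READWRITE = 0x40;

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr VirtualAlloc(IntPtr lpAddress, UIntPtr dwSize,
                                      uint flAllocationType, uint flProtect);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool VirtualFree(IntPtr lpAddress, UIntPtr dwSize, uint dwFreeType);

    public static int AddViaNativeCode(int a, int b)
    {
        if (!RuntimeInformation.IsOSPlatform(OSPlatform.Windows) ||
            RuntimeInformation.ProcessArchitecture != Architecture.X64)
            throw new PlatformNotSupportedException("This sketch assumes x64 Windows.");

        // Stand-in for the assembler output. Under the x64 Windows calling
        // convention, a arrives in ecx, b in edx, and the result goes in eax.
        byte[] code =
        {
            0x8D, 0x04, 0x11, // lea eax, [rcx+rdx]
            0xC3              // ret
        };

        // Allocate memory that is readable, writable, and executable.
        IntPtr mem = VirtualAlloc(IntPtr.Zero, (UIntPtr)(uint)code.Length,
                                  MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
        if (mem == IntPtr.Zero)
            throw new InvalidOperationException(
                "VirtualAlloc failed, error " + Marshal.GetLastWin32Error());

        try
        {
            // Copy the machine code in and wrap the address in a delegate.
            Marshal.Copy(code, 0, mem, code.Length);
            var add = (AddDelegate)Marshal.GetDelegateForFunctionPointer(
                mem, typeof(AddDelegate));
            return add(a, b);
        }
        finally
        {
            // The delegate must not be invoked after this point.
            VirtualFree(mem, UIntPtr.Zero, MEM_RELEASE);
        }
    }
}
```

On x64 Windows, `NativeCodeSketch.AddViaNativeCode(2, 3)` should return 5. A real implementation would presumably keep the allocation alive and cache the delegate rather than allocating on every call, since the allocation and marshaling overhead would otherwise swamp any gain from the hand-written code.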

ely_bob

This would be interesting to see, and there are not enough articles about working with the assembler. :thumbsup: If not, I think a more detailed article based on this thread would be in order (or a tip/trick) :)

Daniel Grunwald

harold aptroot wrote:

The sqrt is still done in high precision. It got a tiny bit faster because of the lower precision divss instead of divsd. What it does now looks particularly silly to me though..

Yes, that's pretty silly; unless there are some special cases related to NaNs/infinities, it should be possible to optimize cvtss2sd/sqrtsd/cvtsd2ss to sqrtss. And if that's not possible, a float overload should be added to Math.Sqrt. The double/triple loads also seem crazy; I thought the x64 JIT optimized those away. At least I heard about redundant loads being eliminated on x64 (this affects multi-threaded semantics in some cases, e.g. an unsynchronized bool stop;), so I wonder why it doesn't do that in your case.

However, no compiler will introduce packed instructions in this case - this is beyond what auto-vectorization can do for you. So in general, you can always beat a good compiler in floating point math. And beating the .NET JIT is even easier.
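On the cvtss2sd/sqrtsd/cvtsd2ss point: as far as I know, taking the square root of a float in double precision and rounding back to float gives the same result as a correctly rounded single-precision square root, because double carries more than twice the significand bits of float. The sketch below checks that empirically on random positive finite inputs. It relies on `MathF.Sqrt`, which was only added in later .NET versions (.NET Core 2.0 onward) and did not exist when this thread was written; the class and method names are made up for the example.

```csharp
using System;

static class SqrtCheck
{
    // Returns how many sampled inputs give different results for the
    // "through double" path and the direct single-precision path.
    public static int CountMismatches(int samples, int seed)
    {
        var rng = new Random(seed);
        int mismatches = 0;

        for (int i = 0; i < samples; i++)
        {
            // Random.Next() is non-negative, so the sign bit is clear and
            // the bit pattern decodes to a non-negative float (or NaN/infinity).
            float x = BitConverter.Int32BitsToSingle(rng.Next());
            if (float.IsNaN(x) || float.IsInfinity(x))
                continue; // special cases are left out of this check

            float viaDouble = (float)Math.Sqrt((double)x); // widen, root, narrow
            float direct = MathF.Sqrt(x);                  // single precision

            if (viaDouble != direct)
                mismatches++;
        }

        return mismatches;
    }
}
```

`SqrtCheck.CountMismatches(100000, 12345)` should report zero mismatches. A random sample is of course not a proof, and it says nothing about NaN or negative inputs, which are the special cases the post above is hedging about.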

Chris Trelawny Ross

More fool me for not looking closely enough to see that you were writing native assembly, not IL assembly, and jumping to completely invalid conclusions. My apologies for being so careless and presumptuous as to think you didn't know what you were saying.

Lost User

No problem :)

Lost User

I experimented a bit - the triple loads do not happen when Float3 is a class. The code then becomes:

movss xmm5,dword ptr [rax+8]
movaps xmm1,xmm5
mulss xmm1,xmm5
movss xmm4,dword ptr [rax+0Ch]
movaps xmm0,xmm4
mulss xmm0,xmm4
addss xmm1,xmm0
movss xmm3,dword ptr [rax+10h]
movss xmm0,xmm3
mulss xmm0,xmm3
addss xmm1,xmm0
cvtss2sd xmm0,xmm1
sqrtsd xmm1,xmm0
cvtsd2ss xmm2,xmm1
movss xmm8,dword ptr [00000128h]
divss xmm8,xmm2
movss xmm7,xmm8
mulss xmm7,xmm5
movss xmm6,xmm8
mulss xmm6,xmm4
mulss xmm8,xmm3

However, in total the code gets a bit slower.

Earl Truss

I, for one, am surprised that anyone still talks about doing anything with assembler these days. Too many programmers I run into don't even know how to do arithmetic in hex or octal without a calculator.

Lost User

Well thanks, I've done something like this before though - multiple times actually :) But this time it was better (and longer)

Cesar de Souza

It would be interesting if you could write an article about this - I am already giving it a five ;P

Lost User

Alright, but it'll likely take a while, as I'm still working on another article... although maybe I'll do this one first. I don't know yet. Also, I'll have a little less time now because university starts again tomorrow.

Mike Marynowski

The output from his Asm class is dynamically generated native code. He marshals a delegate that points into his native method and calls it. He posted a brief description of his Asm class some posts up.