Anyone using Nvidia's CUDA?
-
kmg365 wrote:
It's the second one.
Kinda thought it might be ;) All I know about SAS is that my cousin used to be a SAS contractor.
Java, Basic, who cares - it's all a bunch of tree-hugging hippy cr*p
Stuart Dootson wrote:
my cousin used to be a SAS contractor
Made a lot of money and went insane.
Never underestimate the power of human stupidity RAH
-
Ah, you know him...
Java, Basic, who cares - it's all a bunch of tree-hugging hippy cr*p
-
No, but I am planning to use OpenCL. I am not really sure how much they differ.
-
At a guess, one's called CUDA and the other OpenCL.
print "http://www.codeproject.com".toURL().text Ain't that Groovy?
-
-
It's used by Google in their search engines. I have an interesting post about GPUs and Amdahl's Law on Nvidia: How can CUDA break Amdahl's Law?[^] If you are thinking of using CUDA, you might want to read the post first; it may or may not help your team make a decision to use CUDA. The truth about CUDA is revealed. ~TheArch :cool:
modified on Wednesday, July 22, 2009 10:44 PM
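For context, Amdahl's Law is just arithmetic: if a fraction p of the work parallelizes across n processors, the overall speedup is 1 / ((1 - p) + p / n). A quick sketch (the 240-core figure is the 2009-era GTX 280, used here only as an illustration):

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Overall speedup when only a fraction of the work parallelizes."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_processors)

# Even with 240 cores, a 90%-parallel program is capped below 10x,
# because the 10% serial part dominates as n grows.
print(round(amdahl_speedup(0.90, 240), 2))    # ~9.64
print(round(amdahl_speedup(0.90, 10**9), 2))  # limit -> 1/(1-0.90) = 10.0
```

That asymptote is why the "can CUDA break Amdahl's Law" question is worth reading before committing a team to it.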
-
If you can find me the CUDA language spec (opcodes), I will make an IL -> CUDA IL assembler/linker for it. I looked around for it a few weeks ago, but could not find it. The alternative is to use DirectML, but I have no idea when it's coming out?!?! ~TheArch
-
-
Yeah, I tried that some time ago. While it is nice, it only works in 32-bit mode because it uses DirectX. It's convenient, but the performance is not great, and in 64-bit mode it just dies because the DirectX DLLs die (MS's fault), which makes it impossible to use in a plugin for Paint.NET without making it multi-process - but that would completely kill the performance. And it seems a bit abandoned.
-
:thumbsup:
harold aptroot wrote:
Sweet, seems a bit large for a 1 man project though, need any help?
That would be great. Pick your flavor! Prototype AutoParalizer architecture:
1. Tool to parse the target .NET IL (reflect onto the IL using Reflection.Emit)
2. Assembly table linker (hash table to translate the IL into PTX)
3. CUDA integration:
a. Generate parallel mini functions in CUDA from #2
b. Mark up the .NET IL with the output of 3.a
4. Recompile everything from 3.a & 3.b
'Did I miss something?' ~TheArch
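Step 2 could start life as little more than a lookup table. A toy sketch of that idea - the opcode names are real MSIL, but the PTX templates and the `translate` helper are illustrative, not a real or complete mapping:

```python
# Hypothetical "assembly table linker": a hash table mapping MSIL opcodes
# to PTX instruction templates. Coverage and exact PTX forms are
# placeholders for illustration only.
IL_TO_PTX = {
    "add":    "add.s32 {dst}, {a}, {b};",
    "sub":    "sub.s32 {dst}, {a}, {b};",
    "mul":    "mul.lo.s32 {dst}, {a}, {b};",
    "ldc.i4": "mov.s32 {dst}, {imm};",
}

def translate(il_op, **operands):
    """Look up an IL opcode and fill in its PTX template, or report the gap."""
    template = IL_TO_PTX.get(il_op)
    if template is None:
        raise NotImplementedError(f"no PTX mapping for IL opcode '{il_op}'")
    return template.format(**operands)

print(translate("mul", dst="%r3", a="%r1", b="%r2"))
# mul.lo.s32 %r3, %r1, %r2;
```

The `NotImplementedError` path doubles as a coverage report: run a real assembly through it and the exceptions tell you which opcodes still need translation logic.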
-
Hm, I dunno, but some observations:
- MSIL uses a read-only* stack, so essentially it's equivalent to SSA (source: cr88192)
- Due to that, there is no direct mapping of MSIL onto PTX; at the very least you need register allocation. But since it's SSA, the interference graph will be chordal, so graph colouring has a polynomial running time (so optimal register usage isn't even hard)
- local/shared/shared-at-other-level/etc. could be a problem; defaulting to global-ish is extremely slow, but determining where a value should go is probably hard (sounds like escape analysis to me, which is hard)
- Different kinds of loads/stores; might take some additional analysis to figure out which one to use (mov vs ld vs tex etc.)
- There is no GC in CUDA - but classes wouldn't be all that useful there anyway; seems OK to me to limit it to structs
- There is probably more to this than might be seen at a first glance..
- ????
- Profit
* OK, let me clarify: I don't really mean that it doesn't write to the stack - I mean that it doesn't overwrite things deep down in the stack, it just pushes things onto it. Well, except in some rare and creepy cases anyway.
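The chordal-graph point can be made concrete: for chordal graphs, greedy colouring along a (reverse) perfect elimination ordering uses the minimum number of colours, and for SSA interference graphs such an ordering falls out of dominance order. A toy sketch, assuming the interference graph and a valid elimination ordering are already in hand:

```python
def color_chordal(neighbors, elimination_order):
    """Greedy colouring along a reverse perfect elimination ordering.
    On chordal graphs - which SSA interference graphs are - this is
    optimal, so the colour count is the minimum register count."""
    color = {}
    for v in reversed(elimination_order):
        used = {color[u] for u in neighbors[v] if u in color}
        c = 0
        while c in used:  # smallest colour not taken by a coloured neighbour
            c += 1
        color[v] = c
    return color

# Toy interference graph: a, b, c pairwise interfere (a triangle);
# d only interferes with c. ["d", "a", "b", "c"] is a valid PEO here.
g = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
regs = color_chordal(g, ["d", "a", "b", "c"])
print(max(regs.values()) + 1)  # 3 - the triangle forces exactly 3 registers
```

Note the allocator itself stays simple; the real work (building live ranges and the interference graph from the IL) happens upstream.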
-
harold aptroot wrote:
- MSIL uses a read-only stack so essentially it's equivalent to SSA (source: cr88192)
Hmm, okay, I'll take your word on it. But the emitted IL won't be read-only after decomposing it and saving it to a new file.
harold aptroot wrote:
- Due to that there is no direct mapping of MSIL onto PTX, at the very least you need register allocation, but since it's SSA the interference graph will be chordal, so graph colouring has a polynomial running time (so optimal register usage isn't even hard)
Yeah, I thought as much. From my first 20-second glance at it, I think about 45% map directly; the others will have to use some different translation logic.
harold aptroot wrote:
- local/shared/shared-at-other-level/etc could be a problem, defaulting to global-ish is extremely slow but determining where it should go is probably hard (sounds like escape analysis to me, which is hard)
Hmm, I don't know much about this; I will have to research it in the morning.
harold aptroot wrote:
- Different kinds of loads/stores, might take some additional analysis to figure out which one to use (mov vs ld vs tex etc)
Yeah, this is similar to #2 on your list. I envision the translation lib having custom functions in PTX, e.g. when we run into 'String s = new String("something")', translate it to the PTX-backed 'CUDA.String s = new CUDA.String((CUDA.String)"something")'.
harold aptroot wrote:
- There is no GC in CUDA - but classes wouldn't be all that useful there anyway, seems ok to me to limit it to structs
Correct, I think?! This is where it becomes very important to correctly use the 16 KB shared memory space. We can clean it up on the .NET side.
harold aptroot wrote:
- There is probably more to this than might be seen at a first glance..
Yeah, Google LabVIEW. This will break down most of our barriers.
harold aptroot wrote:
- Profit
I suggest a good prototype here on The Code Project. Then, after we get some interest and more help, a professional version with better features. We won't handicap the prototype, but the pro version would have many performance enhancements and
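On the 16 KB shared memory point, the budget maths is worth having on hand. A quick sketch, assuming the pre-Fermi 16 KB-per-block figure and 4-byte floats:

```python
SHARED_BYTES = 16 * 1024  # per-block shared memory on pre-Fermi GPUs
FLOAT_BYTES = 4

def max_square_tile(n_buffers=1):
    """Largest square float tile per buffer that fits in shared memory."""
    per_buffer = SHARED_BYTES // n_buffers
    floats = per_buffer // FLOAT_BYTES
    return int(floats ** 0.5)  # side length of the square tile

# One 64x64 float tile fills the whole 16 KB; two buffers (e.g. the A and
# B tiles of a matrix multiply) drop the limit to 45x45, which is why
# 32x32 is a common round-number choice.
print(max_square_tile(1), max_square_tile(2))  # 64 45
```

Any translator that places data automatically would need sums like this baked into its placement decisions, or kernels will silently spill to slow global memory.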
-
Harold wrote:
- ???? - Profit
Never seen that one before? :) On the SSA/MSIL thing - it's the operand stack that would be read-only, making the opcodes "implicit SSA" as I'll call it now (because the operands are not listed but inferred from the stack, and that also ensures that it really is SSA form - albeit implicit). OK, I saw LabVIEW; what am I supposed to see, though? *bookmarks project page*
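The "implicit SSA" reading can be demonstrated in a few lines: symbolically execute the stack code, minting a fresh temp for every push, and every temp ends up defined exactly once - which is the SSA property. A toy sketch over a made-up two-opcode stack language (not real MSIL):

```python
def stack_to_ssa(code):
    """Symbolically run stack ops, minting a fresh single-assignment temp
    per push; the implicit stack operands become explicit SSA triples."""
    stack, ssa, counter = [], [], 0
    for op, *args in code:
        if op == "push":
            name = f"t{counter}"; counter += 1
            ssa.append((name, "const", args[0]))
            stack.append(name)
        else:  # binary op: pops two operands, pushes one fresh result
            b, a = stack.pop(), stack.pop()
            name = f"t{counter}"; counter += 1
            ssa.append((name, op, a, b))
            stack.append(name)
    return ssa

# (2 + 3) * 4 in stack form:
prog = [("push", 2), ("push", 3), ("add",), ("push", 4), ("mul",)]
for line in stack_to_ssa(prog):
    print(line)
# ('t0', 'const', 2)
# ('t1', 'const', 3)
# ('t2', 'add', 't0', 't1')
# ('t3', 'const', 4)
# ('t4', 'mul', 't2', 't3')
```

The "rare and creepy cases" caveat above still applies: real MSIL has locals, branches, and `dup`, which need merge handling (phi nodes) that this sketch deliberately skips.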