Threading differences: VC6 vs. VS2005
-
i have a bit of C++ code that does some pretty heavy-duty image processing work. sadly, the code wasn't running as fast as i'd hoped, so, after doing all the optimizations i could think of, i decided to try multi-threading it. i broke the job into strips, with each thread taking a horizontal strip of the image. the algorithm is the same, and 99% of the code is the same; the only difference is some bounds-checking code that the multi-threaded version skips. there is no communication between threads.
i wrote it using VC6 (because i prefer the IDE to the VS.Net IDEs), and i kept the single-threaded version of the code intact for benchmarking. built with VC6, the multi-threaded version was more than twice as slow as the single-threaded version.
so, i opened the project in VS05, built and ran it, hoping the VS05 compiler had some better optimization. but no, the single-threaded version ran exactly as fast as it did when built with VC6. on the other hand, in VS05, the multi-threaded version ran twice as fast as the single-threaded version.
again: the multi-threaded code went from twice as slow to twice as fast, simply by changing the compiler, but the single-threaded version didn't change at all.
these projects are all built with the multi-threaded CRT and the same optimizations (controlled by pragmas in the code). the single- and multi-threaded functions live in the same module (same class, in fact).
so, something changed between VC6 and VS05 which affects how fast threads run. does anybody know what that change was?
-
Hi Chris, I don't think the IDE is fundamentally involved in the difference of the results. IMO you either are unfortunate or made some major mistake. Here are some possible mistakes:
- a total disregard of "locality of reference": in the ideal case all data gets loaded only once into the data cache(s); having two or more threads working somewhat independently may make that harder to achieve;
- too much need for synchronization;
- an overall alignment problem, e.g. when both threads need new cache loads at the same time, they both sit idle waiting for main memory to respond. Ideally memory accesses should be staggered somehow by design;
- cache thrashing for shared data (when the same data resides in the data cache of two distinct CPU chips and gets written by one processor, it needs updating in the other cache, a very expensive operation); the extreme case is when synchronization is based on spin locks, and two or more spin lock variables reside in the same cache line. (There is an interesting Intel Application Note on keeping no more than one sync variable in a cache line, applicable to dual processor chips, not Core2Duo.)
Image processing generally is quite vulnerable to all kinds of such phenomena, mainly because there is a rather high transport/calculation ratio, where data streaming can work well (making the job entirely compute bound) or very badly (making it bus bandwidth bound); the difference may be minor and accidental, or major and caused by a faulty approach.
I would suggest you research this case starting with a simple experiment: ignore the middle border pixels, i.e. let one thread handle the left 45% of the image, the other the right 45% of the image, and ignore the middle 10%. Have it run on both IDEs; if you did everything right you should notice it runs about twice as fast as a single-threaded solution, and it is IDE independent. Then all you have to do is look into the way you handle the middle 10% of the image.
BTW: don't forget to observe CPU load in Task Manager for both the 1- and 2-threaded attempts. If OTOH the two-thread solution isn't really twice as fast as the single-threaded one, forget about the middle 10% (and all sync stuff) until you have got the threading right for the independent image areas. If you want more specific advice, please provide detailed information on the task: system hardware, typical image size, kind of image operation (e.g. filter size, is it inherently sequential?), etc. :)
Luc Pattyn
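The point above about two sync variables sharing one cache line can be illustrated with a small sketch. It is not taken from Chris's code: `PaddedCounter`, `count_in_parallel` and the 64-byte line size are assumptions made for illustration (64 bytes is typical for current x86 CPUs, but check your hardware). Each thread writes only to its own slot, and `alignas` keeps the slots on separate cache lines.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines, typical for current x86 CPUs.
constexpr std::size_t kCacheLine = 64;

// Hypothetical per-thread slot. alignas rounds the size and alignment up to a
// full cache line, so two threads never write to the same line.
// Needs C++17 so that std::vector honours the over-alignment.
struct alignas(kCacheLine) PaddedCounter {
    std::uint64_t value = 0;
};

// Each thread increments only its own slot; the slots are summed after join.
std::uint64_t count_in_parallel(unsigned threads, std::uint64_t per_thread) {
    assert(threads > 0);
    std::vector<PaddedCounter> slots(threads);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&slots, t, per_thread] {
            for (std::uint64_t i = 0; i < per_thread; ++i) {
                ++slots[t].value;
            }
        });
    }
    for (auto& th : pool) th.join();

    std::uint64_t total = 0;
    for (const auto& s : slots) total += s.value;
    return total;
}
```

Removing the `alignas` packs the counters eight to a line, which is the layout the advice above warns against; whether that is measurable depends on the machine and on how the compiler optimizes the loop.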
-
Luc Pattyn wrote:
Ignore the middle border pixels, i.e. let one thread handle the left 45% of the image, the other the rightside 45% of the image, and ignore the middle 10%.
i think i updated my post since you started your reply. the process here isn't the 50/50 split with the heavy inter-thread communication, which i initially described - that was a different part of the job, the one i stayed up all night fighting with. rather, the process is four totally independent threads going after different horizontal strips of an image, doing a Sobel filter (a two kernel convolution) (tried vertical strips, too - slightly slower). the only difference in code for the multi-threaded operation is that threads working on strips which don't touch the top or bottom of the image don't have to worry about bounds checking.
Luc Pattyn wrote:
I don't think the IDE is fundamentally involved in the difference of the results.
it is IDE (or at least CRT) dependent.
single-threaded, VC6: 0.31 s
multi-threaded, VC6: 1.0 s
single-threaded, VS05: 0.31 s
multi-threaded, VS05: 0.14 s
VS08 results are the same as the VS05 results. exact same code, same machine, same input and parameters, tested multiple times. the single-threaded result confirms that the difference must be in how threading is handled by the different VC versions.
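For readers who want to reproduce the setup described in this post, here is a minimal sketch of a Sobel filter split into independent horizontal strips. It is not Chris's code: `Image`, `sobel_rows` and `sobel_parallel` are hypothetical names, `std::thread` is used instead of whatever threading API the original used, and the image border is simply left at zero rather than handled with bounds checks.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical 8-bit grayscale image, stored row by row.
struct Image {
    int width;
    int height;
    std::vector<std::uint8_t> pixels;
    Image(int w, int h)
        : width(w), height(h), pixels(static_cast<std::size_t>(w) * h, 0) {}
    int at(int x, int y) const {
        return pixels[static_cast<std::size_t>(y) * width + x];
    }
};

// Sobel magnitude (|gx| + |gy|, clamped to 255) for rows [row_begin, row_end).
// Border pixels stay 0, so no neighbour read ever leaves the image.
void sobel_rows(const Image& src, Image& dst, int row_begin, int row_end) {
    const int y0 = std::max(row_begin, 1);
    const int y1 = std::min(row_end, src.height - 1);
    for (int y = y0; y < y1; ++y) {
        for (int x = 1; x < src.width - 1; ++x) {
            const int gx = -src.at(x - 1, y - 1) + src.at(x + 1, y - 1)
                         - 2 * src.at(x - 1, y) + 2 * src.at(x + 1, y)
                         - src.at(x - 1, y + 1) + src.at(x + 1, y + 1);
            const int gy = -src.at(x - 1, y - 1) - 2 * src.at(x, y - 1)
                         - src.at(x + 1, y - 1) + src.at(x - 1, y + 1)
                         + 2 * src.at(x, y + 1) + src.at(x + 1, y + 1);
            const int mag = std::min(std::abs(gx) + std::abs(gy), 255);
            dst.pixels[static_cast<std::size_t>(y) * src.width + x] =
                static_cast<std::uint8_t>(mag);
        }
    }
}

// One thread per horizontal strip. Threads read the shared source and write
// only their own rows of dst, so no synchronization is needed before join.
void sobel_parallel(const Image& src, Image& dst, int threads) {
    assert(threads > 0);
    const int rows_per_strip = (src.height + threads - 1) / threads;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        const int begin = t * rows_per_strip;
        const int end = std::min(begin + rows_per_strip, src.height);
        if (begin >= end) break;
        pool.emplace_back(sobel_rows, std::cref(src), std::ref(dst), begin, end);
    }
    for (auto& th : pool) th.join();
}
```

Calling `sobel_rows(src, dst, 0, src.height)` gives the single-threaded baseline, and `sobel_parallel(src, dst, 4)` the four-strip version; both should produce identical output, which is worth checking before comparing timings.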
-
Hi Chris, yes, I was replying to your post in the VS forum; however, by the time I had finished, your message was gone. I found this one here, but didn't notice it was somewhat different.
Of the four numbers you mention, the multi-threaded VC6 one is indeed odd. I still don't believe it is due to the IDE: compiler progress is steady but slow, and I have never seen a huge gain from one version to the next. The single-threaded results being equal suggests very similar code was generated, and the multi-threaded VS05 build being about twice as fast tells us your multi-threading seems to work just fine. I can't imagine they ever wasted up to 1 second in launching some threads. Maybe the runtime libraries were structured quite differently, causing other/more/larger DLL files to be loaded after you started your timing.
How did you measure these times? How did you build and run? Here are some experiment suggestions:
- enclose the code in a for(i=0; i<10; i++) loop and measure 10 executions inside a single process, watching for variance due to code loading, memory allocation effects, and the like (I always do this; I have seen too many weird things going on in a first execution);
- use a release build;
- run outside VS.
:)
Luc Pattyn [Forum Guidelines] [My Articles]
The quality and detail of your question reflects on the effectiveness of the help you are likely to get. Show formatted code inside PRE tags, and give clear symptoms when describing a problem.
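The first suggestion above (time ten executions inside one process) might look like the sketch below. `time_runs` is a hypothetical helper, and `std::chrono::steady_clock` is an assumption; it was not available in VC6 or VS2005, where `QueryPerformanceCounter` would play the same role.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Runs `work` `iterations` times in one process and returns the wall-clock
// duration of each run in milliseconds, so a slow first run stands out
// instead of disappearing into an average.
std::vector<double> time_runs(const std::function<void()>& work, int iterations) {
    assert(iterations >= 0);
    std::vector<double> ms;
    ms.reserve(static_cast<std::size_t>(iterations));
    for (int i = 0; i < iterations; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        ms.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return ms;
}
```

Printing all ten values rather than their mean makes one-off costs (DLL loading, first-touch memory allocation) visible as an outlier in the first entry.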
-
Hi Chris, FWIW: there was a time when VS was terribly slow at dealing with the first exception in an app. I'm not sure which versions and languages, but the delay was not very consistent and was on the order of seconds. Maybe that is what you're seeing? :)
Luc Pattyn
-
Luc Pattyn wrote:
How did you measure these times? how did you build and run?
{ CTimer t("Sobel"); Sobel(....); }
start the test app, wait a couple of seconds for things to settle down, etc. release build, yes. tried all the various compiler optimizations, etc. i didn't put this one in a loop because it took long enough that i wasn't up against the resolution limit of GetTickCount. i understand the need to time multiple iterations in order to get a decent average, but i did run it dozens of times, with consistent results. weird stuff. anyway, i guess i'll just build the final DLL in VS05 :)
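The scoped-timer pattern used in this post can be sketched as follows. The real `CTimer` interface is not shown in the thread, so `ScopedTimer` here is a hypothetical stand-in, and it uses `std::chrono::steady_clock` rather than `GetTickCount` (whose resolution is typically in the 10 to 16 ms range, small against the 0.14 s to 1.0 s runs reported above).

```cpp
#include <cassert>
#include <chrono>
#include <iostream>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical stand-in for CTimer: starts timing on construction and
// reports the elapsed wall-clock time when it goes out of scope.
class ScopedTimer {
public:
    explicit ScopedTimer(std::string label, std::ostream& out = std::cout)
        : label_(std::move(label)),
          out_(out),
          start_(std::chrono::steady_clock::now()) {}

    ~ScopedTimer() {
        const auto stop = std::chrono::steady_clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(stop - start_).count();
        out_ << label_ << ": " << ms << " ms\n";
    }

    // Copying would report the same interval twice.
    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

private:
    std::string label_;
    std::ostream& out_;
    std::chrono::steady_clock::time_point start_;
};
```

Usage mirrors the snippet above: open a block, declare `ScopedTimer t("Sobel");`, call the function under test, and close the block. The destructor runs at the closing brace, so the measured interval includes everything inside the block, thread creation and joining included.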
-
hmm... i don't recall seeing any exceptions in the output window, but i'll check again.
-
I have seen performance increase by as much as 25% in certain functions when porting old code up to VS2005. In the days of VC6 I could use __asm and optimize algorithms better than the compiler. I have a hard time beating the VC2005/VC2008 compiler optimizations now, and I have pretty much given up on trying to beat the latest compilers. You could use the /FAs compiler switch and compare the assembly generated by each compiler to see the difference.
Best Wishes,
-David Delaune
-
The heap manager for VS2005 is faster than that for Visual 6. I ran across this a few weeks ago. How thread local storage is handled is different for VS2005 as well, though I doubt this would affect performance in a significant way. You also need to verify you really are testing what you think you're testing. How fast is the second thread even executing? Due to small changes in the CRT, the new thread may not be getting a quantum as quickly under 6. Another possibility is that some paging is happening under 6 that isn't happening under 05 (I have seen that also).
Anyone who thinks he has a better idea of what's good for people than people do is a swine. - P.J. O'Rourke
-
Joe Woodbury wrote:
How fast is the second thread even executing?
that's an interesting question. i never thought to time each thread independently.
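Timing each thread on its own, as raised here, could be done along these lines. This is only a sketch: `time_each_thread` is a hypothetical helper, and it measures the time spent inside each thread's work function, not the delay before the thread first gets scheduled.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Runs work(t) on `threads` threads and returns, per thread, the elapsed
// wall-clock time in milliseconds as measured from inside that thread.
std::vector<double> time_each_thread(int threads,
                                     const std::function<void(int)>& work) {
    assert(threads > 0);
    std::vector<double> elapsed_ms(static_cast<std::size_t>(threads), 0.0);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&elapsed_ms, &work, t] {
            const auto start = std::chrono::steady_clock::now();
            work(t);
            const auto stop = std::chrono::steady_clock::now();
            // Each thread writes only its own slot, so no locking is needed.
            elapsed_ms[static_cast<std::size_t>(t)] =
                std::chrono::duration<double, std::milli>(stop - start).count();
        });
    }
    for (auto& th : pool) th.join();
    return elapsed_ms;
}
```

If the per-thread times are all short but the overall wall-clock time is long, the cost is outside the work itself (thread start-up, scheduling, or joining); if one thread is much slower than the others, the strips are unbalanced or that thread is being starved.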
-
Yeah, I'd say it's hard to beat the machine in this case.