Adventures in Async
-
Never bothered with async programming before since I never needed it. But now I'm having to take care of a weekly delivery of an 80 GB (eighty gigabytes) XML file. Parsing it and saving 10 million records to 30 different tables in a database takes more than an hour, and there's no simple optimization left to do. But I'm only using one core of the processor, so let's go parallel, it'll be fun learning. Right? The easiest part is bulk copying to the database in parallel. Easy enough, but it only shaves five minutes from the total time; that's not where the biggest bottleneck is. The biggest bottleneck is the actual parsing of the XML. I don't want to rework the whole application to use locks and thread-safe collections, so I decide to split the work vertically instead: add a task for every collection of data. Also easy enough, and now the processor is working close to 100% - but it takes twice as long. :wtf: Apparently the creation of tasks has more overhead than the parsing of the data itself. :laugh: No shortcuts for me today. Back to the drawing board.
Wrong is evil and must be defeated. - Jeff Ello
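A minimal sketch of the coarse-grained batching that usually fixes this, assuming a .NET setup like the one described (`ParseRecords` and `SaveBatch` are hypothetical stand-ins for the real parsing and SqlBulkCopy code): keep the parsing single-threaded, but hand records to tasks in big chunks, so each task carries enough work to pay for its own creation.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

static class BatchingSketch
{
    class Record { }                                            // parsed fields would go here
    static IEnumerable<Record> ParseRecords() { yield break; }  // hypothetical parser
    static void SaveBatch(List<Record> batch) { }               // hypothetical SqlBulkCopy wrapper

    static void Run()
    {
        const int batchSize = 10_000;   // large enough to amortize task-creation overhead
        var pending = new List<Task>();
        var batch = new List<Record>(batchSize);

        foreach (var record in ParseRecords())          // parsing stays on one thread
        {
            batch.Add(record);
            if (batch.Count == batchSize)
            {
                var full = batch;                       // hand off the full batch...
                batch = new List<Record>(batchSize);    // ...and start filling a new one
                pending.Add(Task.Run(() => SaveBatch(full)));
            }
        }
        if (batch.Count > 0)
            pending.Add(Task.Run(() => SaveBatch(batch)));

        Task.WaitAll(pending.ToArray());
    }
}
```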
Are you loading the entire file into an XmlDocument or XDocument? It might be better to stream the file using the XmlReader class. It's a lot more work for you, but it should improve the performance. How to perform streaming transform of large XML documents (C#) | Microsoft Docs[^]
"These people looked deep within my soul and assigned me a number based on the order in which I joined." - Homer
-
Kornfeld Eliyahu Peter wrote:
a place where data transfer is done in 80GB XML files it seems to be feasible
Well, that's governments for you. :rolleyes:
Wrong is evil and must be defeated. - Jeff Ello
No way! I work with gov and they're just bright and smooth... crème de la crème... (today is the 5th week I'm waiting for a version update - there are still personnel who have to sign it)
"The only place where Success comes before Work is in the dictionary." Vidal Sassoon, 1928 - 2012
-
OriginalGriff wrote:
Oh yes, as soon as your thread count exceeds the core count, you are going to get some slowdown.
Didn't even do that. :) I'm fully aware of where I went wrong; I posted it for the netizens of the lounge to have a laugh on my behalf. In this case the specific problem is that each piece of work is smaller than the cost of creating a task. And my error in the bigger picture was thinking that one can simply convert a task running in sync to one running in async. It has to be purpose-built.
Wrong is evil and must be defeated. - Jeff Ello
Jörgen Andersson wrote:
netizens of the lounge to have a laugh on my behalf.
We wouldn't do that! :laugh:
"I have no idea what I did, but I'm taking full credit for it." - ThisOldTony AntiTwitter: @DalekDave is now a follower!
-
Are you loading the entire file into an XmlDocument or XDocument? It might be better to stream the file using the XmlReader class.
I'm using an XmlReader to chop up the file stream into an XDocument for every record. Using an XmlReader all the way became too much work, handling null nodes and such.
Wrong is evil and must be defeated. - Jeff Ello
-
Jörgen Andersson wrote:
The biggest bottleneck is the actual parsing of the XML.
If the parsing can be partitioned into n subproblems, where n is the number of cores, then I would consider creating n daemons and locking each one into its own core. If any of them block, offloading the blocking operations to thread pools might help. Partitioning the problem will help to reduce semaphore contention and cache collisions. But I haven't had to populate a large database this way, so I could be full of shite. :-D
Robust Services Core | Software Techniques for Lemmings | Articles
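In .NET terms the same idea could look roughly like this sketch: one long-running worker per core, each draining its own queue, so the workers never contend on a shared collection. Pinning a thread to a core isn't exposed by the Task API, so that part is omitted; `Process` and `Produce` are hypothetical stand-ins.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

static class PartitionedWorkers
{
    static void Process(string item) { }                    // hypothetical per-record work
    static IEnumerable<string> Produce() { yield break; }   // hypothetical record source

    static void Run(int workerCount)
    {
        var queues = new BlockingCollection<string>[workerCount];
        var workers = new Task[workerCount];

        for (int i = 0; i < workerCount; i++)
        {
            queues[i] = new BlockingCollection<string>();
            var queue = queues[i];                           // avoid capturing the loop index
            workers[i] = Task.Factory.StartNew(
                () => { foreach (var item in queue.GetConsumingEnumerable()) Process(item); },
                TaskCreationOptions.LongRunning);
        }

        // The producer routes each item to a fixed partition, e.g. by key hash,
        // so no two workers ever touch the same data
        foreach (var item in Produce())
            queues[Math.Abs(item.GetHashCode()) % workerCount].Add(item);

        foreach (var q in queues) q.CompleteAdding();
        Task.WaitAll(workers);
    }
}
```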
-
If the parsing can be partitioned into n subproblems, where n is the number of cores, then I would consider creating n daemons and locking each one into its own core.
This is exactly what I didn't want to have to learn. :laugh: At least all proper databases already handle parallel execution properly.
Wrong is evil and must be defeated. - Jeff Ello
-
No way! I work with gov and they're just bright and smooth... crème de la crème...
In my case I actually understand them; we're not the only customer for this data, so for them it's just easier to upload a weekly XML file to an FTP server. And it's not even my own government in this case. I don't understand Danish, and Danes take offence if I speak English to them. (Quite rightly so, I might add. :) ) So if I want support I need to employ Johnny.
Wrong is evil and must be defeated. - Jeff Ello
-
Jörgen Andersson wrote:
Easy enough, but it only shaves five minutes from the total time; that's not where the biggest bottleneck is.
Yup - learnt that the hard way: first identify where the program uses its resources.
-
Jörgen Andersson wrote:
I'm using an XmlReader to chop up the file stream into an XDocument for every record.
I wrote a command line app that imports a NESSUS security scan XML data file - the largest I've seen to date is about 8 GB. We import the data into a SQL Server database. It's not multi-threaded at all, that I recall. I do remember that the file was too big for XDocument to work. I feel your pain.
".45 ACP - because shooting twice is just silly" - JSOP, 2010
-----
You can never have too much ammo - unless you're swimming, or on fire. - JSOP, 2010
-----
When you pry the gun from my cold dead hands, be careful - the barrel will be very hot. - JSOP, 2013
-
Jörgen Andersson wrote:
Now I'm having to take care of a weekly delivery of an 80 GB (eighty gigabytes) XML file.
More proof that some people have real problems. So stop complaining, people, you could be Jörgen today.
-
Jörgen Andersson wrote:
Apparently the creation of tasks has more overhead than the parsing of the data itself.
I was going to say - the size of the work done in each task is key... But the underlying technology can also have an effect, by reducing the cost of task creation. If you're using a work queue on top of a thread pool, you're not creating a thread for each task, you're pushing/popping tasks on and off a queue. I created a [little tool to detect duplicate files](https://github.com/studoot/duplicate-finder) using that sort of parallelism. It contains two main areas of parallelism:

1. The [file search library](https://docs.rs/ignore/0.4.16/ignore/) that I use adds a new task for each directory it sees. Each task processes just the files that are immediate children of the directory the task was created for.
2. The detection of duplicates is split so that each task hashes a group of files that have the same size. This is performed using a [data parallelism library](https://docs.rs/rayon/1.3.1/rayon/), which makes parallelising things very easy.

The amount of speedup I get isn't anywhere near the number of processor cores in use (I get a factor of just over two on an eight-core machine), but I think the amount of IO being done serialises the processing to a certain degree. Benchmarking [ripgrep](https://blog.burntsushi.net/ripgrep/), another tool that uses similar parallelism, shows that running with 8 threads (on 8 logical/4 physical cores) is just over 3x faster than using 1.
Java, Basic, who cares - it's all a bunch of tree-hugging hippy cr*p
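The size-grouping trick translates to any language; a rough C# rendering of that second kind of parallelism might look like the sketch below. Only files of equal size can be duplicates, so each same-size group becomes one parallel work item.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Threading.Tasks;

static class DuplicateSketch
{
    static void FindDuplicates(string root)
    {
        // Group candidate files by size; singleton groups can't contain duplicates
        var bySize = new DirectoryInfo(root)
            .EnumerateFiles("*", SearchOption.AllDirectories)
            .GroupBy(f => f.Length)
            .Where(g => g.Count() > 1);

        // Hash each same-size group as its own parallel work item
        Parallel.ForEach(bySize, group =>
        {
            var byHash = group.GroupBy(f => HashFile(f.FullName));
            foreach (var dupes in byHash.Where(g => g.Count() > 1))
                Console.WriteLine(string.Join(" == ", dupes.Select(f => f.FullName)));
        });
    }

    static string HashFile(string path)
    {
        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(path))
            return BitConverter.ToString(sha.ComputeHash(stream));
    }
}
```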
-
Jörgen Andersson wrote:
Now I'm having to take care of a weekly delivery of an 80 GB (eighty gigabytes) XML file.
Why are you even parsing XML files, and 80 GB ones at that?! And then saving them to the database?! You could try using the SQL Server bulk import tools and avoid programming such stuff altogether...
Caveat Emptor. "Progress doesn't come from early risers – progress is made by lazy men looking for easier ways to do things." Lazarus Long
-
Jörgen Andersson wrote:
Never bothered with async programming before since I never needed it.
Welcome to the cool club though. Ladies can't resist an async coder. #science
Jeremy Falcon
-
Ron Anders wrote:
So stop complaining people, you could be Jörgen today.
...and have no toilet paper.
Jeremy Falcon
-
You could try using the SQL Server bulk import tools and avoid programming such stuff altogether...
Because I want to have the data extracted into normalized tables.
Wrong is evil and must be defeated. - Jeff Ello
-
Jeremy Falcon wrote:
Welcome to the cool club though. Ladies can't resist an async coder. #science
That's seriously the best answer today. :-D
Wrong is evil and must be defeated. - Jeff Ello
-
Jörgen Andersson wrote:
That's seriously the best answer today. :-D
:-D
Jeremy Falcon
-
Ron Anders wrote:
So stop complaining, people, you could be Jörgen today.
Isn't it enough if I'm being me?
Wrong is evil and must be defeated. - Jeff Ello
-
Jörgen Andersson wrote:
Because I want to have the data extracted into normalized tables.
If you have SQL Server there is SSIS anyway... [Importing XML documents using SQL Server Integration Services](https://www.mssqltips.com/sqlservertip/3141/importing-xml-documents-using-sql-server-integration-services/)
Caveat Emptor. "Progress doesn't come from early risers – progress is made by lazy men looking for easier ways to do things." Lazarus Long
-
If you have SQL Server there is SSIS anyway...
I completely missed that possibility. A bit late now, but I'll take a look at it anyway. :thumbsup:
Wrong is evil and must be defeated. - Jeff Ello