How can I make this program parallel?
-
foreach (var fname in fnames)
{
    // Open file, read in file contents,
    // process each line one at a time
    // split into words, add word to index if not already present
    // increment occurrence count for this file and record line number.
    string[] lines = File.ReadAllLines(fname);
    for (int i = 0; i < lines.Length; i++)
    {
        string[] wordsInLineRaw = Regex.Split(lines[i], @"\W+");
        string[] wordsInLine = new string[wordsInLineRaw.Length];
        for (int p = 0; p < wordsInLineRaw.Length; p++)
        {
            wordsInLine[p] = wordsInLineRaw[p].ToLower();
        }

        // Process each word found in line
        for (int j = 0; j < wordsInLine.Length; j++)
        {
            string w = wordsInLine[j];

            // If word in stop list break
            if (stopWords.Contains(w))
                continue;

            // if word not in index add it
            if (!index.ContainsKey(w))
            {
                if (worker.CancellationPending) { } //e.Cancel();
                index.Add(w, new IndexEntry());
            }

            // Update info about word
            index[w].UpdateInfo(fname, i);
        }
    }
}
-
There are a number of ways you can approach this. The simplest is to execute the body of your outer loop for multiple files at the same time. You can use the Task Parallel Library that is part of .NET for this; the Task classes let you run code in parallel and wait for the results at the end. If you choose to do that, the problem becomes "How do I merge the results of the concurrent indexes into one big index?" Some possible solutions are:
- Use a lock to protect access to the index
- Create separate indexes and merge them at the end.
I would go with the second because it uses less shared state, and so has less chance of me introducing a difficult-to-find bug.
var indexFilesTasks = (from fname in fnames
                       select Task.Run(() => IndexSingleFile(fname, stopWords, () => worker.CancellationPending))).ToArray();

Task.WaitAll(indexFilesTasks);

index = MergeIndexes(from t in indexFilesTasks
                     select t.Result);

The IndexSingleFile() method contains all the code from your foreach (var fname in fnames) loop. The MergeIndexes() method takes multiple Dictionary structures and merges them together. Below is the code I used for these. Note this isn't the "best" way, just one way of approaching the problem.
Dictionary<string, IndexEntry> IndexSingleFile(string filename, HashSet<string> stopWords, Func<bool> isCancelling)
{
    var index = new Dictionary<string, IndexEntry>();

    // Open file, read in file contents,
    // process each line one at a time
    // split into words, add word to index if not already present
    // increment occurrence count for this file and record line number.
    string[] lines = File.ReadAllLines(filename);
    for (int i = 0; i < lines.Length; i++)
    {
        string[] wordsInLineRaw = Regex.Split(lines[i], @"\W+");
        string[] wordsInLine = new string[wordsInLineRaw.Length];
        for (int p = 0; p < wordsInLineRaw.Length; p++)
        {
            wordsInLine[p] = wordsInLineRaw[p].ToLower();
        }

        // Process each word found in line
        for (int j = 0; j < wordsInLine.Length; j++)
        {
            string w = wordsInLine[j];

            // If word in stop list skip it
            if (stopWords.Contains(w))
                continue;

            // if word not in index add it
            if (!index.ContainsKey(w))
            {
                if (isCancelling())
                    return index;   // stop early if the caller requested cancellation

                index.Add(w, new IndexEntry());
            }

            // Update info about word
            index[w].UpdateInfo(filename, i);
        }
    }

    return index;
}
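MergeIndexes() would look something like the sketch below. It assumes IndexEntry has a Merge(IndexEntry other) method that combines the occurrence counts and line numbers of two entries; that method is an assumption, since the IndexEntry type isn't shown, so adapt it to however your entries actually store their data.

// Sketch only: folds several per-file indexes into one.
// Assumes a hypothetical IndexEntry.Merge(IndexEntry) that combines
// occurrence counts and line numbers from both entries.
Dictionary<string, IndexEntry> MergeIndexes(IEnumerable<Dictionary<string, IndexEntry>> indexes)
{
    var merged = new Dictionary<string, IndexEntry>();
    foreach (var partial in indexes)
    {
        foreach (var pair in partial)
        {
            IndexEntry existing;
            if (merged.TryGetValue(pair.Key, out existing))
                existing.Merge(pair.Value);   // hypothetical: combine the two entries
            else
                merged.Add(pair.Key, pair.Value);
        }
    }
    return merged;
}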
-
Phil Martin wrote:
The simplest is to read in multiple files at the same time
Wow, where do you get such multi-tasking hard disks from? :-D No, seriously: that step is likely to be the slowest of all, hence hardly anything can be gained from parallelizing. Parallelizing might only help when processing the file contents takes a relatively long time, and in that case I'd uncouple reading and processing (i.e. read the file into memory, then hand it to the processing function while the next file read is started; note: this may require a lot of memory). The locks on the index may cause some extra overhead, i.e. slow things down again.
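Roughly, that decoupling could look like the sketch below: one reader thread produces file contents, several workers index them. IndexLines is a hypothetical helper here (the same per-line logic as IndexSingleFile, just taking lines that were already read into memory).

// Sketch: producer/consumer split between reading and indexing.
// Needs System.Collections.Concurrent, System.Linq and System.Threading.Tasks.
var queue = new BlockingCollection<Tuple<string, string[]>>(boundedCapacity: 4);

var reader = Task.Run(() =>
{
    foreach (var fname in fnames)
        queue.Add(Tuple.Create(fname, File.ReadAllLines(fname)));   // next read overlaps indexing
    queue.CompleteAdding();
});

var workers = Enumerable.Range(0, Environment.ProcessorCount)
    .Select(_ => Task.Run(() =>
    {
        var partial = new Dictionary<string, IndexEntry>();
        foreach (var item in queue.GetConsumingEnumerable())
            IndexLines(item.Item1, item.Item2, stopWords, partial);   // hypothetical helper
        return partial;
    }))
    .ToArray();

reader.Wait();
Task.WaitAll(workers);
index = MergeIndexes(workers.Select(w => w.Result));

The bounded capacity keeps only a few files in flight at once, which limits the "may require large memory" problem mentioned above.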
-
I'm not sure Phil meant to read the files in parallel and then continue processing them on a single thread. By parallelizing at the top level, the thread that reads a file goes on to process it, so I'd say his assertion was correct, though probably better stated as 'read in and process multiple files at the same time'. It's an odd thing to do anyway. If I were to try to improve the performance of this code, the first thing I'd do is get rid of ReadAllLines and switch to streams before even thinking of going parallel.
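A sketch of that change, keeping the same index, stopWords and IndexEntry names from the original code; the only real difference is streaming each file line by line instead of calling ReadAllLines:

foreach (var fname in fnames)
{
    using (var sr = new StreamReader(fname))
    {
        string line;
        int lineNumber = 0;
        while ((line = sr.ReadLine()) != null)   // never holds the whole file in memory
        {
            foreach (var raw in Regex.Split(line, @"\W+"))
            {
                string w = raw.ToLower();
                if (w.Length == 0 || stopWords.Contains(w))   // Regex.Split can yield empty strings
                    continue;

                if (!index.ContainsKey(w))
                    index.Add(w, new IndexEntry());

                index[w].UpdateInfo(fname, lineNumber);
            }
            lineNumber++;
        }
    }
}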
Regards, Rob Philpott.
-
Thanks for clarifying that for me, Rob. And I agree with you; there are many other things I'd do first and differently. I was aiming for minimum effort on the developer's part with minimum side effects. I'll edit the post and try to be a little less ambiguous about my intent.
-
Ha! I wish I had multitasking hard disks! My choice of wording was poor, and I hope the code clarified the intent, i.e. choosing the simplest point in the original code to parallelize. Will the changes be optimal? No. Will they run in parallel with a significant, measurable difference? Yes. And yes, there are many other options, but I think this was one of the simpler approaches to understand. If they already understood ways to set up a dataflow to achieve the task, I'm pretty sure they would have done it that way already.
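Such a dataflow version might look roughly like the sketch below (it needs the TPL Dataflow package, System.Threading.Tasks.Dataflow, and IndexLines is the same hypothetical per-line helper as in the earlier sketch):

// Sketch: one sequential read stage feeding a parallel indexing stage.
var readBlock = new TransformBlock<string, Tuple<string, string[]>>(
    fname => Tuple.Create(fname, File.ReadAllLines(fname)),
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 1 });

var partials = new List<Dictionary<string, IndexEntry>>();
var indexBlock = new ActionBlock<Tuple<string, string[]>>(item =>
{
    var partial = new Dictionary<string, IndexEntry>();
    IndexLines(item.Item1, item.Item2, stopWords, partial);   // hypothetical helper
    lock (partials) partials.Add(partial);
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = Environment.ProcessorCount });

readBlock.LinkTo(indexBlock, new DataflowLinkOptions { PropagateCompletion = true });

foreach (var fname in fnames)
    readBlock.Post(fname);
readBlock.Complete();
indexBlock.Completion.Wait();

index = MergeIndexes(partials);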
-
The minimum change would have been to just use Parallel.For on the outer loop and switch to a ConcurrentDictionary, or use locks around the regular Dictionary. Your overall solution is overly complicated (no offense). Any time you save by parallelizing the main loop will be destroyed (and then some) by the merging loop.
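That minimal change could look roughly like the sketch below (Parallel.ForEach rather than Parallel.For, since the outer loop is a foreach; it assumes the index field can be changed to a ConcurrentDictionary, and IndexEntry.UpdateInfo still needs its own guard because several threads can now hit the same entry):

// Sketch: parallelize the outer loop, let ConcurrentDictionary handle the adds.
// Needs System.Collections.Concurrent and System.Threading.Tasks.
var index = new ConcurrentDictionary<string, IndexEntry>();

Parallel.ForEach(fnames, fname =>
{
    string[] lines = File.ReadAllLines(fname);
    for (int i = 0; i < lines.Length; i++)
    {
        foreach (var raw in Regex.Split(lines[i], @"\W+"))
        {
            string w = raw.ToLower();
            if (w.Length == 0 || stopWords.Contains(w))
                continue;

            IndexEntry entry = index.GetOrAdd(w, _ => new IndexEntry());
            lock (entry)   // guard UpdateInfo in case it isn't thread-safe
            {
                entry.UpdateInfo(fname, i);
            }
        }
    }
});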
-
Thanks for taking the time to reply. And I agree, Parallel.For would have been the least invasive. And no offense taken!