Multi-threaded MFC Webcrawler (VC++)
-
Hi there,

I'm in the process of writing a multi-threaded webcrawler using VC++ 6 and MFC. I've got the basics down, but I can't help feeling like I'm going about it the wrong way.

I've got a Doc/View set up, and when the user chooses some options (base URL / how many threads to launch), the document launches the first thread. To do that, it sends a message (using PostThreadMessage) to the worker. The worker then downloads the first page, parses it for links and sends a message back to the document via the view, as I wasn't able to send messages directly to the document...

As a side note, that process feels a little clunky. It seems logical that you'd probably not want to send messages directly to the document, but I couldn't see where else to do it; after all, the document is storing all the links the workers download.

...anyway, once the document knows there's an idle worker, it gets all the links from the worker (it calls a member function on the worker to retrieve the data, which should be safe as the worker is idling now?) and adds them to the end of the master list. Since the document knows there's now a free worker, it pulls the next link off the master list and launches the worker off again.

Does this sound like a reasonable way of doing things? I'm not quite sure why, but it feels a little inelegant, and I can't help feeling there's probably a better way of doing it. Any suggestions / comments would be most welcome!

Cheers,
Jon

Success is 99% failure
-
If I were you, I'd consider writing or using a thread pool. Have the first thread start with the job of downloading the first page, then perform your default processing on it, which is to parse it for links. Then fill the thread pool's job list with one job per URL, and so on. Add a depth count to each job so that you can limit how far the crawl goes. Before a thread fills the thread pool's job list with more URLs to investigate, have it post some data into another list that the doc/view is in charge of, so that the user can get a sense of what's happening.

P.S. A thread pool is a simple structure of X threads that wait for "jobs" to handle. They share a single list of pending jobs, and whenever a job exists in the list, an event is set and the first thread to catch it is the one that removes that job from the list and handles it. It requires some synchronization (locking the list, waiting for input, waiting for all threads to shut themselves down, etc.), but if you write it in a general way, it's worth it.
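
To make the thread-pool idea a bit more concrete, here is a rough sketch of the shared job list with a depth count. It is only an illustration under some assumptions: it uses standard C++11 threads, a mutex and a condition variable in place of the Win32 events and critical sections you would use in VC++ 6, and the names `CrawlJob`, `CrawlPool`, `submit`, `waitUntilIdle` and `visited` are made up for this example. The actual download and link parsing are left to a handler you pass in, and filtering out already-seen URLs is not shown.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Hypothetical job type: one URL plus how many links deep it sits.
struct CrawlJob {
    std::string url;
    int depth;
};

class CrawlPool {
public:
    // The handler downloads/parses one job and returns the links it found.
    using Handler = std::function<std::vector<std::string>(const CrawlJob&)>;

    CrawlPool(std::size_t threadCount, int maxDepth, Handler handler)
        : maxDepth_(maxDepth), handler_(std::move(handler)) {
        assert(threadCount > 0);
        for (std::size_t i = 0; i < threadCount; ++i)
            workers_.emplace_back([this] { workerLoop(); });
    }

    ~CrawlPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;  // pending jobs are abandoned on shutdown
        }
        wake_.notify_all();
        for (auto& t : workers_) t.join();
    }

    void submit(CrawlJob job) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
            ++outstanding_;
        }
        wake_.notify_one();
    }

    // Block until every job, including the ones spawned by others, is done.
    void waitUntilIdle() {
        std::unique_lock<std::mutex> lock(mutex_);
        idle_.wait(lock, [this] { return outstanding_ == 0; });
    }

    // Snapshot of the URLs handled so far, e.g. for the UI to display.
    std::vector<std::string> visited() {
        std::lock_guard<std::mutex> lock(mutex_);
        return visited_;
    }

private:
    void workerLoop() {
        for (;;) {
            CrawlJob job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (stopping_) return;
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            // Do the slow work without holding the lock.
            std::vector<std::string> links = handler_(job);
            {
                std::lock_guard<std::mutex> lock(mutex_);
                visited_.push_back(job.url);
                if (job.depth < maxDepth_) {  // the depth count limits the crawl
                    for (const auto& link : links) {
                        jobs_.push(CrawlJob{link, job.depth + 1});
                        ++outstanding_;
                    }
                }
                if (--outstanding_ == 0) idle_.notify_all();
            }
            wake_.notify_all();  // new jobs may be waiting for idle workers
        }
    }

    int maxDepth_;
    Handler handler_;
    std::mutex mutex_;
    std::condition_variable wake_;
    std::condition_variable idle_;
    std::queue<CrawlJob> jobs_;
    std::vector<std::string> visited_;
    std::size_t outstanding_ = 0;  // queued jobs plus jobs in progress
    bool stopping_ = false;
    std::vector<std::thread> workers_;
};
```

With something like this, the document no longer has to hand links to individual workers: it submits the base URL as a job at depth 0 and the workers feed the list themselves. In an MFC app, the place where this sketch appends to `visited_` is probably where a worker would post a message to the view's window instead, so the UI gets updated without polling.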