Need to create a crawler/spider in VC++
-
Very good... now, what is the problem?
I need to decide many things before starting the project, because I am the only one responsible for it. So please give me some initial guidelines to start with: what should I use (a Win32 EXE, a Win32 DLL, COM, etc.), and which inter-process communication mechanism should I use?
Thanks A Ton Ash_VCPP
-
Wow, starting the working day with a smile is very good - high five. :)
If the Lord God Almighty had consulted me before embarking upon the Creation, I would have recommended something simpler. -- Alfonso the Wise, 13th Century King of Castile.
This is going on my arrogant assumptions. You may have a superb reason why I'm completely wrong. -- Iain Clarke
[My articles] -
Do you have any idea about crawlers? If yes, then please tell me how to start working on it; it's urgent... :-O
Thanks A Ton Ash_VCPP
Ash_VCPP wrote:
Do you have any idea about crawlers?
Yes.
Ash_VCPP wrote:
then please tell me how to start working on it; it's urgent...
Sorry, *urgent* questions automatically fall to the bottom of the stack (just a bit above *very urgent* questions). :)
If the Lord God Almighty had consulted me before embarking upon the Creation, I would have recommended something simpler. -- Alfonso the Wise, 13th Century King of Castile.
This is going on my arrogant assumptions. You may have a superb reason why I'm completely wrong. -- Iain Clarke
[My articles] -
Hi all, I have an urgent requirement to create a crawler with which I can fetch data from a URL; the IDE should be VC++.
Thanks A Ton Ash_VCPP
As you may have seen from the responses, it's not a very good question.
1/ You haven't actually asked a question - you've just told us you have work to do. While we are, of course, very happy for you, there's not much to answer.
2/ You've got quite a big challenge, especially if you're starting from scratch.
3/ You can break it down into several smaller challenges: handling delays and timeouts, getting HTTP pages, parsing them into links, and so on.
I've attached below some code I wrote years ago that grabbed a certain page from a specific URL every hour or so - an early RSS reader, essentially. It may help you with your search terms. There are other articles on CodeProject about grabbing information from web pages; John Simmons wrote one recently scraping information from a CodeProject page. Good luck with your task! Iain.
DWORD WINAPI UpdatePageThread (LPVOID lpParameter)
{
    HWND hWnd = (HWND)lpParameter;
    DWORD dw, dwDelay = 100;
    HINTERNET hInternet, hIConnect, hIRequest;
    BOOL bSuccess;
    DWORD dwStatus, dwSize, dwIndex;
    PCHAR AcceptTypes [] = { "text/*", NULL };

    // Set up the query.
    hInternet = NULL;
    hIConnect = NULL;
    hIRequest = NULL;
    hInternet = ::InternetOpen ("OC UK Notify", INTERNET_OPEN_TYPE_PRECONFIG, NULL, NULL, 0);
    if (hInternet)
        hIConnect = ::InternetConnect (hInternet, "www.overclock-uk.net", INTERNET_DEFAULT_HTTP_PORT,
                                       "user", "pass", INTERNET_SERVICE_HTTP, 0, 1);
    if (hIConnect)
    {
        hIRequest = ::HttpOpenRequest (hIConnect, NULL, "update.ocuk", NULL, NULL,
                                       (const char **)AcceptTypes,
                                       INTERNET_FLAG_NO_CACHE_WRITE | INTERNET_FLAG_NO_COOKIES |
                                       INTERNET_FLAG_NO_UI | INTERNET_FLAG_RELOAD | INTERNET_FLAG_NO_AUTH,
                                       1);
    }
    if (!hIRequest)
        return 1;   // Raise an error?

    char buf [4096];
    std::string Page;
    while (1)
    {
        dw = WaitForSingleObject (g_hEventStop, dwDelay);
        if (dw != WAIT_TIMEOUT)
            break;
        // dwDelay = 30000;     // Wait a minute before we try again.
        dwDelay = 90 * 60000;   // 3/2 hours.

        bSuccess = ::HttpSendRequest (hIRequest, NULL, 0, NULL, 0);
        if (!bSuccess)
            continue;           // Try again in a while.

        dwSize = sizeof (DWORD);
        dwIndex = 0;
        bSuccess = ::HttpQueryInfo (hIRequest, HTTP_QUERY_STATUS_CODE | HTTP_QUERY_FLAG_NUMBER,
                                    &dwStatus, &dwSize, &dwIndex);
        if (!bSuccess)
            continue;
        dwStatus /= 100;        // Just get the 2XX part.
        if (dwStatus != 2)
            continue;

        Page.erase ();
        while (1)
        {
            memset (buf, 0, sizeof (buf));
            bSuccess = ::InternetReadFile (hIRequest, buf, sizeof (buf), &dw);
            // The original listing was cut off mid-call here; the usual
            // completion of the read loop is:
            if (!bSuccess || dw == 0)
                break;
            Page.append (buf, dw);
        }
        // (The remainder of the listing - presumably notifying hWnd with
        // the fetched page and closing the handles - was truncated in the
        // original post.)
    }
    return 0;
}
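As a minimal sketch of the "parsing them into links" step mentioned above (this snippet is not from the original post, and a real crawler would want a proper HTML parser rather than string scanning), a naive href scanner over the fetched Page could look like this:

#include <string>
#include <vector>

// Naive link extraction: scans a fetched page for href="..." values.
// It ignores single quotes, relative-URL resolution and encodings, and
// is only meant to suggest search terms, not to parse real-world HTML.
std::vector<std::string> ExtractLinks (const std::string &Page)
{
    std::vector<std::string> links;
    std::string::size_type pos = 0;
    for (;;)
    {
        pos = Page.find ("href=\"", pos);
        if (pos == std::string::npos)
            break;
        pos += 6;   // Skip past href="
        std::string::size_type end = Page.find ('"', pos);
        if (end == std::string::npos)
            break;
        links.push_back (Page.substr (pos, end - pos));
        pos = end + 1;
    }
    return links;
}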
-
Hi Iain, thanks for providing this important information and code. I will try it this way, and if I run into any difficulties I will let you know. Once again, thanks for the reply.
Thanks A Ton Ash_VCPP
The website / page this code pointed to has long since gone, by the way! And take the error checking with heavy skepticism... Iain.
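In that spirit, here is a minimal sketch (not from the original post) of one way to tighten it up: wrapping the HINTERNET handles in a small RAII class closes them on every exit path, including the early return 1; in the listing above.

#include <windows.h>
#include <wininet.h>   // link with wininet.lib

// Closes a WinInet handle automatically when it goes out of scope,
// so early returns no longer leak the connection/request handles.
class CInetHandle
{
public:
    explicit CInetHandle (HINTERNET h = NULL) : m_h (h) {}
    ~CInetHandle () { if (m_h) ::InternetCloseHandle (m_h); }
    operator HINTERNET () const { return m_h; }
private:
    HINTERNET m_h;
    CInetHandle (const CInetHandle &);            // non-copyable
    CInetHandle &operator= (const CInetHandle &); // non-copyable
};

Constructing the handles directly, e.g. CInetHandle hInternet (::InternetOpen (...));, is then enough - whatever was opened gets closed when the thread function returns.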
CodeProject MVP for C++ - I can't believe it's for my lounge posts...
-
Hi Ash, do you still need the code? If yes, then please let me know.
-
Ash_VCPP wrote:
I have an urgent requirement to create a crawler...
Care to define this?
"Old age is like a bank account. You withdraw later in life what you have deposited along the way." - Unknown
"Fireproof doesn't mean the fire will never come. It means when the fire comes that you will be able to withstand it." - Michael Simmons
-
I got your point to some extent, but I would be pleased if you could explain it more...
Thanks A Ton Ash_VCPP
Ash_VCPP wrote:
...I would be pleased if you could explain it more...
I believe that was the question I posed to you. The term "crawler" can take on several different meanings. What is yours?
"Old age is like a bank account. You withdraw later in life what you have deposited along the way." - Unknown
"Fireproof doesn't mean the fire will never come. It means when the fire comes that you will be able to withstand it." - Michael Simmons
-
Basically, I need an EXE which can fetch data from any URL and dump it into a database...
Thanks A Ton Ash_VCPP
Ash_VCPP wrote:
...fetch data from any URL...
Such as URLDownloadToFile()?
"Old age is like a bank account. You withdraw later in life what you have deposited along the way." - Unknown
"Fireproof doesn't mean the fire will never come. It means when the fire comes that you will be able to withstand it." - Michael Simmons
-