Get dynamic web content
-
I've written a small web crawler in VC++ that grabs financial indices and quotes from different websites and displays them. As long as the sources are plain HTML, everything is fine. Now I've got a website that shows the quotes dynamically (http://www.forexpf.ru/quote_show.php). IMHO there are two ways to extract the information:
1. Grab the page as an image, run OCR on it, and extract the info.
2. Load the page into a browser control to build the content, copy the content text (into the clipboard), and extract the information.
#1 works in general, but the OCR isn't accurate enough. For #2: are there any examples that show how to handle the clipboard? On the other hand, using the clipboard wouldn't be my first choice, because the grab process is repeated automatically in the background, and using the clipboard would interfere with other running applications. Are there any other ideas for solving the problem? TIA M.
-
Mathefreak wrote:
2. load the page into a browser control to build the content, copy the content text (into clipboard) and extract information
Try using IWebBrowser2, IHTMLDocument, IHTMLElement, and related interfaces. Regards, Paresh.
-
Mathefreak wrote:
If the sources are plain html, everything is fine. Now I've got a website which shows the quotes dynamically (http://www.forexpf.ru/quote_show.php).
But the tables are still HTML. Unless I'm misunderstanding, isn't row #3 of the upper-left table always "NASD Comp"? Or are you saying that the first column in each table continually changes?
"A good athlete is the result of a good and worthy opponent." - David Crow
"To have a respect for ourselves guides our morals; to have deference for others governs our manners." - Laurence Sterne
-
By "dynamically", I assume you mean you can't rely on the order of information? If so, you could scrape tuples (e.g. "NASD100=1888.08") instead of assuming the location of specific entries in the table. Btw, I wrote this in order to build this. /ravi
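The tuple idea above could be sketched roughly as follows. This is only an illustration in portable C++, not code from the linked article: the helper name findQuote, the accepted separators (space, tab, '=' and ':') and the use of std::string instead of CString are all assumptions.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helper: looks for `label` anywhere in `text` and, if it is
// followed by at least one separator and then a number, stores that number
// in `value`. Works for input such as "NASD100=1888.08" or "DAX   7012.50",
// independent of where the entry sits in the page.
bool findQuote(const std::string& text, const std::string& label, double& value)
{
    if (label.empty())
        return false;

    std::size_t pos = 0;
    while ((pos = text.find(label, pos)) != std::string::npos) {
        // The label must not be the tail of a longer word (e.g. "TecDAX").
        bool startsWord = (pos == 0) ||
            !std::isalnum(static_cast<unsigned char>(text[pos - 1]));

        // Skip the separators between the label and its value.
        std::size_t i = pos + label.size();
        std::size_t sepStart = i;
        while (i < text.size() &&
               (text[i] == ' ' || text[i] == '\t' || text[i] == '=' || text[i] == ':'))
            ++i;
        bool hasSeparator = (i > sepStart);

        if (startsWord && hasSeparator && i < text.size() &&
            (std::isdigit(static_cast<unsigned char>(text[i])) ||
             text[i] == '-' || text[i] == '+')) {
            // strtod uses the current C locale; the default "C" locale
            // expects '.' as the decimal point.
            const char* begin = text.c_str() + i;
            char* end = 0;
            double parsed = std::strtod(begin, &end);
            if (end != begin) {
                value = parsed;
                return true;
            }
        }
        ++pos;  // keep searching after this occurrence
    }
    return false;
}
```

Calling findQuote(pageText, "DAX", quote) then returns true and fills quote if the entry is present, no matter which row it appears in.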
-
The only things that change in the resulting web page are the quotes. My aim is to get the quote for DAX (7th row in the upper-left table). Are there any examples of using the IWebBrowser2 interface to get this information? TIA M.
-
Hi Ravi, unfortunately it's not only plain HTML. There are some Java functions embedded to grab the current quotes. Nevertheless, after searching around the net a bit, I proudly present the solution, which works for me :-D
Sample application:
- simple MFC dialog
- one web browser control (m_WebBrowserCtrl)
- the website is loaded and refreshed by a button click
- clicking another button copies the content of the site (plain text, not the HTML source) into a CString variable so the data can be parsed.
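// Requires #include <mshtml.h> for IHTMLDocument2 and IHTMLElement.
void CWebbrowser_TestDlg::OnCopy()
{
    // GetDocument() returns the document's IDispatch, or NULL if no page is loaded.
    LPDISPATCH lpDispatch = m_WebBrowserCtrl.GetDocument();
    if (!lpDispatch)
        return;

    IHTMLDocument2* pHTMLDocument2 = NULL;
    HRESULT hr = lpDispatch->QueryInterface(IID_IHTMLDocument2, (LPVOID*)&pHTMLDocument2);
    lpDispatch->Release();
    if (FAILED(hr) || !pHTMLDocument2)
        return;

    IHTMLElement* pBody = NULL;
    hr = pHTMLDocument2->get_body(&pBody);
    pHTMLDocument2->Release();
    if (FAILED(hr) || !pBody)
        return;

    // outerText is the rendered text of the body, not the HTML source.
    BSTR bstrSource = NULL;
    hr = pBody->get_outerText(&bstrSource);
    pBody->Release();
    if (FAILED(hr))
        return;

    CString sText;
    sText = bstrSource;
    ::SysFreeString(bstrSource);  // the caller owns the returned BSTR
    MessageBox(sText);
}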
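Once the page text is in a string, picking out the DAX row could look roughly like the sketch below. It assumes that each table row ends up on its own line of the text and that the index name is the first whitespace-separated field, which should be checked against the real output. The helper name findRow is made up for illustration, and std::string is used instead of CString only to keep the sketch portable.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits the page text into lines and returns the
// whitespace-separated fields of the first line whose first field equals
// `name`. Returns an empty vector if no such line exists.
// Limitation: names that contain spaces (e.g. "NASD Comp") are split into
// several fields and will not match.
std::vector<std::string> findRow(const std::string& pageText, const std::string& name)
{
    std::istringstream lines(pageText);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        std::vector<std::string> row;
        std::string field;
        while (fields >> field)   // also skips a trailing '\r'
            row.push_back(field);
        if (!row.empty() && row[0] == name)
            return row;
    }
    return std::vector<std::string>();
}
```

With a row found this way, the quote itself would be one of the later fields (for example row[1]), which can then be converted to a number.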
Comments are welcome. Next step is to use the code in my application, but that seems to be easy. Greets M.