IHTMLDocument2 for Table Par [modified]
-
Hello everyone,
This is my second post about extracting text from web pages. In my previous post, David Crow suggested that I use the IHTMLDocument2 interface. On CodeProject I found the article "Parsing HTML using MSHTML" by Philip Patrick, whose sample application extracts all the links in a web page by reading the HREF attribute of each A tag. Can I use the same application to extract the text from a table in a web page? I searched for a way to read text from the table tag but did not find any information. Can anyone please tell me whether it is possible to extract the text from the <table> tag using MSHTML and the IHTMLDocument2 interface?
Thanking you,
Naveen HS.
void CTestDlg::OnBgo()
{
    UpdateData();
    CWaitCursor wait;
    if(m_csFilename.IsEmpty()){
        AfxMessageBox(_T("Please specify the file to parse"));
        return;
    }
    CFile f;
    //let's open file and read it into CString (u can use any buffer to read though)
    if (f.Open(m_csFilename, CFile::modeRead|CFile::shareDenyNone)) {
        m_wndLinksList.ResetContent();
        CString csWholeFile;
        f.Read(csWholeFile.GetBuffer(f.GetLength()), f.GetLength());
        csWholeFile.ReleaseBuffer(f.GetLength());
        f.Close();

        //declare our MSHTML variables and create a document
        MSHTML::IHTMLDocument2Ptr pDoc;
        MSHTML::IHTMLDocument3Ptr pDoc3;
        MSHTML::IHTMLElementCollectionPtr pCollection;
        MSHTML::IHTMLElementPtr pElement;
        HRESULT hr = CoCreateInstance(CLSID_HTMLDocument, NULL, CLSCTX_INPROC_SERVER,
            IID_IHTMLDocument2, (void**)&pDoc);

        //put the code into SAFEARRAY and write it into document
        SAFEARRAY* psa = SafeArrayCreateVector(VT_VARIANT, 0, 1);
        VARIANT *param;
        bstr_t bsData = (LPCTSTR)csWholeFile;
        hr = SafeArrayAccessData(psa, (LPVOID*)&param);
        param->vt = VT_BSTR;
        param->bstrVal = (BSTR)bsData;
        hr = pDoc->write(psa);
        hr = pDoc->close();
        SafeArrayDestroy(psa);

        //I'll use IHTMLDocument3 to retrieve tags. Note it is available only in IE5+
        //If you don't want to use it, u can just run through all tags in HTML
        //(IHTMLDocument2->all property)
        pDoc3 = pDoc;

        //display HREF parameter of every link (A tag) in ListBox
        pCollection = pDoc3->getElementsByTagName(L"A");
        for(long i=0; i<pCollection->length; i++){
            pElement = pCollection->item(i, (long)0);
            if(pElement !
-
You could use the IHTMLDocument2::get_all method to get an IHTMLElementCollection interface. You can then iterate through this collection and call IHTMLElement::get_tagName on each element to check whether it is a table.
«_Superman_» I love work. It gives me something to do between weekends.
Microsoft MVP (Visual C++)
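The approach described in this reply could be sketched roughly as follows. This is an untested sketch that assumes the plain `mshtml.h` and ATL headers rather than the `#import` wrappers used in the sample application; `IsTableTag` and `CollectTableText` are hypothetical names that do not come from the thread. MSHTML typically reports HTML tag names in upper case, so the comparison ignores case.

```cpp
#include <cwctype>
#include <string>
#include <vector>

// Portable helper (hypothetical name): true if tagName is "table" in any case.
bool IsTableTag(const std::wstring& tagName)
{
    static const std::wstring kTable = L"table";
    if (tagName.size() != kTable.size()) return false;
    for (size_t i = 0; i < kTable.size(); ++i) {
        if (std::towlower(tagName[i]) != kTable[i]) return false;
    }
    return true;
}

#ifdef _MSC_VER  // the MSHTML part below only builds with Visual C++ and ATL
#include <atlbase.h>
#include <mshtml.h>

// Hypothetical helper: appends the inner text of every <table> in pDoc to out.
HRESULT CollectTableText(IHTMLDocument2* pDoc, std::vector<std::wstring>& out)
{
    if (!pDoc) return E_POINTER;

    CComPtr<IHTMLElementCollection> pAll;
    HRESULT hr = pDoc->get_all(&pAll);       // result arrives via out-parameter
    if (FAILED(hr)) return hr;
    if (!pAll) return E_FAIL;

    long count = 0;
    hr = pAll->get_length(&count);
    if (FAILED(hr)) return hr;

    for (long i = 0; i < count; ++i) {
        CComVariant name(i), index(0L);      // numeric "name" selects by position
        CComPtr<IDispatch> pDisp;
        if (FAILED(pAll->item(name, index, &pDisp)) || !pDisp) continue;

        CComQIPtr<IHTMLElement> pElem(pDisp); // QueryInterface for IHTMLElement
        if (!pElem) continue;

        CComBSTR tag;
        if (FAILED(pElem->get_tagName(&tag)) || !tag.m_str) continue;
        if (!IsTableTag(std::wstring(tag.m_str, tag.Length()))) continue;

        CComBSTR text;
        if (SUCCEEDED(pElem->get_innerText(&text)) && text.m_str)
            out.push_back(std::wstring(text.m_str, text.Length()));
    }
    return S_OK;
}
#endif
```

The smart pointers release each interface when they go out of scope, so no explicit Release calls are needed in the loop.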
-
Hello Sir,
Thank you very much for the response. I have added these lines to the above code:
    MSHTML::IHTMLDocument2 *pDoc1 = NULL;
    MSHTML::IHTMLElementCollection *pColl = NULL;
    pColl = pDoc1->get_all(L"table");
    int y = pColl->length;
    for(int x = 0; x < y ; x++)
    {
    }
I am getting the following error:
error C2664: 'MSHTML::IHTMLDocument2::get_all' : cannot convert parameter 1 from 'const wchar_t [6]' to 'MSHTML::IHTMLElementCollection **'
NaveenHS wrote:
    pColl = pDoc1->get_all(L"table");
This is wrong. get_all gives you all elements, not all tables, and it takes the address of an IHTMLElementCollection pointer as its parameter. As a general rule, COM calls in C++ return an HRESULT and hand back their results through out-parameters; AddRef and Release are the main exceptions. So you will have to check the tag name of each element in a loop to see whether it is a table. The get_all call would look like this:
    pDoc1->get_all(&pColl);
I would recommend using ATL here. Otherwise you will have to remember to call Release for all these pointers. So my recommended way would look like this:
    CComPtr<MSHTML::IHTMLElementCollection> pColl = NULL;
    pDoc1->get_all(&pColl);
You can use the IHTMLElementCollection::item method to get an IDispatch pointer to each element in the collection. This can in turn be QIed (queried via QueryInterface) to an IHTMLElement or IHTMLDOMNode interface. Check the methods of these interfaces to identify the table tags that you need.
«_Superman_» I love work. It gives me something to do between weekends.
Microsoft MVP (Visual C++)
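Once an element is known to be a table, one possible way to reach the individual cells is to query it for `IHTMLTable` and walk its rows and cells. The following is an untested sketch under several assumptions: it uses the plain `mshtml.h` and ATL headers, it relies on `IHTMLTable::get_rows` and `IHTMLTableRow::get_cells` (interfaces the thread itself does not mention), and `JoinCells`, `ItemAt` and `ReadTableRows` are hypothetical names.

```cpp
#include <string>
#include <vector>

// Portable helper (hypothetical name): joins the cell texts of one row.
std::wstring JoinCells(const std::vector<std::wstring>& cells, wchar_t sep = L'\t')
{
    std::wstring line;
    for (size_t i = 0; i < cells.size(); ++i) {
        if (i != 0) line += sep;
        line += cells[i];
    }
    return line;
}

#ifdef _MSC_VER  // the MSHTML part below only builds with Visual C++ and ATL
#include <atlbase.h>
#include <mshtml.h>

// Hypothetical helper: fetches item i of a collection as IDispatch.
static HRESULT ItemAt(IHTMLElementCollection* coll, long i, IDispatch** ppDisp)
{
    CComVariant name(i), index(0L);
    return coll->item(name, index, ppDisp);
}

// Hypothetical helper: one output line per table row, cells separated by tabs.
HRESULT ReadTableRows(IHTMLElement* pTableElem, std::vector<std::wstring>& lines)
{
    CComQIPtr<IHTMLTable> pTable(pTableElem);   // fails if element is not a table
    if (!pTable) return E_NOINTERFACE;

    CComPtr<IHTMLElementCollection> pRows;
    HRESULT hr = pTable->get_rows(&pRows);
    if (FAILED(hr)) return hr;
    if (!pRows) return E_FAIL;

    long rowCount = 0;
    pRows->get_length(&rowCount);
    for (long r = 0; r < rowCount; ++r) {
        CComPtr<IDispatch> pRowDisp;
        if (FAILED(ItemAt(pRows, r, &pRowDisp)) || !pRowDisp) continue;
        CComQIPtr<IHTMLTableRow> pRow(pRowDisp);
        if (!pRow) continue;

        CComPtr<IHTMLElementCollection> pCells;
        if (FAILED(pRow->get_cells(&pCells)) || !pCells) continue;
        long cellCount = 0;
        pCells->get_length(&cellCount);

        std::vector<std::wstring> cells;
        for (long c = 0; c < cellCount; ++c) {
            CComPtr<IDispatch> pCellDisp;
            if (FAILED(ItemAt(pCells, c, &pCellDisp)) || !pCellDisp) continue;
            CComQIPtr<IHTMLElement> pCell(pCellDisp);
            CComBSTR text;
            if (pCell && SUCCEEDED(pCell->get_innerText(&text)) && text.m_str)
                cells.push_back(std::wstring(text.m_str, text.Length()));
            else
                cells.push_back(std::wstring());  // keep column positions aligned
        }
        lines.push_back(JoinCells(cells));
    }
    return S_OK;
}
#endif
```

Keeping an empty string for a cell without text preserves the column positions, which matters if the lines are later split on the separator again.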
-
Hello Sir,
Thanks a lot for the reply. I made the changes you mentioned, but I am still not able to extract the data from the table. Can you please tell me what change I have to make?
    MSHTML::IHTMLDocument2Ptr pDoc;
    MSHTML::IHTMLDocument3Ptr pDoc3;
    MSHTML::IHTMLElementCollectionPtr pCollection;
    MSHTML::IHTMLElementPtr pElement;
    HRESULT hr = CoCreateInstance(CLSID_HTMLDocument, NULL, CLSCTX_INPROC_SERVER,
        IID_IHTMLDocument2, (void**)&pDoc);
    pDoc3 = pDoc;
    pDoc->get_all(&pCollection);
    pCollection = pDoc3->getElementsByTagName(L"table");
    for(long i=0; i<pCollection->length; i++){
        pElement = pCollection->item(i, (long)0);
        if(pElement != NULL){
            m_wndLinksList.AddString((LPCTSTR)bstr_t(pElement->getAttribute("table"),10));
        }
    }
Error:
error C2660: 'MSHTML::IHTMLElement::getAttribute' : function does not take 1 arguments
The call to getAttribute is not correct. It takes the attribute name, a flags value, and a pointer to a VARIANT that receives the result. Do it this way:
    CComBSTR name("table");
    VARIANT result;
    pElement->getAttribute(name, 0, &result);
«_Superman_» I love work. It gives me something to do between weekends.
Microsoft MVP (Visual C++)
-
Hello sir,
I added the code, but I am still getting the error.
    pDoc->get_all(&pCollection);
    pCollection = pDoc3->getElementsByTagName("table");
    CComBSTR name("table");
    VARIANT result;
    for(long i=0; i<pCollection->length; i++){
        pElement = pCollection->item(i, (long)0);
        if(pElement != NULL){
            m_wndLinksList.AddString((LPCTSTR)bstr_t(pElement->getAttribute(name, 0, &result)));
        }
    }
I am getting the below error:
error C2660: 'MSHTML::IHTMLElement::getAttribute' : function does not take 3 arguments
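A possible explanation for the two C2660 errors, offered as an assumption rather than a confirmed diagnosis: the `MSHTML::` smart pointers suggest that the project uses wrappers generated by `#import` from mshtml.tlb. In those wrappers `getAttribute` takes two arguments (name and flags) and returns a `_variant_t`, while the three-argument COM form is exposed as `raw_getAttribute`. Separately, "table" is a tag name and not an attribute, so the text of a table would normally be read through `innerText`. The untested sketch below follows the wrapper style of the code in the thread; `CollapseWhitespace` and `TableTexts` are hypothetical names, and the `#import` line is an assumption about how the sample project is set up.

```cpp
#include <cwctype>
#include <string>
#include <vector>

// Portable helper (hypothetical name): trims the text and collapses each run
// of whitespace to one space, so that a table's text fits on one list box line.
std::wstring CollapseWhitespace(const std::wstring& in)
{
    std::wstring out;
    bool pendingSpace = false;
    for (size_t i = 0; i < in.size(); ++i) {
        if (std::iswspace(in[i])) {
            pendingSpace = !out.empty();   // never emit leading whitespace
        } else {
            if (pendingSpace) out += L' ';
            pendingSpace = false;
            out += in[i];
        }
    }
    return out;                            // trailing whitespace is dropped
}

#ifdef _MSC_VER  // #import and the generated wrappers are Visual C++ specific
#import <mshtml.tlb> no_auto_exclude
#include <comdef.h>

// Hypothetical helper: returns the flattened text of every <table> element.
// The wrapper methods throw _com_error on failure instead of returning HRESULT.
std::vector<std::wstring> TableTexts(MSHTML::IHTMLDocument3Ptr pDoc3)
{
    std::vector<std::wstring> out;
    MSHTML::IHTMLElementCollectionPtr tables =
        pDoc3->getElementsByTagName(L"table");
    for (long i = 0; i < tables->length; ++i) {
        MSHTML::IHTMLElementPtr elem = tables->item(i, (long)0);
        if (elem == NULL) continue;
        _bstr_t text = elem->innerText;    // wrapper property, returns _bstr_t
        if (text.length() > 0)
            out.push_back(CollapseWhitespace((const wchar_t*)text));
    }
    return out;
}
#endif
```

Each returned string could then be passed to `m_wndLinksList.AddString` in place of the `getAttribute` call, with a conversion if the project is not built as Unicode.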