IHTMLDocument2 for Table Par [modified]

    NaveenHS wrote (#1):

    Hello everyone, this is my second post about extracting text from web pages. In my previous post, David Crow suggested that I use the IHTMLDocument2 interface. In the Code Project repository I found the article "Parsing HTML using MSHTML" by Philip Patrick, whose sample application extracts all the links in a web page by reading the HREF attribute of each A tag. Can I use the same application to extract the text from a table in a web page? I searched for a way to reach the table tag and extract its text, but did not find any information. Can anyone please tell me whether it is possible to extract the text inside a <table> tag with MSHTML through the IHTMLDocument2 interface? Thank you, Naveen HS.

    void CTestDlg::OnBgo()
    {
        UpdateData();
        CWaitCursor wait;
        if (m_csFilename.IsEmpty()) {
            AfxMessageBox(_T("Please specify the file to parse"));
            return;
        }
        CFile f;

        // let's open the file and read it into a CString (you can use any buffer to read, though)
        if (f.Open(m_csFilename, CFile::modeRead | CFile::shareDenyNone)) {
            m_wndLinksList.ResetContent();
            CString csWholeFile;
            f.Read(csWholeFile.GetBuffer(f.GetLength()), f.GetLength());
            csWholeFile.ReleaseBuffer(f.GetLength());
            f.Close();

            // declare our MSHTML variables and create a document
            MSHTML::IHTMLDocument2Ptr pDoc;
            MSHTML::IHTMLDocument3Ptr pDoc3;
            MSHTML::IHTMLElementCollectionPtr pCollection;
            MSHTML::IHTMLElementPtr pElement;

            HRESULT hr = CoCreateInstance(CLSID_HTMLDocument, NULL, CLSCTX_INPROC_SERVER,
                IID_IHTMLDocument2, (void**)&pDoc);

            // put the code into a SAFEARRAY and write it into the document
            SAFEARRAY* psa = SafeArrayCreateVector(VT_VARIANT, 0, 1);
            VARIANT *param;
            bstr_t bsData = (LPCTSTR)csWholeFile;
            hr = SafeArrayAccessData(psa, (LPVOID*)&param);
            param->vt = VT_BSTR;
            param->bstrVal = (BSTR)bsData;

            hr = pDoc->write(psa);
            hr = pDoc->close();

            SafeArrayDestroy(psa);

            // I'll use IHTMLDocument3 to retrieve tags. Note it is available only in IE5+.
            // If you don't want to use it, you can just run through all tags in the HTML
            // (IHTMLDocument2->all property)
            pDoc3 = pDoc;

            // display the HREF parameter of every link (A tag) in the ListBox
            pCollection = pDoc3->getElementsByTagName(L"A");
            for (long i = 0; i < pCollection->length; i++) {
                pElement = pCollection->item(i, (long)0);
                if(pElement !
    
      _Superman_ wrote (#2):

      You could use the IHTMLDocument2::get_all method to get an IHTMLElementCollection interface. You can then iterate through this collection and use IHTMLElement::get_tagName on each element to check whether it is a table.

      «_Superman_» I love work. It gives me something to do between weekends.
      Microsoft MVP (Visual C++)

        NaveenHS wrote (#3):

        Hello Sir, thank you very much for the response. I have added these lines to the above code:

        MSHTML::IHTMLDocument2 *pDoc1 = NULL;
        MSHTML::IHTMLElementCollection *pColl = NULL;
        pColl = pDoc1->get_all(L"table");
        int y = pColl->length;
        for(int x = 0; x < y ; x++) { }

        I am getting the following error:

        error C2664: 'MSHTML::IHTMLDocument2::get_all' : cannot convert parameter 1 from 'const wchar_t [6]' to 'MSHTML::IHTMLElementCollection **'

          _Superman_ wrote (#4):

          NaveenHS wrote:

          pColl = pDoc1->get_all(L"table");

          This is wrong. get_all gives you all elements, not just the tables, and it takes the address of an IHTMLElementCollection pointer as its parameter. As a general rule, COM calls in C++ return an HRESULT; AddRef and Release are exceptions, since they return a reference count. So you will have to check the tag name of each element in a loop to see whether it is a table. The get_all call would look like this:

          pDoc1->get_all(&pColl);

          I would recommend using ATL here. Otherwise you will have to remember to call Release for all these pointers. So my recommended way would look like this -

          CComPtr<MSHTML::IHTMLElementCollection> pColl = NULL;
          pDoc1->get_all(&pColl);

          You can use the IHTMLElementCollection::item method to get an IDispatch pointer to each element in the collection. This pointer can in turn be queried (with QueryInterface) for an IHTMLElement or IHTMLDOMNode interface. Check the methods of these interfaces to identify the table tags that you need.
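
          The steps above can be sketched roughly as follows. This is a minimal sketch, assuming the raw interfaces from mshtml.h together with ATL smart pointers, not the #import-generated MSHTML:: wrappers used elsewhere in this thread. The helper names isTableTag and collectTableText are hypothetical. The tag comparison ignores case because MSHTML usually reports tag names in upper case. The COM part is guarded with _WIN32 so that the helper on its own also builds with other compilers.

```cpp
#include <cstddef>
#include <cwchar>
#include <cwctype>
#include <string>
#include <vector>

// Hypothetical helper: case-insensitive check that a tag name is "table".
bool isTableTag(const wchar_t* tagName) {
    if (tagName == nullptr) return false;
    const wchar_t* expected = L"table";
    std::size_t i = 0;
    for (; expected[i] != L'\0'; ++i) {
        // Stops at the first mismatch, including the end of a shorter tagName.
        if (std::towlower(static_cast<wint_t>(tagName[i])) !=
            static_cast<wint_t>(expected[i])) {
            return false;
        }
    }
    return tagName[i] == L'\0';  // reject longer names such as "TABLES"
}

#ifdef _WIN32
#include <atlbase.h>
#include <mshtml.h>

// Hypothetical helper: appends the inner text of every <table> element to out.
// Assumes pDoc already holds a loaded document.
HRESULT collectTableText(IHTMLDocument2* pDoc, std::vector<std::wstring>& out) {
    if (pDoc == nullptr) return E_POINTER;

    CComPtr<IHTMLElementCollection> pAll;
    HRESULT hr = pDoc->get_all(&pAll);  // all elements, not only tables
    if (FAILED(hr)) return hr;
    if (!pAll) return E_FAIL;

    long count = 0;
    hr = pAll->get_length(&count);
    if (FAILED(hr)) return hr;

    for (long i = 0; i < count; ++i) {
        CComVariant index(i);  // VT_I4 index into the collection
        CComPtr<IDispatch> pDisp;
        if (FAILED(pAll->item(index, index, &pDisp)) || !pDisp) continue;

        CComQIPtr<IHTMLElement> pElem(pDisp);  // QueryInterface for IHTMLElement
        if (!pElem) continue;

        CComBSTR tag;
        if (FAILED(pElem->get_tagName(&tag)) || !isTableTag(tag.m_str)) continue;

        CComBSTR text;
        if (SUCCEEDED(pElem->get_innerText(&text)) && text.Length() > 0) {
            out.push_back(std::wstring(text.m_str, text.Length()));
        }
    }
    return S_OK;
}
#endif
```

          The smart pointers release every interface when they go out of scope, which is the reason for recommending ATL here. Every HRESULT is checked before the returned pointer is used.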

            NaveenHS wrote (#5):

            Hello Sir, thanks a lot for the reply. I made the changes you suggested, but I am still not able to extract the data from the table. Can you please tell me what change I have to make?

            MSHTML::IHTMLDocument2Ptr pDoc;
            MSHTML::IHTMLDocument3Ptr pDoc3;
            MSHTML::IHTMLElementCollectionPtr pCollection;
            MSHTML::IHTMLElementPtr pElement;

            HRESULT hr = CoCreateInstance(CLSID_HTMLDocument, NULL, CLSCTX_INPROC_SERVER,
                IID_IHTMLDocument2, (void**)&pDoc);

            pDoc3 = pDoc;

            pDoc->get_all(&pCollection);

            pCollection = pDoc3->getElementsByTagName(L"table");

            for(long i=0; i<pCollection->length; i++){
                pElement = pCollection->item(i, (long)0);
                if(pElement != NULL){
                    m_wndLinksList.AddString((LPCTSTR)bstr_t(pElement->getAttribute("table"),10));
                }
            }

            Error: error C2660: 'MSHTML::IHTMLElement::getAttribute' : function does not take 1 arguments

              _Superman_ wrote (#6):

              The syntax for getAttribute is not correct. Do it this way:

              CComBSTR name("table");
              VARIANT result;
              pElement->getAttribute(name, 0, &result);
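
              The three-argument call above matches the raw COM declaration, which returns an HRESULT and writes the value through a VARIANT out-parameter. If the MSHTML:: types come from #import, as the ...Ptr smart pointers in this thread suggest, the generated wrapper normally moves that out-parameter to the return value and throws on failure, so the same method would take two arguments. The portable mock below only illustrates the two calling styles. RawElement, WrappedElement, Result, kOk and kFail are hypothetical stand-ins, not the MSHTML API, and treating a missing attribute as a failure is a simplification for the example.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins; real code would use HRESULT, BSTR and VARIANT.
using Result = long;
const Result kOk = 0;
const Result kFail = -1;

// Raw style: the status code is returned and the value comes back
// through an out-parameter, so the call takes three arguments.
struct RawElement {
    std::map<std::wstring, std::wstring> attrs;

    Result getAttribute(const std::wstring& name, long /*flags*/,
                        std::wstring* out) const {
        if (out == nullptr) return kFail;
        auto it = attrs.find(name);
        if (it == attrs.end()) return kFail;  // simplification for this mock
        *out = it->second;
        return kOk;
    }
};

// Wrapper style: the out-parameter becomes the return value and a failed
// status becomes an exception, so the call takes two arguments.
struct WrappedElement {
    RawElement raw;

    std::wstring getAttribute(const std::wstring& name, long flags) const {
        std::wstring value;
        if (raw.getAttribute(name, flags, &value) != kOk) {
            throw std::runtime_error("getAttribute failed");
        }
        return value;
    }
};
```

              Under that assumption, a call shaped like pElement->getAttribute(name, 0) would be the one that matches the wrapper. Note also that "table" is a tag name rather than an attribute, so the text of a table is more likely to be reachable through the element's innerText property than through getAttribute.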

                NaveenHS wrote (#7):

                Hello Sir, I added the code, but I am still getting an error.

                pDoc->get_all(&pCollection);

                pCollection = pDoc3->getElementsByTagName("table");

                CComBSTR name("table");
                VARIANT result;

                for(long i=0; i<pCollection->length; i++){
                    pElement = pCollection->item(i, (long)0);
                    if(pElement != NULL){
                        m_wndLinksList.AddString((LPCTSTR)bstr_t(pElement->getAttribute(name, 0, &result)));
                    }
                }

                I am getting the error below:

                error C2660: 'MSHTML::IHTMLElement::getAttribute' : function does not take 3 arguments
