Code Project
HTML Table tag text Scraping

    NaveenHS wrote (#1):

    Hello everyone,

    I have a simple static HTML file containing a table, and I am trying to extract the contents of the table to a file. Can anyone please suggest how to proceed with this?

    HTML file:

    <html>
    <body>
    <table border="1">
    <tr> <th>Team Name</th> <th>Place</th> </tr>
    <tr> <td>Kings XI Punjab</td> <td>Punjab</td> </tr>
    <tr> <td>Chennai Super Kings</td> <td>Chennai</td> </tr>
    <tr> <td>Deccan Chargers</td> <td>Hydrabad</td> </tr>
    </table>
    </body>
    </html>

    I am trying to extract the text with this code:

    using System.Diagnostics;
    using System.Net;
    using System.Collections.Generic;
    using System.Text.RegularExpressions;
    using System;
    using System.IO;
    using System.Linq;

    class Program
    {
    static void Main()
    {
    StreamReader str = new StreamReader("C:\\Sample1.html");
    string strLings = str.ReadToEnd();
    int startIndex = strLings.IndexOf("<table>");
    int endInedx = strLings.IndexOf("</table>") + "</table>".Length - startIndex;
    string strTab = strLings.Substring(startIndex, endInedx);
    str.Close();
    StreamWriter strWr = new StreamWriter("C:\\test2.txt", true);
    strWr.Write(strTab);
    strWr.Close();
    }
    }

    The problem is that I am getting this error: An unhandled exception of type 'System.ArgumentOutOfRangeException' occurred in mscorlib.dll. Additional information: StartIndex cannot be less than zero.
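For readers following along, the exception can be reproduced without any file I/O. The sketch below is illustrative only (the class and method names are made up, not from the post). It shows that `IndexOf` returns -1 when the exact text `<table>` does not occur, which is the case here because the opening tag is `<table border="1">`. Passing that -1 to `Substring` is what raises the ArgumentOutOfRangeException.

```csharp
using System;

static class IndexOfDemo
{
    // Hypothetical helper: index of the literal "<table>", or -1 when it is absent.
    public static int FindExactOpeningTag(string html)
    {
        return html.IndexOf("<table>", StringComparison.Ordinal);
    }

    static void Main()
    {
        string html = "<html><body><table border=\"1\"><tr><td>x</td></tr></table></body></html>";
        int startIndex = FindExactOpeningTag(html);

        // The tag carries an attribute, so the literal "<table>" never matches.
        Console.WriteLine(startIndex); // prints -1

        if (startIndex < 0)
        {
            Console.WriteLine("Opening tag not found; Substring(startIndex, ...) would throw.");
        }
    }
}
```

Checking the result of `IndexOf` before calling `Substring` turns the crash into a condition the program can report.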

      Luc Pattyn wrote (#2):

      As you already figured out, a table starts and ends with a TABLE tag and holds rows delimited by TR tags, which in turn hold cells delimited by TD (or TH) tags. You did make two mistakes:

      1. Tags can be upper- or lower-case in HTML (XHTML requires lower-case).
      2. Opening tags may contain extra information (see the border="1" attribute in your example).

      So code accordingly, either by providing your own GetDelimitedSubstring() method (similar to what you already have) or by using Regex. :)
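A minimal sketch of what such a GetDelimitedSubstring() method might look like, assuming a single, non-nested table in simple static markup. The reply names the method but gives no signature, so the class name, parameters, and null-on-failure behaviour here are assumptions. It addresses both points: the search ignores case, and it looks for `<table` without the closing `>` so attributes are tolerated.

```csharp
using System;

static class TableExtractor
{
    // Hypothetical helper: returns the text from the first "<tagName" up to and
    // including the next "</tagName>", or null when either tag is missing.
    public static string GetDelimitedSubstring(string html, string tagName)
    {
        // Omit the closing '>' so that attributes such as border="1" are allowed,
        // and ignore case so <TABLE> and <table> are both found.
        int start = html.IndexOf("<" + tagName, StringComparison.OrdinalIgnoreCase);
        if (start < 0) return null;

        string closeTag = "</" + tagName + ">";
        int close = html.IndexOf(closeTag, start, StringComparison.OrdinalIgnoreCase);
        if (close < 0) return null;

        // Substring takes (startIndex, length), so the length is measured from start.
        return html.Substring(start, close + closeTag.Length - start);
    }

    static void Main()
    {
        string html = "<html><body><TABLE border=\"1\"><tr><td>Punjab</td></tr></TABLE></body></html>";
        Console.WriteLine(GetDelimitedSubstring(html, "table"));
    }
}
```

Note the limits of this approach: it would also match a tag whose name merely begins with the given text, and it does not handle nested tables.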

      Luc Pattyn [Forum Guidelines] [Why QA sucks] [My Articles] Nil Volentibus Arduum

      Please use <PRE> tags for code snippets, they preserve indentation, and improve readability.

        NaveenHS wrote (#3):

        Thanks, Luc Pattyn, for the response. I corrected the mistakes as follows.

        HTML file:

        <html>
        <body>
        <table>
        <tr> <th>Team Name</th> <th>Place</th> </tr>
        <tr> <td>Kings XI Punjab</td> <td>Punjab</td> </tr>
        <tr> <td>Chennai Super Kings</td> <td>Chennai</td> </tr>
        <tr> <td>Deccan Chargers</td> <td>Hydrabad</td> </tr>
        </table>
        </body>
        </html>

        Program:

        static void Main()
        {
            StreamReader str = new StreamReader("C:\\Test.html");
            string strLings = str.ReadToEnd();
            int startIndex = strLings.IndexOf("<table>");
            int endInedx = strLings.IndexOf("</table>") + "</table>".Length - startIndex;
            string strTab = strLings.Substring(1, endInedx);
            str.Close();
            StreamWriter strWr = new StreamWriter("C:\\test2.txt", true);
            strWr.Write(strTab);
            strWr.Close();
            Console.ReadLine();
        }

        The output written to the file is shown below. It is not the required output. :( I will try with a regular expression now.

        html> <body> <table> <tr> <th>Team Name</th> <th>Place</th> </tr> <tr> <td>Kings XI Punjab</td> <td>Punjab</td> </tr> <tr> <td>Chennai Super Kings</td> <td>Chennai</td> </tr> <tr> <td>Deccan Chargers</td> <td>Hydrabad</td> </tr
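The truncated output follows from `Substring(1, endInedx)`: extraction starts at index 1 instead of at `startIndex`, so the leading `<` is dropped and the result stops short of the end of `</table>`. For the regular-expression route mentioned in this post, here is a hedged sketch that pulls the text out of each TD or TH cell. The class name, method name, and pattern are illustrative assumptions, and a regex like this only suits simple, well-formed static markup.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class CellTextExtractor
{
    // Matches <td ...>text</td> or <th ...>text</th>. The backreference \1
    // requires the closing tag to be the same kind as the opening one.
    // IgnoreCase handles <TD> vs <td>; Singleline lets '.' span line breaks.
    static readonly Regex CellPattern = new Regex(
        @"<(td|th)\b[^>]*>(.*?)</\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Hypothetical helper: returns the trimmed inner text of every cell, in document order.
    public static List<string> ExtractCells(string html)
    {
        var cells = new List<string>();
        foreach (Match m in CellPattern.Matches(html))
        {
            cells.Add(m.Groups[2].Value.Trim());
        }
        return cells;
    }

    static void Main()
    {
        string html = "<table border=\"1\"><tr><th>Team Name</th><th>Place</th></tr>" +
                      "<TR><TD>Kings XI Punjab</TD><TD>Punjab</TD></TR></table>";
        Console.WriteLine(string.Join(", ", ExtractCells(html)));
    }
}
```

For markup that is nested, malformed, or not under your control, a dedicated HTML parser is likely to be more robust than a regex.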

          ragnaroknrol wrote (#4):

          This question is an example of what every question here should be like. You posted the problem, gave relevant data on the situation, and posted code showing what you had tried. You asked for assistance, and you followed up with your new attempt, its results, and your reaction to them. I wish every poster in this forum did as well. A well-earned 5, and I hope your regex works. I still get headaches from them. Good luck.

          If I have accidentally said something witty, smart, or correct, it is purely by mistake and I apologize for it.

            PIEBALDconsult wrote (#5):

            Please refer to this[^] post. :-D

              Luc Pattyn wrote (#6):

              Are you kidding? I mentioned two mistakes in his code; he fixed neither of them and instead changed the input data. And my sig (which, as almost always, tells people to use PRE tags) got completely ignored too. I wonder why I'm still replying to questions. PS: I wish the pinning bug would finally get fixed; comparing two posts is pretty hard right now. :)

                Not Active wrote (#7):

                Maybe this will help, Screen Scraping with C# for ASP.NET[^]


                I know the language. I've read a book. - _Madmatt
