HTML Table tag text Scraping
-
Hello everyone,

I have a simple static HTML file containing a table, and I am trying to extract the contents of the table to a file. Can anyone please suggest how to proceed with this?

HTML file:
<html>
<body>
<table border="1">
<tr> <th>Team Name</th> <th>Place</th> </tr>
<tr> <td>Kings XI Punjab</td> <td>Punjab</td> </tr>
<tr> <td>Chennai Super Kings</td> <td>Chennai</td> </tr>
<tr> <td>Deccan Chargers</td> <td>Hydrabad</td> </tr>
</table>
</body>
</html>

I am trying to extract the text with this code:
using System.Diagnostics;
using System.Net;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System;
using System.IO;
using System.Linq;

class Program
{
    static void Main()
    {
        StreamReader str = new StreamReader("C:\\Sample1.html");
        string strLings = str.ReadToEnd();
        int startIndex = strLings.IndexOf("<table>");
        int endInedx = strLings.IndexOf("</table>") + "</table>".Length - startIndex;
        string strTab = strLings.Substring(startIndex, endInedx);
        str.Close();
        StreamWriter strWr = new StreamWriter("C:\\test2.txt", true);
        strWr.Write(strTab);
        strWr.Close();
    }
}

The problem is that I am getting an error:
An unhandled exception of type 'System.ArgumentOutOfRangeException' occurred in mscorlib.dll
Additional information: StartIndex cannot be less than zero.
-
As you already figured out, a table starts and ends with a TABLE tag and holds rows delimited by TR tags, which in turn hold cells delimited by TD (or TH) tags. You did make two mistakes:
1. Tags can be upper- or lower-case in HTML (XHTML requires lower-case).
2. Opening tags may contain extra information (see the border="1" attribute in your example).
So code accordingly, either by providing your own GetDelimitedSubstring() method (similar to what you already have) or by using Regex. :)
Luc Pattyn [Forum Guidelines] [Why QA sucks] [My Articles] Nil Volentibus Arduum
Please use <PRE> tags for code snippets, they preserve indentation, and improve readability.
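To make the two points concrete: in the posted file the opening tag is <table border="1">, so IndexOf("<table>") finds nothing and returns -1, which is what Substring then rejects. Below is a minimal sketch of what a GetDelimitedSubstring() helper might look like. Only the method name comes from the reply above; the signature, the TableExtractor class name, and the choice to return an empty string when nothing is found are assumptions made for illustration.

```csharp
using System;

static class TableExtractor
{
    // Hypothetical helper: returns the text from the opening tag through the
    // closing tag, or an empty string if either tag is missing.
    public static string GetDelimitedSubstring(string html, string tagName)
    {
        // Search for "<table" without the closing '>' so that attributes such
        // as border="1" are allowed, and ignore case so <TABLE> also matches.
        int start = html.IndexOf("<" + tagName, StringComparison.OrdinalIgnoreCase);
        if (start < 0) return string.Empty;

        string closeTag = "</" + tagName + ">";
        int end = html.IndexOf(closeTag, start, StringComparison.OrdinalIgnoreCase);
        if (end < 0) return string.Empty;

        // Substring takes (startIndex, length), so the length is measured from start.
        return html.Substring(start, end + closeTag.Length - start);
    }

    static void Main()
    {
        string html = "<html><body><TABLE border=\"1\"><tr><td>Punjab</td></tr></Table></body></html>";
        // prints: <TABLE border="1"><tr><td>Punjab</td></tr></Table>
        Console.WriteLine(GetDelimitedSubstring(html, "table"));
    }
}
```

This is deliberately simple: it would also match a tag whose name merely starts with "table", and it does not handle nested tables. For anything beyond a small static file, a real HTML parser is likely the safer choice.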
-
Thanks, Luc Pattyn, for the response. I corrected the mistakes as follows.

HTML file:
<html>
<body>
<table>
<tr> <th>Team Name</th> <th>Place</th> </tr>
<tr> <td>Kings XI Punjab</td> <td>Punjab</td> </tr>
<tr> <td>Chennai Super Kings</td> <td>Chennai</td> </tr>
<tr> <td>Deccan Chargers</td> <td>Hydrabad</td> </tr>
</table>
</body>
</html>

Program:
static void Main()
{
    StreamReader str = new StreamReader("C:\\Test.html");
    string strLings = str.ReadToEnd();
    int startIndex = strLings.IndexOf("<table>");
    int endInedx = strLings.IndexOf("</table>") + "</table>".Length - startIndex;
    string strTab = strLings.Substring(1, endInedx);
    str.Close();
    StreamWriter strWr = new StreamWriter("C:\\test2.txt", true);
    strWr.Write(strTab);
    strWr.Close();
    Console.ReadLine();
}

The output written to the file is shown below. It is not the required output. :( I will try with a regular expression now.

html> <body> <table> <tr> <th>Team Name</th> <th>Place</th> </tr> <tr> <td>Kings XI Punjab</td> <td>Punjab</td> </tr> <tr> <td>Chennai Super Kings</td> <td>Chennai</td> </tr> <tr> <td>Deccan Chargers</td> <td>Hydrabad</td> </tr
-
This question is an example of what every question here should be like. You posted the problem, gave the relevant data on the situation, and posted code showing what you had tried. You asked for assistance, and you followed up with your new attempt, the results, and your reaction to those results. I wish every poster in this forum did as well. A well-earned 5, and I hope your regex works. I still get headaches from them. Good luck.
If I have accidentally said something witty, smart, or correct, it is purely by mistake and I apologize for it.
-
Are you kidding? I mentioned two mistakes in his code; he fixed neither of them and changed the input data instead. And my sig (which, as almost always, tells people to use PRE tags) got completely ignored too. I wonder why I'm still replying to questions.

PS: I wish the pinning bug would finally get fixed; comparing two posts is pretty hard right now. :)
Luc Pattyn [Forum Guidelines] [Why QA sucks] [My Articles] Nil Volentibus Arduum
Please use <PRE> tags for code snippets, they preserve indentation, and improve readability.
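For the Regex route mentioned earlier in the thread, here is a rough sketch that addresses both points in code instead of by editing the HTML: it ignores tag case and tolerates attributes on the opening tags. (Separately, the follow-up program passes 1 to Substring where startIndex was intended, which is why its output begins at "html>".) The TableRegexSketch class and ExtractRows method are hypothetical names, and the patterns assume a simple, well-formed table with no nesting.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TableRegexSketch
{
    // Returns one list of cell texts per <tr>, accepting <th> or <td> cells,
    // in any letter case and with any attributes on the opening tags.
    public static List<List<string>> ExtractRows(string html)
    {
        const RegexOptions opts = RegexOptions.IgnoreCase | RegexOptions.Singleline;
        var rows = new List<List<string>>();

        foreach (Match row in Regex.Matches(html, @"<tr\b[^>]*>(.*?)</tr>", opts))
        {
            var cells = new List<string>();
            // Group 1 of the row match is everything between <tr ...> and </tr>.
            foreach (Match cell in Regex.Matches(row.Groups[1].Value, @"<t[hd]\b[^>]*>(.*?)</t[hd]>", opts))
            {
                cells.Add(cell.Groups[1].Value.Trim());
            }
            rows.Add(cells);
        }
        return rows;
    }

    static void Main()
    {
        string html = "<TABLE border=\"1\"> <tr> <th>Team Name</th> <th>Place</th> </tr> " +
                      "<tr> <td>Kings XI Punjab</td> <td>Punjab</td> </tr> </TABLE>";

        // Writes one tab-separated line per table row.
        foreach (var cells in ExtractRows(html))
        {
            Console.WriteLine(string.Join("\t", cells));
        }
    }
}
```

To write the result to a file as in the original question, the Console.WriteLine call could be swapped for a StreamWriter. Regular expressions cope with a small static table like this one, but they become fragile on nested tables or malformed markup.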
-
Maybe this will help: Screen Scraping with C# for ASP.NET
I know the language. I've read a book. - _Madmatt