Parsing a web page to get just the <p> inner text.
-
I have a method that should extract the text between paragraph tags, but I am also getting CSS text and JavaScript code. Here is my code (be kind, I am self-taught).
private static string GetParagraphs(string webPage)
{
    string subWebPage = webPage;
    int subWebPageStartIndex = 0;
    string paragraph = "";
    string paragraphs = "";
    int startIndex = 0;
    int endIndex = 0;
    while (subWebPageStartIndex < webPage.LastIndexOf(""))
    {
        subWebPage = webPage.Substring(subWebPageStartIndex);
        startIndex = subWebPage.IndexOf("<p>") + 3 + subWebPageStartIndex;
        endIndex = subWebPage.IndexOf("</p>") + subWebPageStartIndex;
        paragraph = webPage.Substring(startIndex, endIndex);
        paragraphs = paragraphs + " " + paragraph; //TODO: Refactor to use StringBuilder class.
        subWebPageStartIndex = endIndex + 4;
        Debug.WriteLine(paragraph);
    }
    return paragraphs;
}
Maybe you can see where I have messed up. Thank you for taking the time to read this.
Frazzle the name says it all
Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.
John F. Woods
-
What you have done is BAD. The ideal way to handle this is to use an HTML parser and traverse the DOM to get the text that you need. Look at the HTML Agility Pack project. If you are sure that you will always have well-formed input, you could do this with regular expressions. Here is a working example.
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static List<string> GetAllParagraphValues(string input)
{
    List<string> values = new List<string>();
    // \b keeps <p> from also matching <pre> or <param>; Singleline lets
    // .*? span line breaks inside a paragraph.
    Regex r = new Regex(@"<p\b[^>]*>(?<value>.*?)</p>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);
    foreach (Match match in r.Matches(input))
    {
        // The named group holds the text between the opening and closing tags.
        values.Add(match.Groups["value"].Value);
    }
    return values;
}
Best wishes, Navaneeth
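For the parser route mentioned above, here is a minimal sketch of what it might look like with HTML Agility Pack. It assumes the HtmlAgilityPack NuGet package is referenced in the project; the class name ParagraphExtractor and the method name GetParagraphTexts are made up for illustration and are not part of the library.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

// Hypothetical helper class, named only for this example.
public static class ParagraphExtractor
{
    public static List<string> GetParagraphTexts(string html)
    {
        var texts = new List<string>();

        // Parse the markup into a DOM instead of searching the raw string.
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Select every <p> element; SelectNodes returns null when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//p");
        if (nodes == null)
        {
            return texts;
        }

        foreach (HtmlNode node in nodes)
        {
            // InnerText drops nested tags; DeEntitize turns &amp; back into &.
            texts.Add(HtmlEntity.DeEntitize(node.InnerText).Trim());
        }
        return texts;
    }
}
```

Because only p elements are selected, the contents of style and script blocks elsewhere on the page should not end up in the result.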
-
N a v a n e e t h wrote:
What you have done is BAD.
I knew this; it looks bad and did not work. :( I am learning slowly, and style will come in time.
N a v a n e e t h wrote:
If you are sure that you will always have a wellformed input, you could easily do this with regular expressions
Where can I learn about regular expressions? Thank you.
Frazzle the name says it all