Reading HTML File
-
I have to read HTML files in a project. I am using StreamReader to do that, but when I read the document I get all the tags along with the content. Is there any way to read only the text (what is displayed when you view the page in a web browser) rather than the whole source?
o O º(`'·.,(`'·., ☆,.·''),.·'')º O o° »·'"`»* *☆ t4ure4n ☆* *«·'"`« °o O º(,.·''(,.·'' ☆`'·.,)`'·.,)º O o°
-
Maybe this will help you:
using System.Text.RegularExpressions;

public string StripHTML(ref string source)
{
    // Flatten line breaks so the patterns below can match across lines.
    string result = source.Replace("\r", " ");
    result = result.Replace("\n", " ");

    // Normalize opening and closing script tags to bare <script> and </script>.
    result = Regex.Replace(result, @"<( )*script([^>])*>", "<script>", RegexOptions.IgnoreCase);
    result = Regex.Replace(result, @"(<( )*(/)( )*script( )*>)", "</script>", RegexOptions.IgnoreCase);

    // Remove each script block together with its contents.
    result = Regex.Replace(result, @"(<script>).*?(</script>)", "", RegexOptions.IgnoreCase);

    // Remove every remaining tag.
    result = Regex.Replace(result, "<[^>]*>", "");

    // technicalStopWordArrayList is a field defined elsewhere in the class.
    for (int count = 0; count < technicalStopWordArrayList.Count; count++)
    {
        result = result.Replace(technicalStopWordArrayList[count].ToString(), " ");
    }

    result = result.Replace("&", " ");
    return result.Trim();
}
-
Thanks, it works fine, but I have one more question. Is it possible to preserve the hrefs? I would have tried it myself, but I don't know anything about regular expressions, so I have to rely on you. I just commented this part out because I don't know what it is:
for (int count = 0; count < technicalStopWordArrayList.Count; count++) { result = result.Replace(technicalStopWordArrayList[count].ToString(), " "); }
-
Oops! Sorry, that loop was part of my own code where I used this function. You don't need it, so comment it out ;)
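On the href question: one possible approach, as a minimal sketch rather than a tested solution, is to rewrite each link as its text followed by the URL in parentheses before the remaining tags are stripped. The names HtmlText and StripHtmlKeepLinks are made up for this example, and the pattern assumes the href value is quoted and contains no quote characters itself.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Whole script blocks, including their contents.
    private static readonly Regex Script = new Regex(
        @"<script\b[^>]*>.*?</script\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // <a ... href="url" ...>text</a>; group 1 is the URL, group 2 the link text.
    // Assumption: the href value is wrapped in single or double quotes.
    private static readonly Regex Anchor = new Regex(
        @"<a\b[^>]*?\bhref\s*=\s*[""']([^""']*)[""'][^>]*>(.*?)</a\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Any remaining tag.
    private static readonly Regex Tag = new Regex(@"<[^>]*>");

    public static string StripHtmlKeepLinks(string source)
    {
        // Flatten line breaks, as in the function posted earlier in the thread.
        string result = source.Replace("\r", " ").Replace("\n", " ");

        // Drop scripts first so their contents never reach the output.
        result = Script.Replace(result, "");

        // Rewrite each link as "text (url)" while the tags are still present.
        result = Anchor.Replace(result, "$2 ($1)");

        // Strip whatever tags are left, including any inside the link text.
        result = Tag.Replace(result, "");

        return result.Trim();
    }
}
```

With this sketch, a link such as `<a href="page.html">Home</a>` would come out as `Home (page.html)`. Regular expressions only go so far with real-world HTML, so for messy pages a proper HTML parser is likely the safer route.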