Reading HTML File

t4ure4n

I have to read HTML files in a project. I am using StreamReader to do that, but when I read the document I get all the tags etc. along with it. Is there any way to read only the data (what is displayed when you view the page in a web browser) rather than the whole source?

o O º(`'·.,(`'·., ☆,.·''),.·'')º O o° »·'"`»* *☆ t4ure4n ☆* *«·'"`« °o O º(,.·''(,.·'' ☆`'·.,)`'·.,)º O o°

Manas Bhardwaj

Maybe this will help you:

// Requires: using System.Text.RegularExpressions;
public string StripHTML(string source)
{
    // Replace line breaks with spaces so the patterns below can match across lines.
    string result = source.Replace("\r", " ").Replace("\n", " ");

    // Remove <script> blocks together with their contents.
    result = Regex.Replace(result,
        @"<\s*script[^>]*>.*?<\s*/\s*script\s*>", "",
        RegexOptions.IgnoreCase);

    // Remove all remaining tags.
    result = Regex.Replace(result, "<[^>]*>", "");

    // technicalStopWordArrayList is defined elsewhere (a project-specific list of words to blank out).
    for (int count = 0; count < technicalStopWordArrayList.Count; count++)
    {
        result = result.Replace(technicalStopWordArrayList[count].ToString(), " ");
    }
    result = result.Replace("&", " ");

    return result.Trim();
}

t4ure4n

Thanks, it works fine, but I have one more question: is it possible to preserve hrefs? I would have tried it myself, but I don't know anything about regular expressions, so I have to rely on you. I just commented this out because I don't know what it is:

for (int count = 0; count < technicalStopWordArrayList.Count; count++)
{
    result = result.Replace(technicalStopWordArrayList[count].ToString(), " ");
}


Manas Bhardwaj

Oops! Sorry, that loop is from my own code, where I used it. You don't need it. Comment it out ;)

t4ure4n

Thanks. Just one question: is it possible to preserve hrefs (hyperlinks)? If yes, how?

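
The question about preserving hrefs is left unanswered in the thread. One possible approach, offered only as a rough sketch: rewrite each link as "text (url)" before the remaining tags are stripped. The class name HtmlText and the method StripHtmlKeepLinks are hypothetical names invented for this example, and the "text (url)" output format is an assumption about what "preserve" should mean here. Regular expressions cope with simple, well-formed markup only; an HTML parser is likely the safer choice for messy real-world pages.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Hypothetical helper (not from the thread): strips tags but keeps link targets.
    public static string StripHtmlKeepLinks(string source)
    {
        // Replace line breaks with spaces so the patterns below can match across lines.
        string result = source.Replace("\r", " ").Replace("\n", " ");

        // Drop <script> and <style> blocks together with their contents.
        result = Regex.Replace(result,
            @"<\s*(script|style)\b[^>]*>.*?<\s*/\s*\1\s*>", "",
            RegexOptions.IgnoreCase);

        // Rewrite <a href="url">text</a> as "text (url)" before the tags are removed.
        // Assumes the href value is quoted with single or double quotes.
        result = Regex.Replace(result,
            @"<a\b[^>]*?\bhref\s*=\s*[""']([^""']*)[""'][^>]*>(.*?)</a\s*>",
            "$2 ($1)",
            RegexOptions.IgnoreCase);

        // Remove every remaining tag.
        result = Regex.Replace(result, "<[^>]*>", "");

        // Decode entities such as &amp; and collapse runs of whitespace.
        result = WebUtility.HtmlDecode(result);
        return Regex.Replace(result, @"\s+", " ").Trim();
    }
}
```

With a file on disk, a call could look like `HtmlText.StripHtmlKeepLinks(System.IO.File.ReadAllText("page.html"))`, where "page.html" is a placeholder path. Decoding entities with WebUtility.HtmlDecode is a design choice that differs from the snippet earlier in the thread, which simply replaces "&" with a space.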