Stripping out HTML code
-
Hi, (Using C# in ASP.NET) I have a string that contains about 7K of HTML-formatted text. I need to strip out all of the HTML codes and convert the &nbsp; and &gt; etc. codes to spaces and > etc. Now, I can write this myself, but I'm guessing that there is already a bit of well-worn code out there somewhere that does this, OR it is already built into the ASP.NET library (but I can't find it). Anybody have any pointers for me? The reason I'm doing this is that I wish to search this chunk of HTML for certain words, and I don't want to search the HTML codes but just what the user sees. Is there a more elegant approach than the one I'm taking? Thanks! -- modified at 19:48 Wednesday 28th December, 2005
-
My first guess would be to make a regular expression that searches for an opening
< _AnyCharactersUntilFirst_ >
and removes the match. This would strip the HTML tags. You will probably have to handle some special cases, though, for example JavaScript tags.
Many people spend their life going to sleep when they’re not sleepy and waking up while they still are.
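The JavaScript case mentioned above could be handled by removing whole script elements before stripping the remaining tags, since script content is not text the user sees and may itself contain "<" characters. This is only a minimal sketch: the class name ScriptStripper and the method name RemoveScripts are made up for illustration, and it assumes script blocks are closed with a plain </script> tag.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class ScriptStripper
{
    // "<script ...>", then as little as possible, then "</script>".
    // Singleline lets "." match newlines inside the script body.
    private static readonly Regex ScriptBlock = new Regex(
        @"<script\b[^>]*>.*?</script\s*>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns the input with every script element (tags and body) removed.
    public static string RemoveScripts(string html)
    {
        return ScriptBlock.Replace(html, "");
    }
}
```

If the HTML contains no script, as in this thread, this step can be skipped.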
-
Thanks Mircea!! That will work because I don't have any script in the html. I'm reading through the Regex documentation and I can't find the syntax to use for the
< ??? >
part of the functions. How do you find <*>? Or how do you specify <*> in a regular expression? Thanks again for your help - much appreciated. -- modified at 20:53 Wednesday 28th December, 2005
-
This might also work. Everything in HTML is a tag, so if you load the document into an XmlDocument you can skip to what you want with SelectSingleNode, or grab multiple tags with SelectNodes. Node.InnerText will then give you the data without the tags. That might get you closer to your end result without having to write tons of homemade logic. You can view the DOM in your Locals window.
XmlDocument x = new XmlDocument();
x.Load(filename);
x.Select....
1 line of code equals many bugs. So don't write any!!
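A rough sketch of the XmlDocument idea above, under one big assumption: the markup must be well-formed XML (XHTML-style). Ordinary HTML with unclosed tags, or with entities such as &nbsp; that XML does not predefine, makes the loader throw an XmlException. The class name XhtmlText, the method name GetInnerText and the XPath in the usage line are invented for illustration.

```csharp
using System.Xml;

// Hypothetical helper, not part of any library.
public static class XhtmlText
{
    // Loads well-formed markup from a string and returns the text of the
    // first node matching the XPath, with all tags removed.
    public static string GetInnerText(string xhtml, string xpath)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xhtml);                      // throws XmlException on malformed input
        XmlNode node = doc.SelectSingleNode(xpath);
        return node == null ? "" : node.InnerText;
    }
}
```

For example, GetInnerText("<html><body><p>Hello <b>world</b>, 1 &lt; 2</p></body></html>", "/html/body/p") gives the paragraph text with the b tag dropped and &lt; already decoded.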
-
using System.Text.RegularExpressions;

Regex r;
Match m;
// Regular expression for catching: < AnyCharUntil >
r = new Regex(@"<(?<1>[^>]*)>",
    RegexOptions.Singleline | RegexOptions.IgnoreCase /* | RegexOptions.Compiled */);
// input is the string in which to search.
for (m = r.Match(input); m.Success; m = m.NextMatch())
{
    // m.Groups[0] references the entire match, e.g. "<exampletag exampletext>"
    // m.Groups[1] references backreference 1, e.g. "exampletag exampletext"
    input = input.Replace(m.Groups[0].ToString(), "");
}
regards, Mircea
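For the entity half of the original question (&nbsp;, &gt; and so on), one possible sketch is to strip the tags with a single Regex.Replace call and then decode the entities with the framework's HtmlDecode. This uses System.Net.WebUtility.HtmlDecode so that it runs outside a web project; in ASP.NET, System.Web.HttpUtility.HtmlDecode plays the same role. The class name HtmlText and both method names are hypothetical, and the pattern is the same simple "<" up to the first ">" idea as above, so it shares its limits (script blocks, ">" inside attribute values).

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class HtmlText
{
    // "<", then any characters that are not ">", then ">".
    private static readonly Regex TagPattern =
        new Regex(@"<[^>]*>", RegexOptions.Singleline);

    // Removes every tag in one pass.
    public static string StripTags(string html)
    {
        return TagPattern.Replace(html, "");
    }

    // Strips tags, decodes entities, and turns the non-breaking space
    // that &nbsp; decodes to (U+00A0) into an ordinary space.
    public static string ToVisibleText(string html)
    {
        string withoutTags = StripTags(html);
        string decoded = WebUtility.HtmlDecode(withoutTags);
        return decoded.Replace('\u00A0', ' ');
    }
}
```

The tags are stripped before decoding on purpose: decoding first would turn &lt; and &gt; into real angle brackets that the tag pattern could then mistake for markup. The result of ToVisibleText can be searched with IndexOf or another regular expression.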