Regular expression

nesfrank

I currently have some C# code that runs searches using regular expressions. I have a function, shown below (1), that returns a regular expression used to strip HTML tags from a string. This function works just fine.

Recently a bug turned up that seems to be related to the following: between words there can be multiple spaces and/or multiple &nbsp; entities, so if the user searches for "this is", the search does not give the right matches when "this" and "is" are separated by multiple spaces, a combination of &nbsp; and spaces, or carriage returns.

As a test I wrote the code below, which works:

    string str = "This is a test sentence.";
    str = Regex.Replace(str, @"&nbsp;", " "); // Remove nbsp
    str = Regex.Replace(str, @"\s+", " ");    // Remove duplicate spaces

But how can I do the above in one statement? And how can I make that logic part of the regular expression below that strips the HTML? Please help.

1. Function that strips HTML:

    public static Regex GetRegExpStripHTML()
    {
        Regex r = new Regex(@"(<\/?)(?i:(?<element>a(bbr|cronym|ddress|pplet|rea)?|b(ase(f" +
            @"ont)?|do|ig|lockquote|ody|r|utton)?|c(aption|enter|ite|(o(de" +
            @"|l(group)?)))|d(d|el|fn|i(r|v)|l|t)|em|f(ieldset|o(nt|rm)|ra" +
            @"me(set)?)|h([1-6]|ead|r|tml)|i(frame|mg|n(put|s)|sindex)?|kb" +
            @"d|l(abel|egend|i(nk)?)|m(ap|e(nu|ta))|no(frames|script)|o(bj" +
            @"ect|l|pt(group|ion))|p(aram|re)?|q|s(amp|cript|elect|mall|pa" +
            @"n|t(r(ike|ong)|yle)|u(b|p))|t(able|body|d|extarea|foot|h|itl" +
            @"e|r|t)|u(l)?|var))(\s(?<attr>.+?))*>");
        return r;
    }

Daniel Grunwald

A regex just searches for a pattern. Assuming your strip regex is used to replace HTML tags with the empty string, you cannot use that same replacement to turn anything into a space. You can, however, combine your two whitespace patterns like this:

    str = Regex.Replace(str, @"(&nbsp;|\s)+", " ");

Also, your GetRegExpStripHTML doesn't work. It strips only a few well-formed known tags, and that's not enough to prevent cross-site scripting exploits. What about <SCRIPT> or < SCRIPT> ? What about tags you forgot, like <BODY onload="...">? What about encoding the characters using some far-east codepage that your app doesn't understand? The browser's codepage auto-detection might detect the codepage and execute the scripts. What about null bytes like <SCR\0IPT>? Your regex won't see the script tag, but Internet Explorer still does. What about any of a huge number of other tricks to evade XSS filters?

You need to encode every <, > and &. A blacklist won't get you anywhere, because browsers have lots of ways to execute code that you have never heard about. And even encoding isn't 100% safe once charset tricks come into play. See http://ha.ckers.org/xss.html to get an idea of what kinds of attacks on XSS filters are possible.
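As a rough illustration of the two points above, here is a minimal sketch. The class and method names (WhitespaceExample, CollapseWhitespace, EncodeForOutput) are made up for this example, and using System.Net.WebUtility.HtmlEncode for the encoding step is an assumption about what fits the application; any equivalent encoder would serve the same purpose.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class WhitespaceExample
{
    // Matches one or more consecutive items, each being either the literal
    // entity "&nbsp;" or a single whitespace character (space, tab, CR, LF...).
    private static readonly Regex NbspOrWhitespace = new Regex(@"(&nbsp;|\s)+");

    // Collapses every such run into a single space.
    public static string CollapseWhitespace(string input)
    {
        return NbspOrWhitespace.Replace(input, " ");
    }

    // Encodes characters such as <, > and & as HTML entities, so that the
    // text is displayed literally instead of being interpreted as markup.
    // (Assumption: WebUtility.HtmlEncode is acceptable for this application.)
    public static string EncodeForOutput(string input)
    {
        return WebUtility.HtmlEncode(input);
    }
}
```

With this, WhitespaceExample.CollapseWhitespace("This &nbsp;  is\r\na test") should give "This is a test", so a search for "this is" sees a single space between the words.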

nesfrank

The one that strips HTML is used only for some internal clean-up, so that's fine. Is there any way I can add the pattern you gave above to that regular expression? Please help.

Daniel Grunwald

You could take the huge pattern and append |(&nbsp;|\s)+ to it. But you cannot control what the pattern is replaced with, because that is decided somewhere else in the code, where the GetRegExpStripHTML().Replace method is called: a single Replace call uses one replacement for every match, so tags and whitespace runs would both be replaced by the same string.

I would suggest removing the GetRegExpStripHTML() method and instead providing a CleanupHTML(string) method. That way you can apply multiple regular expressions and don't have to do everything with a single replacement using a monster pattern.
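A CleanupHTML(string) method along those lines might look like the sketch below. Only the method name comes from the suggestion above; the HtmlCleanup class name is hypothetical, and the short tag pattern is a simplified stand-in used to keep the example self-contained. In the real code, the existing pattern from GetRegExpStripHTML() would take its place.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlCleanup
{
    // Simplified stand-in for the large tag-stripping pattern (assumption):
    // an opening or closing tag that starts with a letter, up to the next '>'.
    private static readonly Regex TagPattern = new Regex(@"</?[a-zA-Z][^>]*>");

    // Runs of "&nbsp;" entities and/or whitespace characters.
    private static readonly Regex NbspOrWhitespace = new Regex(@"(&nbsp;|\s)+");

    public static string CleanupHTML(string input)
    {
        // Step 1: remove tags, replacing each match with the empty string.
        string result = TagPattern.Replace(input, string.Empty);

        // Step 2: collapse each run of &nbsp;/whitespace into one space.
        // A separate call is what allows a different replacement string.
        result = NbspOrWhitespace.Replace(result, " ");

        return result;
    }
}
```

Callers then use HtmlCleanup.CleanupHTML(text) instead of GetRegExpStripHTML().Replace(text, ""), and further clean-up steps can be added inside the method without touching the call sites.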