reg expression

    nesfrank wrote (#1):

    I have some C# code that searches text using regular expressions. One function, shown below (1), returns a regular expression that strips HTML tags from a string, and it works fine.

    I recently found a bug: when words are separated by multiple spaces, multiple &nbsp; entities, carriage returns, or any mix of these, a search for "this is" does not return the right matches.

    As a test I wrote the following, which works:

    string str = "This  is   a test sentese.";
    str = Regex.Replace(str, @"&nbsp;", " "); //Remove nbsp
    str = Regex.Replace(str, @"\s+", " "); //Remove duplicate spaces.

    How can I combine the logic above into one statement? And how can I make it part of the regular expression below that strips the HTML? Please help.

    1. Function that strips HTML:

    public static Regex GetRegExpStripHTML()
    {
        Regex r = new Regex(@"(<\/?)(?i:(?<element>a(bbr|cronym|ddress|pplet|rea)?|b(ase(f" +
            @"ont)?|do|ig|lockquote|ody|r|utton)?|c(aption|enter|ite|(o(de" +
            @"|l(group)?)))|d(d|el|fn|i(r|v)|l|t)|em|f(ieldset|o(nt|rm)|ra" +
            @"me(set)?)|h([1-6]|ead|r|tml)|i(frame|mg|n(put|s)|sindex)?|kb" +
            @"d|l(abel|egend|i(nk)?)|m(ap|e(nu|ta))|no(frames|script)|o(bj" +
            @"ect|l|pt(group|ion))|p(aram|re)?|q|s(amp|cript|elect|mall|pa" +
            @"n|t(r(ike|ong)|yle)|u(b|p))|t(able|body|d|extarea|foot|h|itl" +
            @"e|r|t)|u(l)?|var))(\s(?<attr>.+?))*>");
        return r;
    }

      Daniel Grunwald wrote (#2):

      A regex just searches for a pattern; the replacement text is supplied separately. Assuming your strip regex is used to replace HTML tags with the empty string, you cannot also use it to replace anything with a space. You can combine your two whitespace patterns like this:

      str = Regex.Replace(str, @"(&nbsp;|\s)+", " ");

      Also, your GetRegExpStripHTML doesn't work. It strips only a few well-formed known tags, and that's not enough to prevent cross-site scripting exploits. What about <SCRIPT> or < SCRIPT>? What about tags you forgot, like <BODY onload="...">? What about encoding the characters using some far-east codepage that your app doesn't understand? The browser's codepage auto-detection might detect the codepage and execute the scripts. What about null bytes like <SCR\0IPT>? Your regex won't see the script tag, but Internet Explorer still does. What about any of a huge number of other tricks to evade XSS filters?

      You need to encode every <, > and &. A blacklist won't get you anywhere, because browsers have lots of ways to execute code that you have never heard about. And even encoding isn't 100% safe against the charset tricks. See http://ha.ckers.org/xss.html to get an idea of what kinds of attacks on XSS filters are possible.
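As a rough sketch of the two suggestions in this reply, the snippet below wraps the combined `(&nbsp;|\s)+` replacement in a helper and shows one way to encode `<`, `>` and `&` with the framework's `WebUtility.HtmlEncode`. The class and method names (`WhitespaceDemo`, `CollapseWhitespace`, `EncodeForHtml`) are made up for illustration and do not come from the thread.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper class; the names are illustrative only.
public static class WhitespaceDemo
{
    // Collapses any run of "&nbsp;" entities and whitespace characters
    // (spaces, tabs, carriage returns, line feeds) into a single space.
    public static string CollapseWhitespace(string input)
    {
        return Regex.Replace(input, @"(&nbsp;|\s)+", " ");
    }

    // Encodes characters such as <, > and & so that the text is displayed
    // literally instead of being interpreted as markup.
    public static string EncodeForHtml(string input)
    {
        return WebUtility.HtmlEncode(input);
    }
}
```

For example, `WhitespaceDemo.CollapseWhitespace("This&nbsp; is \r\n&nbsp;&nbsp;a test")` should give `"This is a test"`. Encoding is only one layer of defence; as noted above, charset tricks can still get around it.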

        nesfrank wrote (#3):

        The one that strips HTML is used only for some internal cleanup, so that's fine. Is there any way I can add the pattern you showed above to that regular expression? Please help.

          Daniel Grunwald wrote (#4):

          You could take the huge pattern and append |(&nbsp;|\s)+. But you cannot control what the pattern is replaced with - that's in some other place in the code, where the GetRegExpStripHTML().Replace method is called. I would suggest removing the GetRegExpStripHTML() method and instead providing a CleanupHTML(string) method - that way, you can apply multiple regular expressions and don't have to do everything with a single replacement using a monster pattern.
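A minimal sketch of what such a `CleanupHTML(string)` method might look like, applying one regular expression per step. The `HtmlCleaner` class name, the simplified `TagPattern` (a stand-in for the long element list in the original post) and the final `Trim()` are assumptions made for illustration. It is meant for internal cleanup only, not as an XSS filter.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical class; only the CleanupHTML name comes from the thread.
public static class HtmlCleaner
{
    // Simplified, assumed tag pattern standing in for the long element
    // list returned by GetRegExpStripHTML() in the original post.
    private static readonly Regex TagPattern =
        new Regex(@"</?[a-z][a-z0-9]*(\s[^>]*)?>", RegexOptions.IgnoreCase);

    // Combined pattern from the earlier reply: runs of &nbsp; and whitespace.
    private static readonly Regex SpacePattern = new Regex(@"(&nbsp;|\s)+");

    public static string CleanupHTML(string input)
    {
        // Step 1: remove tags by replacing them with the empty string.
        string withoutTags = TagPattern.Replace(input, "");

        // Step 2: collapse each run of &nbsp; and whitespace into one space.
        string collapsed = SpacePattern.Replace(withoutTags, " ");

        // Step 3 (an addition, not from the thread): trim the ends.
        return collapsed.Trim();
    }
}
```

Because each step has its own replacement string, the tag pattern can keep replacing with nothing while the whitespace pattern replaces with a space. A single combined pattern cannot do that with one replacement value.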
