regular expression cleanup!

    nesfrank
    wrote on last edited by
    #1

    Hi guys, I need to clean up a string, which is what the function below does, but sometimes it is too slow, and I have been told that a regular expression will do the job much faster. Is there any way to write a regular expression that cleans what the function below does? Please help, guys!

    Thanks,
    Frank

    private string CleanString(string strValue)
    {
        // See Above CString() Function for ASCii Denotations.
        string strReturn = strValue;
        for (int x = 0; x < strValue.Length; x++)
        {
            char charTmp = strValue[x];
            string strTmp = charTmp.ToString();
            int ASCii = (int)charTmp;

            // The following blocks system characters (<= 31) save for the
            // LF (10) / CR (13) needed to preserve Line Returns. It also clears
            // some specific problematic characters, such as double quotes ("),
            // semi-colons (;) and html-tag delimiters (<) and (>).
            if ( (ASCii <= 9) || (ASCii == 11) || (ASCii == 12) || ( (ASCii >= 14) && (ASCii <= 31) ) || (ASCii == 34) || (ASCii == 59) || (ASCii == 60) || (ASCii == 62) || (ASCii == 124) )
            {
                // Strip out these characters as they're encountered...
                strReturn = strReturn.Replace(strTmp, "");
            }

            // This next step eliminates the ever-annoying shift-space character
            // as well as a large range of non-used symbol and system characters.
            // Assumes the intended range is 127-191; the test as posted read (ASCii >= 191).
            if ( ((ASCii >= 127) && (ASCii <= 191)) || (ASCii == 215) || (ASCii == 247) )
            {
                // Strip out these characters...
                strReturn = strReturn.Replace(strTmp, "");
            }
        }
        return strReturn;
    }

      sph3rex
      wrote on last edited by
      #2

      A regexp like [\x00-\x09\x0B-\x0C\x0E-\x1F\x22\x3B\x3C\x3E\x7C] is equivalent to your first test, if ( (ASCii <=9) || (ASCii == 11) || (ASCii == 12) || ( (ASCii >=14) && (ASCii <=31) ) || (ASCii == 34) || (ASCii == 59) || (ASCii == 60) || (ASCii == 62) || (ASCii == 124) ). It does not cover the second test (127 and up, 215, 247). Hope I didn't omit any of the ASCII numbers or miscalculate their values :sigh:
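For anyone who wants to try the character class above in C#, here is a minimal sketch. The class name RegexCleaner is made up for illustration, and the extra \x7F-\xBF\xD7\xF7 part is an assumption about what the second test in the question was meant to cover (codes 127-191, 215 and 247), so adjust it if your intent differs.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexCleaner
{
    // Assumed dirty set: 0-9, 11, 12, 14-31, '"' (34), ';' (59), '<' (60),
    // '>' (62), '|' (124), plus 127-191, 215 and 247 (assumed intent of the
    // second test in the question). LF (10) and CR (13) are kept.
    private static readonly Regex Dirty = new Regex(
        @"[\x00-\x09\x0B\x0C\x0E-\x1F\x22\x3B\x3C\x3E\x7C\x7F-\xBF\xD7\xF7]",
        RegexOptions.Compiled);

    public static string Clean(string value)
    {
        // Every character matched by the class is replaced with nothing.
        return Dirty.Replace(value, string.Empty);
    }
}
```

Calling RegexCleaner.Clean(text) returns a new string and leaves the input untouched. Whether this is actually faster than a hand-written loop is something to measure rather than assume.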

      Code? Yeah i love it fried together with a glass of wine.

        carbon_golem
        wrote on last edited by
        #3

        Looking at this gives me flashbacks... You should think about using a regex utility. I use 'The Regex Coach.' If it's a performance increase you're after, you should first consider rewriting the nasty-looking conditional(s). Be aware that regular expressions can be a huge pain (and hurt performance) if done poorly.

        Scott P

        "Simplicity carried to the extreme becomes elegance."
        -Jon Franklin

          Garth J Lancaster
          wrote on last edited by
          #4

          nesfrank wrote:

          this is too slow

          nesfrank wrote:

          best performance

          I think first of all you need to quantify what you mean here and have a way of timing a simple loop to clean up the strings. I would have done much the same, but in C++ I would have done something like:

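          #include <algorithm>
          #include <cctype>
          #include <string>

          // Predicate: true for any character that is not a letter or a digit.
          template <class T>
          class is_garbage
          {
          public:
              bool operator()(T t) const
              {
                  // Cast first: passing a negative char to the <cctype> functions is undefined.
                  return !std::isalnum(static_cast<unsigned char>(t));
              }
          };

          std::string clean(const std::string& val)
          {
              std::string tmp = val;

              // Clean up the string: kept characters move to the front, it0 marks the new end
              std::string::iterator it0 = std::remove_if(tmp.begin(), tmp.end(), is_garbage<char>());

              // Assign the kept range to the clean string
              std::string result(tmp.begin(), it0);

              return result;
          }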

          Which isn't that much different from what you've got, a bit more readable maybe :-) (NB: you would have to alter 'is_garbage' to match what you want. This is straight from my code; I've made no attempt to replicate your issue.)

          Are you actually searching for strings within strings, replacing etc.? No? Then I personally think a regex is overkill...

          I've copied this excerpt from John Maddock's page where he has the Regex++ library, although it's been in Boost for a while now:

          "Regular expression libraries use a variety of differing algorithms all of which have their pro's and con's, which can make it hard to choose the best implementation for your purpose. This library uses an NFA algorithm which is dedicated to determining fast and accurate sub-expression matches, as well as providing support for features like back-references, which are hard to support using DFA algorithms.

          People who should use this library: Anyone who needs to use wide character strings. Anyone who needs to search non-contiguous data. Anyone who wants fast sub-expression matching. Anyone who wants to customise the regular expression behaviour, or localise the library to a non-English locale.

          People who should look to an alternative DFA based library: Anyone who doesn't care about sub-expression matching, and Wants to search only ANSI-C strings."

          ref http://ourworld.compuserve.com/homepages/john_maddock/regexpp.htm

          'g'
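On the "quantify it first" point, a rough timing harness in C# could look like the sketch below. CleanTimer and its parameters are hypothetical names used only for illustration; System.Diagnostics.Stopwatch is the standard .NET timer. It is a sketch for comparing two cleaning functions on your own sample data, not a rigorous benchmark.

```csharp
using System;
using System.Diagnostics;

public static class CleanTimer
{
    // Runs 'clean' on 'input' the given number of times and returns the elapsed time.
    public static TimeSpan Time(Func<string, string> clean, string input, int iterations)
    {
        // One untimed warm-up call so JIT compilation is not part of the measurement.
        clean(input);

        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            clean(input);
        }
        watch.Stop();
        return watch.Elapsed;
    }
}
```

For example, CleanTimer.Time(CleanString, sampleText, 10000) times the posted function, assuming CleanString is reachable from the calling code and sampleText is a representative input. Run each candidate several times and compare the totals before deciding that anything is "too slow".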

            Robert C Cartaino
            wrote on last edited by
            #5

            I think you should forget about regular expressions as a solution to your problem. You really should consider cleaning up and reworking your code first... at least for the learning experience. Learning how to do these types of manipulations is pretty important and, looking at your code, you're just not there yet.

            All those temp variables and the new strings created in the loop are really expensive. Learn how strings work. Learn what immutable means and what happens when you build strings repeatedly within a loop: every Replace() call allocates a whole new string and scans all of it again.

            As a first step, start from the basics. Learn to traverse a string and manipulate it character by character (as you attempted above). Start with something like this:

            // Needs "using System.Text;" at the top of the file for StringBuilder.
            private string CleanString(string dirtyString)
            {
                StringBuilder cleanString = new StringBuilder(); // Learn what this does and why to use it
                foreach (char c in dirtyString)
                {
                    // Note: C# strings are made up of 2-byte Unicode/UTF-16 characters, not ASCII characters.
                    // The tests must be joined with &&: with ||, the condition is true for every character.
                    if ((c != '\u0009') && (c != '\u000B') /* ... etc. */)
                    {
                        // if character is not dirty, add it to the new string
                        cleanString.Append(c);
                    }
                }
                return cleanString.ToString();
            }
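To make the single-pass idea concrete, here is one possible filled-in version. It is only a sketch: BuilderCleaner and IsDirty are names invented for this example, and the 127-191 range is an assumption about what the second test in the question was meant to cover.

```csharp
using System;
using System.Text;

public static class BuilderCleaner
{
    // Assumed dirty set: 0-9, 11, 12, 14-31, '"', ';', '<', '>', '|',
    // 127-191 (assumed intent), 215 and 247. LF (10) and CR (13) are kept.
    private static bool IsDirty(char c)
    {
        return c <= '\u0009'
            || c == '\u000B' || c == '\u000C'
            || (c >= '\u000E' && c <= '\u001F')
            || c == '"' || c == ';' || c == '<' || c == '>' || c == '|'
            || (c >= '\u007F' && c <= '\u00BF')
            || c == '\u00D7' || c == '\u00F7';
    }

    public static string Clean(string dirtyString)
    {
        // Pre-size the buffer: the result can never be longer than the input.
        StringBuilder cleanString = new StringBuilder(dirtyString.Length);
        foreach (char c in dirtyString)
        {
            if (!IsDirty(c))
            {
                cleanString.Append(c);
            }
        }
        return cleanString.ToString();
    }
}
```

Each character is examined once and only one result string is built at the end, which is the main difference from calling Replace() inside the loop.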

            Get that working, but then start using .NET's built-in methods to improve your code. Next, read about string.IndexOf(char) so you can search the entire string at once for a character. Rewrite your code and get that working. Then, try creating an array of "dirty characters" so you can search for them all at once. Start by reading about this stuff:

            char[] dirtyChars = new char[] { '\u0009', '\u000B' /* , ... etc. */ };
            int dirtyIndex = dirtyString.IndexOfAny(dirtyChars); // -1 when no dirty character is found
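One way the IndexOfAny idea could be turned into a complete routine is sketched here. IndexOfAnyCleaner and BuildDirtyChars are hypothetical names, and the dirty set again assumes the second test in the question was meant to cover 127-191. The loop copies each run of clean text between dirty characters instead of appending one character at a time.

```csharp
using System;
using System.Text;

public static class IndexOfAnyCleaner
{
    // Built once and reused for every call.
    private static readonly char[] DirtyChars = BuildDirtyChars();

    private static char[] BuildDirtyChars()
    {
        StringBuilder set = new StringBuilder();
        for (int code = 0; code <= 255; code++)
        {
            // Same tests as the question, with 127-191 as the assumed range.
            bool dirty = code <= 9 || code == 11 || code == 12
                || (code >= 14 && code <= 31)
                || code == 34 || code == 59 || code == 60 || code == 62 || code == 124
                || (code >= 127 && code <= 191)
                || code == 215 || code == 247;
            if (dirty)
            {
                set.Append((char)code);
            }
        }
        return set.ToString().ToCharArray();
    }

    public static string Clean(string dirtyString)
    {
        int dirtyIndex = dirtyString.IndexOfAny(DirtyChars);
        if (dirtyIndex < 0)
        {
            return dirtyString; // nothing to strip, so nothing is allocated
        }

        StringBuilder cleanString = new StringBuilder(dirtyString.Length);
        int start = 0;
        while (dirtyIndex >= 0)
        {
            // Copy the clean run that precedes the dirty character, then skip past it.
            cleanString.Append(dirtyString, start, dirtyIndex - start);
            start = dirtyIndex + 1;
            dirtyIndex = dirtyString.IndexOfAny(DirtyChars, start);
        }
        // Copy whatever clean text follows the last dirty character.
        cleanString.Append(dirtyString, start, dirtyString.Length - start);
        return cleanString.ToString();
    }
}
```

When most inputs are already clean, the early return avoids building a new string at all. With a dirty set this large, IndexOfAny is not guaranteed to beat the plain per-character loop, so it is worth timing both.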

            Then rewrite your code again and get it working. Then read about regular expressions, if you're curious. Will regular expressions work better? Maybe marginally... that's a really small "maybe." Probably not enough to matter. More readable? I doubt it.

            Enjoy,
            Robert C. Cartaino

            modified on Wednesday, November 19, 2008 5:24 PM
