regular expression cleanup!
-
Hi guys,

I need to clean up a string the way the function below does, but sometimes it is too slow, and I have been told that a regular expression would do a much faster job. Is there any way to write a regular expression that does the same cleanup as the function below?

Please help, guys! Thanks,
Frank

private string CleanString(string strValue)
{
    // See Above CString() Function for ASCii Denotations.
    string strReturn = strValue;
    for(int x=0; x < strValue.Length; x++)
    {
        char charTmp = strValue[x];
        string strTmp = charTmp.ToString();
        int ASCii = (int)charTmp;
        // The following blocks system characters (<= 31) save for the
        // LF (10) / CR (13) needed to preserve Line Returns. It also clears
        // some specific problematic characters, such as double quotes ("),
        // semi-colons (;) and html-tag delimiters (<) and (>).
        if ( (ASCii <=9) || (ASCii == 11) || (ASCii == 12) || ( (ASCii >=14) && (ASCii <=31) ) || (ASCii == 34) || (ASCii == 59) || (ASCii == 60) || (ASCii == 62) || (ASCii == 124) )
        {
            // Strip out these characters as they're encountered...
            strReturn = strReturn.Replace(strTmp, "");
        }
        // This next step eliminates the ever-annoying shift-space character
        // as well as a large range of non-used symbol and system characters.
        if ( ((ASCii >=127) && (ASCii >= 191)) || (ASCii == 215) || (ASCii == 247) )
        {
            // Strip out these characters...
            strReturn = strReturn.Replace(strTmp, "");
        }
    }
    return strReturn;
}
-
A regexp like [\x00-\x09\x0B-\x0C\x0E-\x1F\x22\x3B\x3C\x3E\x7C], which is equivalent to the first condition: if ( (ASCii <=9) || (ASCii == 11) || (ASCii == 12) || ( (ASCii >=14) && (ASCii <=31) ) || (ASCii == 34) || (ASCii == 59) || (ASCii == 60) || (ASCii == 62) || (ASCii == 124) )... It does not cover the second if block. Hope I didn't omit any of the ASCII numbers or miscalculate their values :sigh:
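In C#, a character class like this could be applied with Regex.Replace roughly as sketched below. This is only an illustration: the names RegexCleaner, DirtyChars and Clean are made up here, and the extra range \x7F-\xBF is an assumption about what the second if block was meant to do. As posted, the test (ASCii >= 127) && (ASCii >= 191) is true for every character from 191 upward, which would also strip accented letters; the sketch assumes the intent was 127 through 191, plus 215 and 247.

```csharp
using System.Text.RegularExpressions;

public static class RegexCleaner
{
    // Control characters except LF (0x0A) and CR (0x0D), plus " ; < > |
    // (the first condition in the question).
    // ASSUMPTION: the trailing \x7F-\xBF\xD7\xF7 reflects the presumed intent
    // of the second condition (127..191, 215, 247), not its literal behavior.
    private static readonly Regex DirtyChars = new Regex(
        @"[\x00-\x09\x0B\x0C\x0E-\x1F\x22\x3B\x3C\x3E\x7C\x7F-\xBF\xD7\xF7]",
        RegexOptions.Compiled);

    // Removes every character matched by the class above in a single pass.
    public static string Clean(string value)
    {
        return DirtyChars.Replace(value, string.Empty);
    }
}
```

The Regex object is built once and reused, since constructing it on every call would add avoidable overhead. Whether this actually beats a plain loop is something to measure rather than assume.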
Code? Yeah i love it fried together with a glass of wine.
-
Looking at this gives me flashbacks... You should think about using a regex utility; I use 'The Regex Coach.' If it's a performance increase you're after, you should first consider rewriting the nasty-looking conditional(s). Be aware that regular expressions can be a huge pain (and hurt performance) if done poorly.
Scott P
"Simplicity carried to the extreme becomes elegance."
-Jon Franklin
-
nesfrank wrote:
this is too slow
nesfrank wrote:
best performance
I think first of all you need to quantify what you mean here, and have a way of timing a simple loop that cleans up the strings. I would have done much the same, but in C++ I would have done something like:
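#include <algorithm>
#include <cctype>
#include <string>

// Predicate: true for any character that is neither a letter nor a digit.
// (std::unary_function was removed in C++17 and remove_if does not need it.)
template <class T>
class is_garbage
{
public:
    bool operator ()(T t) const
    {
        // Cast first: passing a negative char to isalpha/isdigit is undefined.
        const unsigned char c = static_cast<unsigned char>(t);
        return !(std::isalpha(c) || std::isdigit(c));
    }
};

std::string clean(const std::string& val)
{
    std::string tmp = val;
    // Clean up the string: remove_if moves the characters to keep to the front.
    std::string::iterator it0 = std::remove_if(tmp.begin(), tmp.end(), is_garbage<char>());
    // Assign to clean string
    return std::string(tmp.begin(), it0);
}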
Which isn't that much different from what you've got, a bit more readable maybe :-) (NB: you would have to alter 'is_garbage' to match what you want; this is straight from my code, and I've made no attempt to replicate your issue.)

Are you actually searching for strings within strings, replacing, etc.? No? Then I personally think a regex is overkill...

I've copied this excerpt from John Maddock's page, where he has the Regex++ library, although it's been in Boost for a while now:

"Regular expression libraries use a variety of differing algorithms all of which have their pro's and con's, which can make it hard to choose the best implementation for your purpose. This library uses an NFA algorithm which is dedicated to determining fast and accurate sub-expression matches, as well as providing support for features like back-references, which are hard to support using DFA algorithms.

People who should use this library:
Anyone who needs to use wide character strings.
Anyone who needs to search non-contiguous data.
Anyone who wants fast sub-expression matching.
Anyone who wants to customise the regular expression behaviour, or localise the library to a non-English locale.

People who should look to an alternative DFA based library:
Anyone who doesn't care about sub-expression matching, and wants to search only ANSI-C strings."

ref http://ourworld.compuserve.com/homepages/john_maddock/regexpp.htm

'g'
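On the point about quantifying "too slow" before changing anything: for the C# function in the question, a timing harness could look something like the sketch below. It uses System.Diagnostics.Stopwatch; the class name CleanTiming, the method Measure and its parameters are hypothetical names chosen for this example, and a rough loop like this is no substitute for a proper benchmarking tool.

```csharp
using System;
using System.Diagnostics;

public static class CleanTiming
{
    // Runs 'clean' over 'input' the given number of times and returns
    // the total elapsed time for the timed calls.
    public static TimeSpan Measure(Func<string, string> clean, string input, int iterations)
    {
        // Warm-up call so one-time JIT compilation is not part of the timing.
        clean(input);

        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            clean(input);
        }
        watch.Stop();
        return watch.Elapsed;
    }
}
```

Calling Measure once with the existing CleanString and once with a candidate replacement, on the same representative input, gives two numbers to compare instead of a guess.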
-
I think you should forget about regular expressions as a solution to your problem. You really should consider cleaning up and reworking your code first... at least for the learning experience. Learning how to do these types of manipulations is pretty important and, looking at your code, you're just not there yet. All those temp variables and the new strings created inside the loop are really expensive. Learn how strings work. Learn what immutable means and what happens when you build strings repeatedly within a loop. As a first step, start from the basics. Learn to traverse a string and manipulate it character by character (as you attempted above). Start with something like this:
// Requires "using System.Text;" for StringBuilder.
private string CleanString(string dirtyString)
{
    StringBuilder cleanString = new StringBuilder(); // Learn what this does and why to use it
    foreach (char c in dirtyString)
    {
        // Note: C# strings are made up of 2-byte Unicode/UTF-16 characters, not ASCII characters.
        // The tests are joined with &&: a character is clean only if it matches none of the dirty ones.
        if ((c != '\u0009') && (c != '\u000B') ... etc. )
        {
            // if character is not dirty, add it to the new string
            cleanString.Append(c);
        }
    }
    return (cleanString.ToString());
}

Get that working, but then start using .NET's built-in methods to improve your code. Next, read about
string.IndexOf(char)
so you can search the entire string at once for a character. Rewrite your code and get that working. Then, try creating an array of "dirty characters" so you can search for them all at once. Start by reading about this stuff:

char[] dirtyChars = new char[] { '\u0009', '\u000B', ... etc. };
int dirtyIndex = dirtyString.IndexOfAny(dirtyChars);

Then rewrite your code again and get it working. Then read about regular expressions, if you're curious. Will regular expressions work better? Maybe marginally... that's a really small "maybe." Probably not enough to matter. More readable?... I doubt it.

Enjoy,
Robert C. Cartaino
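For anyone following along, the two steps described above (a single StringBuilder pass, then IndexOfAny) might end up looking roughly like this. It is a sketch, not the code from this post: StringBuilderCleaner, IsDirty, Clean and CleanWithIndexOfAny are names invented for the example. The character list in IsDirty is taken from the question's conditions, with one labeled assumption: the second range is treated as 127 through 191, because the posted test (ASCii >= 127) && (ASCii >= 191) literally matches everything from 191 upward.

```csharp
using System.Text;

public static class StringBuilderCleaner
{
    // True for characters the question's function strips.
    // ASSUMPTION: the second range is read as 127..191 (presumed intent),
    // not as the literal ">= 191" the posted condition evaluates to.
    private static bool IsDirty(char c)
    {
        int code = c;
        return code <= 9 || code == 11 || code == 12
            || (code >= 14 && code <= 31)
            || code == 34 || code == 59 || code == 60 || code == 62 || code == 124
            || (code >= 127 && code <= 191)
            || code == 215 || code == 247;
    }

    // Step 1: one pass over the input, appending only the clean characters.
    public static string Clean(string dirtyString)
    {
        StringBuilder cleanString = new StringBuilder(dirtyString.Length);
        foreach (char c in dirtyString)
        {
            if (!IsDirty(c))
            {
                cleanString.Append(c);
            }
        }
        return cleanString.ToString();
    }

    // Step 2: let IndexOfAny find each dirty character and copy the clean
    // runs between them in bulk. The caller supplies the dirty characters.
    public static string CleanWithIndexOfAny(string dirtyString, char[] dirtyChars)
    {
        int dirtyIndex = dirtyString.IndexOfAny(dirtyChars);
        if (dirtyIndex < 0)
        {
            return dirtyString; // nothing to strip, so no new string is built
        }

        StringBuilder cleanString = new StringBuilder(dirtyString.Length);
        int start = 0;
        while (dirtyIndex >= 0)
        {
            // Copy the clean run that precedes the dirty character.
            cleanString.Append(dirtyString, start, dirtyIndex - start);
            start = dirtyIndex + 1;
            dirtyIndex = dirtyString.IndexOfAny(dirtyChars, start);
        }
        // Copy whatever follows the last dirty character.
        cleanString.Append(dirtyString, start, dirtyString.Length - start);
        return cleanString.ToString();
    }
}
```

Both versions read the input once and build the result once, whereas the original calls string.Replace inside the loop, and each of those calls scans the whole string and allocates a new one.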
modified on Wednesday, November 19, 2008 5:24 PM