Help with regular expressions

shankbond

Hi, I am new to regular expressions and I am having a problem matching a specific keyword under certain conditions.

string regexstring="\\bhtml\\b|[.]net\\b";
Regex rgx = new Regex(regexstring, RegexOptions.IgnoreCase);
MatchCollection matcol = null;
string st_data = "I am a .net developer, but also know about asp.net, vb.net ; I work on c#/asp.net platform. I also know dhtml, html4.0, html/xml etc etc.";
// I want the regex to capture all occurrences of html where html is a separate word (delimited by \b, a word boundary) or is adjacent to a number, as in html4.0 above.
st_data = System.Web.HttpUtility.HtmlDecode(Regex.Replace(st_data, @"<(.|\n)*?>", string.Empty));
matcol = rgx.Matches(st_data);

        foreach (Match mat in matcol)
        {
            // mat.Value holds the matched text here.
        }

I have tried several variations, but to no avail. I want to match html4.0, but I need only html out of it: a kind of substring match. I hope you understand my point. Any help would be appreciated.

Thanks Shankbond

OriginalGriff

I'm not sure exactly what you are trying to do, but have a look at match-but-don't-capture groups, (?: ). Or explain exactly what you want to achieve and I'll have a look.

Did you know: That by counting the rings on a tree trunk, you can tell how many other trees it has slept with.

shankbond

Hi, I looked into this, but maybe I did something wrong with the match-but-don't-capture groups. What I want is:

1) match html56 but capture only html
2) match 45html but capture only html
3) not match or capture abchtmldef at all (html must not be surrounded by letters; it has to be a word on its own)
4) match html

I used \b(?:\d*)html(?:\d*)\b. It would be nice if someone could help.

Thanks Shankbond

OriginalGriff

The match to do that is quite simple: (?:\d|\s)(?<data>html)(?:\d|\s)

Find (but do not capture) either a digit or a whitespace,
Find and capture in a group called data the four characters 'h', 't', 'm', 'l' in that order,
Find (but do not capture) either a digit or a whitespace.
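
To make the steps above concrete, here is a minimal sketch of reading the named group from C#. The class name NamedGroupDemo and the method CaptureData are hypothetical, introduced only for illustration; the pattern is the one quoted above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Pattern from the post above: a digit or whitespace on each side,
    // with only the word itself stored in the group named "data".
    private const string Pattern = @"(?:\d|\s)(?<data>html)(?:\d|\s)";

    // Returns the text of the "data" group for every match in the input.
    // (Hypothetical helper, not part of the original thread.)
    public static string[] CaptureData(string input)
    {
        MatchCollection matches = Regex.Matches(input, Pattern, RegexOptions.IgnoreCase);
        var result = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            // matches[i].Value would also contain the neighbouring digit or
            // whitespace; the named group holds just the four letters.
            result[i] = matches[i].Groups["data"].Value;
        }
        return result;
    }
}
```

Calling NamedGroupDemo.CaptureData("I know html4.0 and 45html too") should return two entries, both "html". Note that this pattern requires a digit or whitespace on both sides, so html at the very start or end of the text, or in html/xml, is not matched.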

But I doubt that will solve your problem! What are you trying to achieve? It looks as if you are trying to process a CV and extract all the relevant job skills without manually looking at it. If so, then you may need to be a bit more clever / thorough about it, particularly with a trigger word such as "html" which appears in every web page...

OriginalGriff

Just to add to what I said, go and get a copy of Expresso: it examines and generates regular expressions. It's free, and it really can help you create and understand complicated expressions. You can also feed it a sample file that you want to examine, and it will show you what the Regex will capture. I wish I'd written it!

shankbond

Thanks, but I already have it :). My query is solved now; I got the solution by using

(?<=\\d+(\\.\\d*)?|\\b)html(?=\\d+(\\.\\d*)?|\\b)

But I have a new question now: (?:.....) is also a non-capturing group, so in theory I could use it in place of the lookahead and lookbehind, but that does not work here. Any ideas why?

Thanks Shankbond

shankbond

OriginalGriff wrote:

particularly with a trigger word such as "html" which appears in every web page...

Yes, you are absolutely right. I did that with the help of some JavaScript.

Thanks Shankbond

shankbond

Can someone explain that? I am curious about it.

Thanks Shankbond
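
On the open question of why (?:...) cannot stand in for lookahead and lookbehind: a non-capturing group only skips creating a numbered group. It still consumes the characters it matches, so they end up in Match.Value. Lookahead (?=...) and lookbehind (?<=...) are zero-width assertions: they test the surrounding text without making it part of the match. The sketch below compares the two. The class LookaroundDemo, its members, and the sample sentence are hypothetical names chosen for illustration; the first pattern is the one from the thread, written as a verbatim string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Lookaround version from the thread: the numbers are checked but not consumed.
    public const string WithLookaround = @"(?<=\d+(\.\d*)?|\b)html(?=\d+(\.\d*)?|\b)";

    // Same alternatives as ordinary non-capturing groups (variant written
    // for comparison only): the numbers become part of the match.
    public const string WithNonCapturing = @"(?:\d+(\.\d*)?|\b)html(?:\d+(\.\d*)?|\b)";

    // Returns Match.Value for every match of the pattern in the input.
    public static string[] MatchValues(string pattern, string input)
    {
        MatchCollection matches = Regex.Matches(input, pattern, RegexOptions.IgnoreCase);
        var values = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            values[i] = matches[i].Value;
        }
        return values;
    }
}
```

For the input "html4.0 and 45html but not abchtmldef", MatchValues with WithLookaround should give "html" and "html", while WithNonCapturing should give "html4.0" and "45html". Neither pattern matches abchtmldef. The variable-length lookbehind used here is accepted by the .NET regex engine; many other engines reject it.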