Regex weakling help

afronaut

Hey all, work is really boring, so I'm going to write a screen scraper. But the best way to do this is with regexes, something I need to work on because it's a particular area of weakness. If a page has the following structure: foo

foo

Is there a regex I could use to pick up what's between the tags? Like one regex to grab the title, another for the first table, and another for the second? Thanks much.

Nick Parker

Sure, here is a quick example so I am sure you can expand on it:

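// Requires: using System; using System.Text.RegularExpressions;
private void ShowContent(string s)
{
	// Look for TD elements whose content is letters only (case-insensitive).
	Regex r = new Regex("<td>*[a-z]*</td>", RegexOptions.IgnoreCase);
	Match m = r.Match(s);
	while (m.Success)
	{
		// Strip the 4-character opening tag and the 5-character closing tag.
		string val = m.Value;
		if (val.Length >= 9)
			Console.WriteLine(val.Substring(4, val.Length - 9));
		m = m.NextMatch();
	}
}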


Heath Stewart

* is not a wildcard, though; it is a quantifier meaning zero or more of the preceding element. You should actually just use "<td>[A-Za-z0-9]*</td>", which means that 0 or more alphanumeric characters (there are escape sequences you can use, too) are allowed between the TD tags. What you have now will match 0 or more opening TD elements as well.

This posting is provided "AS IS" with no warranties, and confers no rights.
Software Design Engineer, Developer Division Sustained Engineering, Microsoft
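To make the corrected pattern concrete, here is a minimal sketch that wraps the character class in a capture group, so the cell text comes back without any manual trimming of the tags. The class and method names (CellScraper, ExtractCells) are made up for illustration, and it assumes the cells have no attributes and contain only letters and digits:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CellScraper
{
    // Hypothetical helper: returns the text of every TD element
    // whose content is purely alphanumeric.
    public static List<string> ExtractCells(string html)
    {
        var cells = new List<string>();
        // The parentheses capture only the text between the tags.
        // IgnoreCase lets the pattern match <TD> as well as <td>.
        var pattern = new Regex("<td>([A-Za-z0-9]*)</td>", RegexOptions.IgnoreCase);
        foreach (Match m in pattern.Matches(html))
        {
            // Groups[0] is the whole match; Groups[1] is the captured content.
            cells.Add(m.Groups[1].Value);
        }
        return cells;
    }
}
```

Calling CellScraper.ExtractCells("<tr><td>foo</td><TD>bar42</TD><td>a b</td></tr>") yields "foo" and "bar42"; the third cell is skipped because a space is not in the character class.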

Heath Stewart

Regex can be horribly unreliable and a complete pain when unforeseen formats creep up. I recommend using SgmlReader, written by a fellow Microsoftie. HTML is, if you don't know, an SGML grammar, as are XML and XHTML (which is actually an XML grammar that only looks like HTML because it uses the XHTML namespace as the default namespace, so that namespace prefixes aren't required).
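As a rough illustration of why a parser is the more robust route: once the markup is well-formed XML, the title and the first or second table can be picked out with XPath through the standard System.Xml classes. This sketch does not use SgmlReader itself. It assumes the input is already well-formed and declares no namespace, and the PageReader class and its methods are hypothetical names. Real XHTML declares the XHTML namespace, so the queries would then need an XmlNamespaceManager and a prefix.

```csharp
using System;
using System.Xml;

public static class PageReader
{
    // Hypothetical helper: parses well-formed markup into a DOM.
    public static XmlDocument Load(string markup)
    {
        var doc = new XmlDocument();
        doc.LoadXml(markup);
        return doc;
    }

    // Returns the page title, or null when there is no title element.
    public static string GetTitle(XmlDocument doc)
    {
        XmlNode title = doc.SelectSingleNode("//title");
        return title == null ? null : title.InnerText;
    }

    // Returns the n-th table in document order (1-based), or null if absent.
    public static XmlNode GetTable(XmlDocument doc, int n)
    {
        return doc.SelectSingleNode("(//table)[" + n + "]");
    }
}
```

With a document loaded from "<html><head><title>foo</title></head><body><table><tr><td>a</td></tr></table><table><tr><td>b</td></tr></table></body></html>", GetTitle returns "foo" and GetTable(doc, 2).InnerText is "b".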