Extract certain text from an HTML page.

Buckleyindahouse

I'm trying to make a username grabber for a game's highscores page: http://hiscore.runescape.com/hiscores.ws The name I want to grab is in the center of the page, where it says "KingDuffy 1". How would I go about doing this? Thanks, Buckley.

Jan Sommer

This is how you extract the source: http://www.experts-exchange.com/Programming/Languages/C_Sharp/Q_20739698.html I don't know the best way to grab the usernames, but I would probably put the whole source into a string, cut out a substring by figuring out where the highscore list starts and ends, and then maybe split the table rows apart. But that's a lot of code, and I think you should look into regular expressions, which might solve your problem in a better way.
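The substring-and-split idea could look roughly like the sketch below. It is only an illustration: the class name `RowSplitter`, the method `ExtractRows`, and the start/end markers are hypothetical, and the real highscores page may need different markers.

```csharp
using System;
using System.Collections.Generic;

public static class RowSplitter
{
    // Returns the raw text of every <tr>...</tr> found between the first
    // occurrence of startMarker and the next occurrence of endMarker.
    // Both markers are assumptions; pick strings that bracket the
    // highscore list in the actual page source.
    public static List<string> ExtractRows(string html, string startMarker, string endMarker)
    {
        var rows = new List<string>();

        int start = html.IndexOf(startMarker, StringComparison.OrdinalIgnoreCase);
        if (start < 0)
            return rows;                        // list not found at all

        int end = html.IndexOf(endMarker, start, StringComparison.OrdinalIgnoreCase);
        if (end < 0)
            end = html.Length;                  // no end marker: use the rest of the page

        string table = html.Substring(start, end - start);

        const string rowClose = "</tr>";
        int pos = 0;
        while (true)
        {
            int rowStart = table.IndexOf("<tr", pos, StringComparison.OrdinalIgnoreCase);
            if (rowStart < 0)
                break;
            int rowEnd = table.IndexOf(rowClose, rowStart, StringComparison.OrdinalIgnoreCase);
            if (rowEnd < 0)
                break;                          // unterminated row: stop scanning
            rows.Add(table.Substring(rowStart, rowEnd + rowClose.Length - rowStart));
            pos = rowEnd + rowClose.Length;
        }
        return rows;
    }
}
```

Each returned row still contains markup, so a second pass is needed to pull the name out of it.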

Buckleyindahouse

OK, thanks, I'm going to try that. By the way, that link goes to a site where they only help you if you pay, and I don't like that. Does anyone have any more insight on this? Thanks, Buckley.

Jan Sommer

Scroll to the bottom of the page that I linked to :) EDIT: Weird, you can only see the answer if you come from Google. Paste the link into google.com, visit the page from there, and then scroll to the bottom.

Buckleyindahouse

OK, I looked at it. I already know how to get the HTML source, but I need to retrieve only the usernames. This is the HTML source: http://paste-it.net/public/dfe778b/ Line 339 contains one of the usernames, "Kingduffy 1", but it's not always on line 339, so that's why I need to know how to strip out the markup and retrieve all the usernames on that page.
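One way to avoid depending on line numbers is a regular expression that matches on the markup around each name. The sketch below assumes, without having checked the actual page, that every username is the text of a link whose href contains some recognizable marker. The class `UsernameGrabber`, the method `ExtractNames`, and the `hrefMarker` parameter are all hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class UsernameGrabber
{
    // Collects the link text of every <a> element whose href contains
    // hrefMarker. ASSUMPTION: usernames on the page are link text and
    // their links share a common href fragment; adjust to the real markup.
    public static List<string> ExtractNames(string html, string hrefMarker)
    {
        string pattern =
            "<a\\s[^>]*href=\"[^\"]*" + Regex.Escape(hrefMarker) +
            "[^\"]*\"[^>]*>(?<name>[^<]+)</a>";

        var names = new List<string>();
        foreach (Match m in Regex.Matches(html, pattern, RegexOptions.IgnoreCase))
        {
            // Decode entities such as &amp; and trim surrounding whitespace.
            names.Add(WebUtility.HtmlDecode(m.Groups["name"].Value).Trim());
        }
        return names;
    }
}
```

Because the pattern keys on the link markup rather than on position, it keeps working when the name moves to a different line. It would need adjusting if the names turn out not to be link text.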

Bliedtke

What I've done in the past to grab information from a web page is to take the page as a string (as returned by the ReadToEnd() method of the StreamReader used to download it) and break it into an array of HTML tokens. It is then pretty straightforward to scan the array to find the data you want. The tokenizer I created to do this is as follows:

    // Requires: using System.Collections; using System.Web;

    /// <summary>
    /// Tokenize the passed string, which contains an HTML page, into HTML elements.
    /// </summary>
    /// <param name="InStr">The HTML page to parse.</param>
    /// <returns>An array of strings that contains the separate elements of the passed HTML page.</returns>
    private string[] Tokenize(string InStr)
    {
        ArrayList buf = new ArrayList();
        int begin = 0;
        bool in_tag = false;

        while (begin < InStr.Length)
        {
            if (!in_tag)
            {
                int end = InStr.IndexOf("<", begin);   // find index of start of next HTML tag
                if (end == -1)
                    end = InStr.Length;                // no more tags: the rest is text
                if (begin < end)                       // if there is length to the token
                    buf.Add(HttpUtility.HtmlDecode(InStr.Substring(begin, end - begin))); // add text token to list
                begin = end;
                in_tag = true;
            }
            else
            {
                int end = InStr.IndexOf(">", begin);   // find index of end of HTML tag
                if (end == -1)
                    end = InStr.Length - 1;            // unterminated tag: take the rest of the string
                buf.Add(InStr.Substring(begin, end - begin + 1)); // add HTML tag to list
                begin = end + 1;
                in_tag = false;
            }
        }
        return (string[])buf.ToArray(typeof(string));
    }
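To illustrate the "scan the array" step, here is a rough, self-contained sketch. The `TokenScanner` class and both of its methods are hypothetical. Its `Tokenize` is a simplified regex-based stand-in for the tokenizer above (it does not HTML-decode text), and the rule "a name is the text token right after a link tag" is an assumption about the page, not something confirmed in this thread.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TokenScanner
{
    // Simplified stand-in tokenizer: each token is either a complete
    // tag ("<...>") or a run of text between tags. A trailing "<" with
    // no closing ">" is skipped.
    public static List<string> Tokenize(string html)
    {
        var tokens = new List<string>();
        foreach (Match m in Regex.Matches(html, "<[^>]*>|[^<]+"))
            tokens.Add(m.Value);
        return tokens;
    }

    // Returns every text token that directly follows a tag starting
    // with tagPrefix (for example "<a " for links).
    public static List<string> TextAfterTag(IList<string> tokens, string tagPrefix)
    {
        var found = new List<string>();
        for (int i = 0; i + 1 < tokens.Count; i++)
        {
            bool isWantedTag = tokens[i].StartsWith(tagPrefix, StringComparison.OrdinalIgnoreCase);
            bool nextIsText = !tokens[i + 1].StartsWith("<", StringComparison.Ordinal);
            if (isWantedTag && nextIsText)
                found.Add(tokens[i + 1].Trim());
        }
        return found;
    }
}
```

Calling `TokenScanner.TextAfterTag(TokenScanner.Tokenize(html), "<a ")` returns the text of every link on the page, so some filtering on the tag's attributes would still be needed to keep only the usernames.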

Bliedtke

Whoops. The posting converted the '<' and '>' characters to their HTML equivalents, '&lt;' and '&gt;' respectively, making this hard to read. Instead of cluttering up the thread by posting a new snippet, email me at liedtke@frii.com if you want the code. Brian

Bliedtke

Joshua, your email address (the gmail.com account) is bouncing. Please email me again from a valid address. Brian