Getting XHTML tags by tag name

Jordanwb

In JavaScript you can run document.getElementsByTagName ("img") to get all of the image tags. Can you do something similar in C#? Also, you can do this in JavaScript:

var image = document.createElement ("img");
image.getAttribute ("src");

Again, can you do something similar in C#? And if so, how? This is how I'm getting the webpage: http://www.tech-recipes.com/rx/1954/get_web_page_contents_in_code_with_csharp/ Thanks.

Lost User

If it's really XHTML and not just plain old HTML, you could treat it as XML: use XmlDocument and its SelectNodes(string xpath) method. At least, that's what I would do. The XPath would be something like "//img/@src", I think (if you want the src attributes of all img elements, as your code seems to).
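A minimal sketch of that suggestion, assuming the page is well-formed XML and has already been downloaded into a string. The class and method names (ImageSources, GetImageSources) are made up for illustration. Note that if the document declares the XHTML default namespace (xmlns="http://www.w3.org/1999/xhtml"), the plain "//img/@src" query will match nothing, and an XmlNamespaceManager with a prefix would be needed instead.

```csharp
using System.Collections.Generic;
using System.Xml;

public static class ImageSources
{
    // Hypothetical helper: returns the src value of every img element,
    // in document order. Assumes no default XML namespace is declared.
    public static List<string> GetImageSources(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the markup is not well-formed

        var sources = new List<string>();
        foreach (XmlNode attr in doc.SelectNodes("//img/@src"))
        {
            sources.Add(attr.Value);
        }
        return sources;
    }
}
```

Calling GetImageSources on a string such as "<html><body><img src=\"a.png\" /></body></html>" should give a list containing "a.png". Reading one attribute from a single element, like getAttribute in JavaScript, can be done with XmlElement.GetAttribute("src").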

Jordanwb

Okay, I'll give that a try. I've found out that the XHTML isn't completely valid. Some of the tags aren't closed properly. Here's an excerpt:

<META HTTP-EQUIV="content-type" CONTENT="text/html; charset=UTF-8">
<meta name="robots" content="noarchive"/>
<meta name="description" content="/a/ is 4chan's imageboard dedicated to the discussion of Japanese anime and manga."/>
<meta name="keywords" content="imageboard,japan,anime,manga"/><link rel="alternate stylesheet" type="text/css" href="http://zip.4chan.org/yotsuba.9.css" title="Yotsuba"><link rel="stylesheet" type="text/css" href="http://zip.4chan.org/yotsublue.9.css" title="Yotsuba B"><link rel="alternate stylesheet" type="text/css" href="http://zip.4chan.org/futaba.9.css" title="Futaba"><link rel="alternate stylesheet" type="text/css" href="http://zip.4chan.org/burichan.9.css" title="Burichan"><link rel="alternate" title="RSS feed" href="/a/index.rss" type="application/rss+xml" /><title>/a/ - Animu & Mango</title>

While some of the tags are somewhat formed properly, some aren't. The first one is easy:

result.Replace ("\"/>", "\" />");

I think I could use a regex for the tags missing a closing "/", but I don't know how to do that. [Edit] Okay, the result.Replace bit isn't working.
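One likely reason the Replace call seems to do nothing, assuming result is a string: .NET strings are immutable, so Replace returns a new string rather than changing result in place, and the return value has to be kept. A small sketch with a made-up helper name (AddSpaceBeforeSelfClose):

```csharp
public static class ReplaceDemo
{
    // Hypothetical helper: inserts a space before each "/>" that directly
    // follows a double quote.
    public static string AddSpaceBeforeSelfClose(string html)
    {
        // Replace does not modify html; it returns a new string.
        return html.Replace("\"/>", "\" />");
    }
}
```

In the original snippet the equivalent would be result = result.Replace ("\"/>", "\" />");. This only changes spacing; it does not close the tags that are missing the "/".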

modified on Thursday, June 25, 2009 1:18 PM

Lost User

I'm afraid you may have to use a custom parser for HTML. That will work, but it's a lot of work to write.

Jordanwb

Well, all of the HTML seems to be properly nested as per the XHTML spec, but some of the tags simply aren't closed properly. All I may need is C#'s equivalent of PHP's preg_replace function.
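The closest .NET counterpart to preg_replace is Regex.Replace in System.Text.RegularExpressions. The sketch below is only an illustration of the idea for the meta and link tags in the excerpt above. The names (TagFixer, SelfCloseVoidTags) are made up, and the pattern assumes no attribute value contains a ">" character, which real pages may violate.

```csharp
using System.Text.RegularExpressions;

public static class TagFixer
{
    // Matches <meta ...> or <link ...>, with or without a trailing slash.
    // Group 1 is the tag name, group 2 is the attribute text.
    private static readonly Regex VoidTag = new Regex(
        @"<(meta|link)\b([^>]*?)\s*/?>",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: rewrites every matched tag so it ends in " />".
    public static string SelfCloseVoidTags(string html)
    {
        return VoidTag.Replace(html, "<${1}${2} />");
    }
}
```

Tags that are already self-closed come out unchanged apart from spacing, so running it twice should be harmless. It does nothing about other problems such as unescaped characters or mismatched nesting.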

Lost User

If you can get that to work then it's probably less work, but it may not be as robust. Up to you though :)

Jordanwb

I found more malformed HTML on other boards. It seems that my program will be significantly more complicated than I thought. :( Adding an XHTML Transitional doctype produces 431 errors on one thread alone.