Getting XHTML tags by tag name
-
In JavaScript you can run document.getElementsByTagName("img") to get all of the image tags. Can you do something similar in C#? Also, you can do this in JavaScript:
var image = document.createElement("img");
image.getAttribute("src");
Again, can you do something similar in C#? And if so, how? This is how I'm getting the webpage: http://www.tech-recipes.com/rx/1954/get_web_page_contents_in_code_with_csharp/ Thanks.
-
If it's really XHTML and not just plain old HTML, you could treat it as XML: use XmlDocument and its
SelectNodes(string xpath)
method. At least, that's what I would do. The XPath would be something like "//img/@src",
I think (if you want the src attributes of all the img elements, as your code seems to do).
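As a rough illustration of the suggestion above, here is a minimal sketch that loads a well-formed fragment into XmlDocument and collects every src attribute with SelectNodes. The class name ImageSources, the method GetImageSources, and the sample markup are made up for this example. It assumes the markup has no default namespace; a page that declares xmlns="http://www.w3.org/1999/xhtml" would need an XmlNamespaceManager and a prefixed XPath instead.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class ImageSources
{
    // Hypothetical helper: returns the src value of every img element, in document order.
    public static List<string> GetImageSources(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the markup is not well-formed XML

        var sources = new List<string>();
        // "//img/@src" selects the src attribute node of each img element.
        foreach (XmlNode attribute in doc.SelectNodes("//img/@src"))
        {
            sources.Add(attribute.Value);
        }
        return sources;
    }

    public static void Main()
    {
        // Made-up sample input, not taken from the page in question.
        string sample = "<html><body><img src=\"a.png\" /><p><img src=\"b.png\" alt=\"\" /></p></body></html>";
        foreach (string src in GetImageSources(sample))
        {
            Console.WriteLine(src);
        }
    }
}
```

For something closer to the JavaScript calls in the question, XmlDocument also has GetElementsByTagName("img"), and each resulting XmlElement has GetAttribute("src").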
-
Okay, I'll give that a try. I've found out that the XHTML isn't completely valid. Some of the tags aren't closed properly. Here's an excerpt:
<META HTTP-EQUIV="content-type" CONTENT="text/html; charset=UTF-8">
<meta name="robots" content="noarchive"/>
<meta name="description" content="/a/ is 4chan's imageboard dedicated to the discussion of Japanese anime and manga."/>
<meta name="keywords" content="imageboard,japan,anime,manga"/>
<link rel="alternate stylesheet" type="text/css" href="http://zip.4chan.org/yotsuba.9.css" title="Yotsuba">
<link rel="stylesheet" type="text/css" href="http://zip.4chan.org/yotsublue.9.css" title="Yotsuba B">
<link rel="alternate stylesheet" type="text/css" href="http://zip.4chan.org/futaba.9.css" title="Futaba">
<link rel="alternate stylesheet" type="text/css" href="http://zip.4chan.org/burichan.9.css" title="Burichan">
<link rel="alternate" title="RSS feed" href="/a/index.rss" type="application/rss+xml" />
<title>/a/ - Animu & Mango</title>
While some of the tags are formed more or less properly, some aren't. The first one is easy:
result.Replace ("\"/>", "\" />");
I think I could use a regex for the tags missing a closing "/", but I don't know how to do that. [Edit] Okay, the result.Replace bit isn't working.
modified on Thursday, June 25, 2009 1:18 PM
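A likely reason the Replace call appears to do nothing: .NET strings are immutable, so string.Replace returns a new string and leaves the original untouched. The sketch below assigns the return value back. The class and method names are invented for illustration, and it assumes result is an ordinary string holding the page source.

```csharp
using System;

public static class ReplaceDemo
{
    // Hypothetical helper: inserts a space before "/>" wherever it directly follows a quote.
    public static string AddSpaceBeforeSlash(string result)
    {
        // Replace does not modify 'result' in place; keep the string it returns.
        result = result.Replace("\"/>", "\" />");
        return result;
    }

    public static void Main()
    {
        string tag = "<meta name=\"robots\" content=\"noarchive\"/>";
        Console.WriteLine(AddSpaceBeforeSlash(tag));
    }
}
```

Note that "/>" and " />" are equally well-formed as far as an XML parser is concerned, so this step should not be needed for XmlDocument to accept the tag.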
-
I'm afraid you may have to use a custom parser for HTML. That will work, but it's a lot of work to write.
-
Well, all of the HTML seems to be properly nested as per the XHTML specs, but some of the tags simply aren't closed properly. All I may need is C#'s equivalent of PHP's preg_replace function.
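The closest counterpart to preg_replace in .NET is Regex.Replace from System.Text.RegularExpressions. The following is only a sketch of how it might be applied here: the class name TagFixer, the method Repair, and the list of tag names are assumptions for illustration, and the pattern will misfire on attribute values that contain ">". It also escapes bare ampersands, since text such as the "&" in the excerpt's title element would be rejected by an XML parser even after the tags are closed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagFixer
{
    // Empty elements to self-close; this list is an assumption, extend it as needed.
    private static readonly Regex VoidTag = new Regex(
        @"<(meta|link|img|br|hr|input)\b([^>]*?)\s*/?>",
        RegexOptions.IgnoreCase);

    // An ampersand that does not begin an entity or character reference.
    private static readonly Regex BareAmpersand = new Regex(
        @"&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#x[0-9A-Fa-f]+;)");

    public static string Repair(string html)
    {
        // $1 is the tag name, $2 its attributes; every match is rewritten to end in " />".
        string closed = VoidTag.Replace(html, "<$1$2 />");
        return BareAmpersand.Replace(closed, "&amp;");
    }

    public static void Main()
    {
        // Made-up sample input in the style of the excerpt.
        string sample = "<link rel=\"stylesheet\" href=\"a.css\"><title>Animu & Mango</title>";
        Console.WriteLine(Repair(sample));
    }
}
```

Tags that are already closed are matched too and simply normalized to the " />" form, so running Repair twice gives the same result as running it once.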
-
If you can get that to work then it's probably less work, but it may not be as robust. Up to you though :)