HTML parser
-
I am looking for an HTML parser that can output XML from an input stream (IMDB search results), or that parses the page into structures I can filter by tag. I tried HtmlCleaner, but it does not seem to work with the IMDB site. I get this error:
Exception in thread "main" java.io.IOException: Server returned HTTP response code: 403 for URL: http://www.imdb.com/
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1436)
at java.net.URL.openStream(URL.java:1010)
at org.htmlcleaner.Utils.getCharsetFromContent(Utils.java:121)
at org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:299)
at org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:317)
at Main.main(Main.java:25)
I also tried HTMLParser, but I can't get correct data with it. If anyone has experience with parsing IMDB HTML, I would be really thankful for any kind of help. Thanks
-
Since you haven't shown any code, helping you seems futile, but I'm sure you have checked the meaning of HTTP return code 403: http://en.wikipedia.org/wiki/HTTP_403. Just a well-meant hint. Cheers!
-
Hi, 403 means "forbidden". That could have many causes, but it is decided by the server, and the net result is that you aren't getting any data. So it is not the parsing that is at fault; it is the way you ask for the web page. I tried http://www.imdb.com with my existing C# program and it loads fine. One thing I remember very well doing after some sporadic failures is providing a realistic "user agent", which is a string describing the client's characteristics/capabilities. I use "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.19) Gecko/2010031422 Firefox/3.0.17", which was what Firefox emitted at that time. I suggest you figure out where and how to specify such a user agent in your code. :)
Luc Pattyn [Forum Guidelines] [My Articles] Nil Volentibus Arduum
Please use <PRE> tags for code snippets, they preserve indentation, improve readability, and make me actually look at the code.
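For the Java side, the user-agent suggestion above might look roughly like the sketch below. It uses only the standard java.net classes: it opens the connection itself, sets the User-Agent header, and then exposes the response as an InputStream. The class name ImdbFetch and the helper open are made up for illustration. The assumption is that the stream would then be handed to a parser overload that accepts an InputStream instead of a URL, so the parser never opens its own connection with the default Java user agent. Whether IMDB accepts the request is still up to the server.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

class ImdbFetch {
    // User-agent string quoted in the reply above; any realistic browser string may work.
    static final String USER_AGENT =
        "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.19) Gecko/2010031422 Firefox/3.0.17";

    // Hypothetical helper: prepares a connection with the User-Agent header set.
    // openConnection() does not contact the server yet; that happens on getInputStream().
    static URLConnection open(String address) throws IOException {
        URLConnection connection = new URL(address).openConnection();
        connection.setRequestProperty("User-Agent", USER_AGENT);
        return connection;
    }

    public static void main(String[] args) throws IOException {
        URLConnection connection = open("http://www.imdb.com/find?s=all&q=burek");
        try (InputStream in = connection.getInputStream()) {
            // Assumption: pass "in" to the parser here instead of passing the URL.
            // This sketch only counts the bytes to show that data arrives.
            byte[] buffer = new byte[8192];
            long total = 0;
            int read;
            while ((read = in.read(buffer)) != -1) {
                total += read;
            }
            System.out.println("Received " + total + " bytes");
        }
    }
}
```

The stack trace in the question shows the 403 being raised from URL.openStream inside the parser, which sends no browser-like user agent, so taking over the connection is one plausible way around it.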
-
The code is like this:
CleanerProperties props = new CleanerProperties();
HtmlCleaner test = new HtmlCleaner();
test.clean(new URL("http://www.imdb.com/find?s=all&q=burek"));
In C# I have no problems either, but in Java I get the errors I described. Is there any other parser that would be useful for IMDB? Thanks
-
My C# code doesn't work for that URL, i.e. it seems to return only half an HTML header and no body; there is a link tag though. My FF browser works, however its "view page source" shows exactly the same stuff my C# app does. I'm puzzled by the link tag:
<link rel="canonical" href="http://www.imdb.com/find?s=all&q=burek" />
The "canonical" value is not listed in the reference I checked!!! There are Google hits about it though... :)
-
I used another way of parsing it. XML serialization with IMDB doesn't work, so I am doing it with the TagNodes that HtmlParsers supports, and it's quite easy! Maybe you know how to replace this special character which I am getting:
"""
With replace it doesn't work. Thanks
modified on Monday, March 14, 2011 11:42 PM
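On the replace question, here is a small Java sketch under two loud assumptions: that the special character is the HTML entity &quot; (the post does not show it clearly), and that the replace call had no effect because its return value was ignored. Java strings are immutable, so replace returns a new string and leaves the original untouched. The class EntityFix and the sample title are made up for illustration.

```java
class EntityFix {
    // Assumption: the unwanted text is the literal entity &quot;
    static String unescapeQuotes(String text) {
        // replace() returns a new String; the argument itself is never modified.
        return text.replace("&quot;", "\"");
    }

    public static void main(String[] args) {
        String title = "&quot;Burek&quot; (2009)"; // made-up sample value

        title.replace("&quot;", "\"");   // result discarded, title is unchanged
        title = unescapeQuotes(title);   // result assigned, title is updated

        System.out.println(title);
    }
}
```

If other entities show up as well, a general-purpose unescaping utility from a library may be a better fit than chaining replace calls.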
-
I would hazard a guess that the 403 Forbidden error is the result of IMDB not allowing their web interfaces to be used as a web service (querying for data directly without viewing the content on their site, including the all-important advertising :)). They likely accomplish this with a bizarre browser behaviour trick, as you and Luc seem to have seen with the strange canonical link tag. You may want to look into whether IMDB hosts a query interface for applications, but if they do, it's likely a premium (i.e. paid) service. Cheers!