HTML parser

Aljaz111

I am looking for an HTML parser that can output XML from an input stream (IMDB search results), or just parse the code into structures that can be filtered by tag. I tried HtmlCleaner, but it doesn't work with the IMDB site. I get this error:

Exception in thread "main" java.io.IOException: Server returned HTTP response code: 403 for URL: http://www.imdb.com/
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1436)
at java.net.URL.openStream(URL.java:1010)
at org.htmlcleaner.Utils.getCharsetFromContent(Utils.java:121)
at org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:299)
at org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:317)
at Main.main(Main.java:25)

I also tried HTMLParser, but I can't get the correct data with it. If anyone has experience parsing IMDB HTML, I would be really thankful for any kind of help. Thanks

Manfred Rudolf Bihy

Since you haven't shown any code, helping you seems futile, but I'm sure you have checked the meaning of HTTP status code 403: http://en.wikipedia.org/wiki/HTTP_403. Just a well-meant hint. Cheers!

Luc Pattyn

Hi, 403 means "forbidden", which could have many causes; it is decided by the server, and the net result is that you aren't getting any data. So it is not the parsing that is at fault, it is the way you ask for the web page. I tried http://www.imdb.com with my existing C# program and it loads fine. One thing I remember very well doing after some sporadic failures is providing a realistic "useragent", which is a string describing the client's characteristics/capabilities. I use "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.19) Gecko/2010031422 Firefox/3.0.17", which was what Firefox emitted at that time. I suggest you figure out where and how to specify such a useragent in your code. :)
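For the Java side, here is a minimal sketch of one way to send a user-agent with the standard java.net classes instead of letting the parser open the URL itself. The class name BrowserLikeFetch and its method names are made up for illustration, and whether a browser-like user-agent is enough to avoid the 403 from IMDB is an assumption, not something this thread confirms.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;

class BrowserLikeFetch {
    // User-agent string quoted in the post above; any realistic browser string should do.
    static final String USER_AGENT =
        "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.19) Gecko/2010031422 Firefox/3.0.17";

    // Prepares a connection with the User-Agent header set. Nothing is sent yet.
    static HttpURLConnection prepare(String url, String userAgent) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestProperty("User-Agent", userAgent);
        return conn;
    }

    // Sends the request and returns the body stream, or fails with the status code
    // (a 403 would surface here, before any parser is involved).
    static InputStream openStream(String url) throws IOException {
        HttpURLConnection conn = prepare(url, USER_AGENT);
        int status = conn.getResponseCode();
        if (status != HttpURLConnection.HTTP_OK) {
            throw new IOException("Server returned HTTP " + status + " for " + url);
        }
        return conn.getInputStream();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java BrowserLikeFetch <url>");
            return;
        }
        try (InputStream in = openStream(args[0])) {
            System.out.println("Read " + in.readAllBytes().length + " bytes");
        }
    }
}
```

The stream returned by openStream could then be handed to whichever parser entry point accepts an InputStream rather than a URL, so the parser never has to make the HTTP request on its own.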

Luc Pattyn [Forum Guidelines] [My Articles] Nil Volentibus Arduum

Please use <PRE> tags for code snippets, they preserve indentation, improve readability, and make me actually look at the code.

Aljaz111

The code is like this:

CleanerProperties props = new CleanerProperties();
HtmlCleaner test=new HtmlCleaner();
test.clean(new URL("http://www.imdb.com/find?s=all&q=burek"));

In C# I have no problems either, but in Java I get the errors I specified. Is there any other parser that would be useful for IMDB? Thanks

Luc Pattyn

My C# code doesn't work for that URL, i.e. it seems to return only half an HTML header and no body; there is a link tag though. My FF browser works, however its "view page source" shows exactly the same stuff my C# app does. I'm puzzled by the link tag.

The "canonical" value is unknown in here!!! There are Google hits about it though... :)


Aljaz111

I used another way of parsing it. XML serialization doesn't work with IMDB, so I am doing it with the TagNodes that HTMLParser supports, and it's quite easy! Maybe you know how to replace this special character that I am getting:

"""

because it doesn't work with replace?! Thanks
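The thread shows neither the replace call nor the exact character, so the following is only a guess at a common cause. It assumes the text contains the literal HTML entity &quot; and that the result of replace was not assigned back. Java strings are immutable, so String.replace returns a new string and leaves the original untouched. The class name QuotEntity and the sample title are hypothetical.

```java
class QuotEntity {
    // Returns a copy of the text with the literal entity &quot; turned into a quote character.
    static String unescapeQuot(String text) {
        return text.replace("&quot;", "\"");
    }

    public static void main(String[] args) {
        String title = "&quot;Burek&quot;";   // hypothetical sample, not real IMDB output

        title.replace("&quot;", "\"");        // result is discarded, so title is unchanged
        System.out.println(title);            // prints &quot;Burek&quot;

        title = unescapeQuot(title);          // result is assigned back
        System.out.println(title);            // prints "Burek"
    }
}
```

If the text contains other entities as well, a general HTML-unescaping utility would probably be a better fit than replacing them one by one.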

modified on Monday, March 14, 2011 11:42 PM

all_in_flames

I would hazard a guess that the 403 Forbidden error is the result of IMDB not allowing their web interfaces to be used as a web service (querying for data directly without viewing the content on their site, including the all-important advertising :)). They likely accomplish this with a bizarre browser behaviour trick, as you and Luc seem to have seen with the strange canonical link tag. You may want to look into whether IMDB hosts a query interface for applications, but if they do, it's likely a premium service (AKA a paid service). Cheers!