Html parser

    Aljaz111
    wrote on last edited by
    #1

    I am looking for an HTML parser that can output XML from an InputStream (IMDB search results), or at least parse the HTML into structures that I can filter by tag. I tried HtmlCleaner, but it doesn't work with the IMDB site. I get this error:

    Exception in thread "main" java.io.IOException: Server returned HTTP response code: 403 for URL: http://www.imdb.com/
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1436)
    at java.net.URL.openStream(URL.java:1010)
    at org.htmlcleaner.Utils.getCharsetFromContent(Utils.java:121)
    at org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:299)
    at org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:317)
    at Main.main(Main.java:25)

    I also tried HTMLParser, but I can't get correct data with it. If anyone has experience with parsing IMDB HTML, I would be really thankful for any kind of help. Thanks

      Manfred Rudolf Bihy
      wrote on last edited by
      #2

      Since you haven't shown any code, helping you seems futile, but I'm sure you have checked the meaning of the HTTP status code 403: http://en.wikipedia.org/wiki/HTTP_403. Just a well-meant hint. Cheers!

        Luc Pattyn
        wrote on last edited by
        #3

        Hi, 403 means "Forbidden". The server can return it for many reasons, but the net result is that you aren't getting any data. So the parsing is not at fault; the problem is the way you request the web page. I tried http://www.imdb.com with my existing C# program and it loads fine. One thing I clearly remember doing after some sporadic failures is providing a realistic "user agent", a string describing the client's characteristics and capabilities. I use "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.19) Gecko/2010031422 Firefox/3.0.17", which is what Firefox emitted at the time. I suggest you figure out where and how to specify such a user agent in your code. :)
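For the Java side, a minimal sketch of that suggestion using only the standard library might look like the following. The class and method names (`UserAgentFetch`, `open`, `fetch`) and the timeout values are hypothetical choices for illustration, and there is no guarantee that the server accepts this particular user-agent string today.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper: opens an HTTP connection that identifies itself with a
// browser-like User-Agent instead of Java's default "Java/<version>" string.
class UserAgentFetch {
    // The user-agent string quoted in this thread; any realistic browser string could be used.
    static final String BROWSER_UA =
        "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.19) Gecko/2010031422 Firefox/3.0.17";

    // Prepares (but does not yet send) a request carrying the given User-Agent header.
    static HttpURLConnection open(String url, String userAgent) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestProperty("User-Agent", userAgent);
        conn.setConnectTimeout(10_000); // assumed timeouts, in milliseconds
        conn.setReadTimeout(10_000);
        return conn;
    }

    // Sends the request and returns the body stream, failing early on any non-200 status.
    static InputStream fetch(String url) throws IOException {
        HttpURLConnection conn = open(url, BROWSER_UA);
        int status = conn.getResponseCode();
        if (status != HttpURLConnection.HTTP_OK) {
            throw new IOException("Unexpected HTTP status " + status + " for " + url);
        }
        return conn.getInputStream();
    }
}
```

The stream returned by `fetch` could then be handed to the parser in place of the URL, for example through HtmlCleaner's `clean(InputStream)` overload, so that the parser never opens its own connection with the default user agent.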

        Luc Pattyn [Forum Guidelines] [My Articles] Nil Volentibus Arduum

        Please use <PRE> tags for code snippets, they preserve indentation, improve readability, and make me actually look at the code.

          Aljaz111
          wrote on last edited by
          #4

          The code is like this:

          CleanerProperties props = new CleanerProperties();
          HtmlCleaner test=new HtmlCleaner();
          test.clean(new URL("http://www.imdb.com/find?s=all&q=burek"));

          In C# I have no problems either, but in Java I get the errors I posted above. Is there any other parser that would work for IMDB? Thanks

            Luc Pattyn
            wrote on last edited by
            #5

            My C# code doesn't work for that URL: it seems to return only half an HTML header and no body, although there is a link tag. My Firefox browser displays the page, yet its "view page source" shows exactly the same content my C# app receives. I'm puzzled by the link tag.

            <link rel="canonical" href="http://www.imdb.com/find?s=all&amp;q=burek" />

            The "canonical" value is unknown in the reference I consulted! There are Google hits about it, though... :)


              Aljaz111
              wrote on last edited by
              #6

              I ended up parsing it another way. XML serialization doesn't work with IMDB, so I am using the TagNodes that HTMLParser supports, and it's quite easy! Maybe you know how to replace this special character that I am getting:

              """

              because replace doesn't work on it?! Thanks

              modified on Monday, March 14, 2011 11:42 PM
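The post does not show the failing call, so the following is only a guess at one common cause: Java strings are immutable, so `String.replace` returns a new string and leaves the original untouched unless the result is assigned. The sketch also assumes that the character in question arrives as the HTML entity `&quot;`; the class and method names are hypothetical.

```java
// Hypothetical helper illustrating that the result of replace() must be kept.
class QuotEntity {
    // Replaces the HTML entity &quot; with a literal double quote.
    // Assumption: the parser hands back the entity text rather than the decoded character.
    static String unescapeQuotes(String text) {
        // replace() does not modify 'text'; it returns a new String.
        return text.replace("&quot;", "\"");
    }

    // Shows the pitfall: calling replace() and discarding the return value changes nothing.
    static String forgetsResult(String text) {
        text.replace("&quot;", "\""); // result thrown away
        return text;                  // still contains &quot;
    }
}
```

If the character turns out to be something else (a typographic quote, for instance), the same pattern applies with a different search string.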

                all_in_flames
                wrote on last edited by
                #7

                I would hazard a guess that the 403 Forbidden error is the result of IMDB not allowing its web pages to be used as a web service (querying for data directly without viewing the content on their site, including the all-important advertising :)). They likely accomplish this with a bizarre browser behaviour trick, as you and Luc seem to have seen with the strange canonical link tag. You may want to check whether IMDB hosts a query interface for applications, but if they do, it's likely a premium (paid) service. Cheers!
