Load corrupted HTML into XML document

    Lutoslaw wrote (#1):

    I'd like to extract specific data from an HTML web page. I have written an XPath expression to do it. The problem is that the HTML is malformed, so XmlDocument throws an XmlException. How can I make it work like a browser, that is, ignore errors and continue loading? A free HTML cleaning library might help, but I couldn't find anything useful. Any help appreciated.

    Greetings - Gajatko Portable.NET is part of DotGNU, a project to build a complete Free Software replacement for .NET - a system that truly belongs to the developers.

      RaviBee wrote (#2):

      Perhaps Tidy .NET may help? Or you could use my StringParser object to scrape the data. /ravi

      My new year resolution: 2048 x 1536 Home | Articles | My .NET bits | Freeware ravib(at)ravib(dot)com

        Bruce Duncan wrote (#3):

        You might try using the HTML Agility Pack. It's worked reasonably well for me in the past.

        "Walking on water and developing software from a specification are easy if both are frozen."
        - Edward V. Berard
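
For readers following the HTML Agility Pack suggestion, here is a minimal sketch of running an XPath query over malformed HTML. It assumes the `HtmlAgilityPack` NuGet package is referenced. The class name `LenientHtml`, the method `SelectText`, and the sample markup are made up for illustration and do not come from the thread.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party library, installed from NuGet

static class LenientHtml
{
    // Parses possibly malformed HTML and returns the trimmed text of
    // every node matched by the given XPath expression.
    public static List<string> SelectText(string html, string xpath)
    {
        var doc = new HtmlDocument();
        // Unlike XmlDocument.LoadXml, LoadHtml tolerates unclosed tags,
        // unquoted attributes, and HTML-only entities such as &nbsp;.
        doc.LoadHtml(html);

        var results = new List<string>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            // SelectNodes returns null (not an empty collection) on no match.
            return results;
        }

        foreach (HtmlNode node in nodes)
        {
            // DeEntitize turns entities like &amp; back into characters.
            results.Add(HtmlEntity.DeEntitize(node.InnerText).Trim());
        }
        return results;
    }
}
```

A call such as `LenientHtml.SelectText("<div class=def><b>do</b><br>make&nbsp;or perform</div>", "//div[@class='def']/b")` would be one way to try it on input that XmlDocument rejects. The null check matters in practice, because iterating the result of `SelectNodes` directly fails when nothing matches.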

          Lutoslaw wrote (#4):

          Thank you. It works fine. I have another question. I want to make a simple word translator using an existing online dictionary (for my own use at home). The dictionary's home page is http://www2.ling.pl. The home page can be read successfully. However, ling.pl has a nice feature: you can look a word up by typing it after the "/". For example, http://www2.ling.pl/do navigates straight to the definition of "do". Unfortunately,

          HttpWebResponse response = (HttpWebResponse)request.GetResponse();

          throws a 404 error. Any ideas how to fix that?
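
The thread never settles what caused the 404, so the following is only a diagnostic sketch, not a confirmed fix. It rests on two facts about `HttpWebRequest`: it sends no User-Agent header unless one is set, and `GetResponse` throws a `WebException` for HTTP error statuses while still exposing the server's reply through `WebException.Response`. Whether ling.pl cares about request headers is an assumption. The class name `DictionaryFetcher`, its methods, and the User-Agent string are hypothetical.

```csharp
using System;
using System.IO;
using System.Net;

static class DictionaryFetcher
{
    // Builds a GET request that carries browser-like headers.
    // The User-Agent value is an arbitrary placeholder.
    public static HttpWebRequest CreateRequest(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        request.UserAgent = "Mozilla/5.0 (compatible; HomeTranslator/1.0)";
        request.Accept = "text/html";
        return request;
    }

    // Returns the response body and reports the HTTP status. For error
    // statuses such as 404 the error page is returned instead of throwing,
    // so the caller can inspect what the server actually sent.
    public static string Fetch(string url, out HttpStatusCode status)
    {
        HttpWebResponse response;
        try
        {
            response = (HttpWebResponse)CreateRequest(url).GetResponse();
        }
        catch (WebException ex) when (ex.Response is HttpWebResponse)
        {
            // The server answered, but with an error status code.
            response = (HttpWebResponse)ex.Response;
        }

        using (response)
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            // StreamReader assumes UTF-8 here; the site may use another charset.
            status = response.StatusCode;
            return reader.ReadToEnd();
        }
    }
}
```

Calling `DictionaryFetcher.Fetch("http://www2.ling.pl/do", out status)` and looking at both `status` and the returned body may show whether the server sends usable content alongside the 404 or rejects the request outright.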

            Lutoslaw wrote (#5):

            Thanks for answering my post. I tried both Tidy .NET and Tidy COM, but neither satisfied me. I prefer the HTML Agility Pack suggested by Bruce.
