Load corrupted HTML into XML document

Lutoslaw

I'd like to extract specific data from an HTML web page. I have written an XPath expression to do it for me. The problem is that the HTML page is malformed, so XmlDocument throws an XmlException at me. How can I make it work like a browser: ignore errors and continue loading? A free HTML-cleaning library might help, but I couldn't find anything useful. Any help appreciated.

Greetings - Gajatko

Portable.NET is part of DotGNU, a project to build a complete Free Software replacement for .NET - a system that truly belongs to the developers.

RaviBee

Perhaps Tidy .NET may help? Or you could use my StringParser object to scrape the data.

/ravi

My new year resolution: 2048 x 1536
ravib(at)ravib(dot)com

Bruce Duncan

You might try using the HTML Agility Pack. It's worked reasonably well for me in the past.
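For anyone finding this thread later, here is a minimal sketch of how the HTML Agility Pack can load malformed markup and still answer an XPath query. The sample HTML string and the XPath expression are made up for illustration; the original poster's page and expression are not shown in the thread. It assumes the HtmlAgilityPack package is referenced by the project.

```csharp
using System;
using HtmlAgilityPack; // third-party package, not part of the .NET base class library

class AgilityPackSketch
{
    static void Main()
    {
        // Hypothetical malformed markup: unclosed <p> and <b> tags and an
        // unquoted attribute value. XmlDocument.LoadXml would reject this.
        string html = "<html><body><p>First<p>Second <b>bold"
                    + "<a href=page.html>link</a></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant parser: builds a tree instead of throwing

        // Illustrative XPath only; substitute your own expression.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (links == null)
        {
            Console.WriteLine("No matching nodes.");
            return;
        }

        foreach (HtmlNode link in links)
        {
            Console.WriteLine(link.GetAttributeValue("href", "") + " -> " + link.InnerText);
        }
    }
}
```

The null check matters in practice: a query with no matches would otherwise fail in the foreach loop with a NullReferenceException.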

"Walking on water and developing software from a specification are easy if both are frozen."
- Edward V. Berard

Lutoslaw

Thank you. It works fine. I have another question. I want to make a simple word translator using an existing online dictionary (for my home use). The dictionary's homepage is http://www2.ling.pl. The home page can be read successfully. However, ling.pl has a nice feature: you can access the dictionary by typing a word after "/". For example, http://www2.ling.pl/do would navigate straight to the definition of the word "do". Unfortunately,

HttpWebResponse response = (HttpWebResponse)request.GetResponse();

throws a 404 error. Any ideas how to fix that?
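A possible way to narrow this down, sketched under assumptions: the thread does not establish why the server answers 404, so the code below only shows how to inspect the failed response. The User-Agent value is a made-up example, and the idea that the server treats requests without browser-like headers differently is a guess, not something confirmed here.

```csharp
using System;
using System.IO;
using System.Net;

class NotFoundDiagnosticSketch
{
    static void Main()
    {
        var request = (HttpWebRequest)WebRequest.Create("http://www2.ling.pl/do");

        // Assumption: some servers respond differently when no User-Agent is sent.
        // The value below is a hypothetical example string.
        request.UserAgent = "Mozilla/5.0 (compatible; ExampleClient/1.0)";

        try
        {
            using (var response = (HttpWebResponse)request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                string html = reader.ReadToEnd();
                Console.WriteLine("Received " + html.Length + " characters.");
            }
        }
        catch (WebException ex)
        {
            // For HTTP-level errors such as 404, the server's reply is still
            // available through WebException.Response.
            var errorResponse = ex.Response as HttpWebResponse;
            if (errorResponse == null)
            {
                // No HTTP reply at all (DNS failure, timeout, and so on).
                Console.WriteLine("Request failed: " + ex.Status);
                return;
            }

            Console.WriteLine("HTTP status: " + (int)errorResponse.StatusCode);
            using (var reader = new StreamReader(errorResponse.GetResponseStream()))
            {
                // The body sent with the error status may show what the server expected.
                string body = reader.ReadToEnd();
                Console.WriteLine("Error body length: " + body.Length);
            }
        }
    }
}
```

If the status stays at 404 with the header set, comparing the request against what a browser sends (cookies, redirects, other headers) would be the next thing to check.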

Greetings - Gajatko

Lutoslaw

Thanks for answering my post. I tried both Tidy .NET and Tidy COM, but neither satisfied me. I prefer the HTML Agility Pack suggested by Bruce.

Greetings - Gajatko