Repairing broken XML
-
I have a 3rd-party app that generates some "almost XML" files that I need to parse. It has elements similar to the following:
<color
name = "Black"
colorspace = "CMYK"
cyan = 0.000000
magenta = 0.000000
yellow = 0.000000
black = 100.000000
/>
Notice that the attributes with numeric values aren't quoted as they should be. There are also a few empty elements that appear as
<data   >
(the element name and three spaces), although it's an empty element and should be
<data />
These two deviations from true XML are making it impossible for me to simply load the XML into an XMLDocument so that I can easily access the elements I need. I don't normally work with XML a whole lot.
I was wondering if anyone knows of any "simple" methods or an existing library that can correct these errors in the XML as it's read from the file. I think I can deal with the empty element problem pretty easily with a simple search/replace, as it seems there's only one element in the file that's ever munged this way. The missing quotes problem is much bigger, as 99% of the numeric attributes are broken, in all elements.
TIA for any help with this.
Grim
(aka Toby)
MCDBA, MCSD, MCP+SB
SELECT * FROM user WHERE clue IS NOT NULL GO
(0 row(s) affected)
-
Hi, the library you need will (hopefully) be my school work :) Meanwhile, you can check HTML Tidy (http://www.w3.org/People/Raggett/tidy/). It has some XML support.
Best regards,
David 'DNH' Nohejl
Never forget: "Stay kul and happy" (I.A.)
-
Thanks, David. I took a look at Tidy, but since it's specific to HTML and won't process a file with unknown tags, it won't work for me in its existing incarnation. The source code, however, will give me some good insight into how to parse the XML and correct it myself on-the-fly.
Grim
(aka Toby)
MCDBA, MCSD, MCP+SB
SELECT * FROM user WHERE clue IS NOT NULL GO
(0 row(s) affected)
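For anyone taking the same "correct it myself" route, here is a minimal sketch of a hand-rolled scanner. It is not taken from Tidy's source, and the thread does not name a language, so Python and the function name quote_bare_values are assumptions made for illustration. It walks the text one character at a time, only looks at text between < and >, copies already-quoted values untouched, and wraps any bare value after an = in double quotes. It handles only the quoting problem, not the padded <data   > element.

```python
import xml.etree.ElementTree as ET


def quote_bare_values(text: str) -> str:
    """Wrap unquoted attribute values in double quotes, touching only text inside tags."""
    out = []
    i, n = 0, len(text)
    in_tag = False
    while i < n:
        ch = text[i]
        if not in_tag:
            # Outside a tag: copy everything, and note when a tag starts.
            in_tag = ch == "<"
            out.append(ch)
            i += 1
        elif ch in "\"'":
            # Already-quoted value: copy it verbatim, including both quotes.
            end = text.find(ch, i + 1)
            end = n - 1 if end == -1 else end
            out.append(text[i:end + 1])
            i = end + 1
        elif ch == "=":
            out.append(ch)
            i += 1
            # Keep any whitespace between '=' and the value.
            while i < n and text[i].isspace():
                out.append(text[i])
                i += 1
            if i < n and text[i] not in "\"'":
                # Bare value: read up to whitespace, '/' or '>' and quote it.
                start = i
                while i < n and not text[i].isspace() and text[i] not in "/>":
                    i += 1
                out.append('"' + text[start:i] + '"')
        else:
            # Any other character inside a tag; '>' closes the tag.
            in_tag = ch != ">"
            out.append(ch)
            i += 1
    return "".join(out)


if __name__ == "__main__":
    broken = '<color\nname = "Black"\ncyan = 0.000000\nblack = 100.000000\n/>'
    element = ET.fromstring(quote_bare_values(broken))
    print(element.attrib)
```

Because the scanner tracks whether it is inside a tag and inside quotes, it leaves element text and quoted values alone. It treats comments and CDATA sections like ordinary tags, so a file that uses those would need extra handling.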
-
I would think that a regular expression would be the best way to fix this.
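As a rough illustration of that suggestion, here is a minimal sketch in Python. The language is an assumption, since the thread does not name one, and the names UNQUOTED_NUMBER, PADDED_EMPTY_TAG and repair are made up for this example. One pattern wraps bare numeric attribute values in quotes, and a second turns the padded <data   > element into <data />. The repaired text is then handed to a normal XML parser.

```python
import re
import xml.etree.ElementTree as ET

# An attribute name, '=', and a bare number such as 0.000000 or -12.
# The lookahead requires whitespace, '/' or '>' after the number.
# Quoted values like name = "Black" do not match, because a quote
# rather than a digit follows the '='.
UNQUOTED_NUMBER = re.compile(r'([\w:.-]+\s*=\s*)(-?\d+(?:\.\d+)?)(?=[\s/>])')

# The one element the original post describes as padded with spaces
# instead of being closed: "<data   >".
PADDED_EMPTY_TAG = re.compile(r'<(data)\s+>')


def repair(text: str) -> str:
    """Return text with bare numeric attributes quoted and <data   > self-closed."""
    text = UNQUOTED_NUMBER.sub(r'\1"\2"', text)
    text = PADDED_EMPTY_TAG.sub(r'<\1 />', text)
    return text


if __name__ == "__main__":
    broken = (
        '<colors>\n'
        '<color\n'
        'name = "Black"\n'
        'colorspace = "CMYK"\n'
        'cyan = 0.000000\n'
        'black = 100.000000\n'
        '/>\n'
        '<data   >\n'
        '</colors>'
    )
    root = ET.fromstring(repair(broken))
    print(root.find("color").attrib)
```

The patterns are applied to the whole text, not just to tags, so something like "x = 5" inside element text or inside a quoted value would also be rewritten. That is acceptable only if the files never contain such text. Numbers in other formats, such as exponent notation, are left as they were.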
"When the only tool you have is a hammer, a sore thumb you will have."