Character encoding when reading a text file of unknown encoding, to create an XML file
-
Hi guys, I'm working on a project where I need to parse data coming from flat files into XML. All was working well until I was sent a CSV file which contained a › character (NB: this is not a > (62), but › (155)). When this is written to my XML, it is displayed as a square box. I'd assumed that if I set the stream reader to automatically determine the file's encoding, and set the XML writer to use Unicode, all character conversions would take place automatically, but it seems that either that's not the case or I've missed something. Does anyone know what about the following code could cause extended-character-set characters not to be converted correctly?
public XmlDocument ToXmlDocument(string filename, bool headerRow)
{
    string tempfile = System.IO.Path.GetTempFileName();
    XmlTextWriter writer = new XmlTextWriter(tempfile, Encoding.Unicode);
    writer.WriteStartDocument(false);
    writer.WriteStartElement(XML_ROOT_ELEMENT);
    using (TextReader reader = new StreamReader(filename, true))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (headerRow)
            {
                headerRow = false;
            }
            else
            {
                ProcessCsvRow(line, ref writer);
            }
        }
    }
    writer.WriteEndElement();
    writer.WriteEndDocument();
    writer.Close();

    XmlDocument result = new XmlDocument();
    result.Load(tempfile);
    System.IO.File.Delete(tempfile);
    return result;
}
Thanks in advance, JB -
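What JB describes can be reproduced at the byte level. The behavior is a property of the encodings themselves rather than of .NET, so here is a minimal sketch in Python: byte 155 (0x9B) is the single right-pointing angle quotation mark in Windows-1252, but a lone 0x9B is not valid UTF-8, so a UTF-8 reader substitutes the Unicode replacement character U+FFFD, which typically renders as a square box.

```python
# A single byte 0x9B, as it would appear in a Windows-1252 CSV file.
raw = bytes([0x9B])

# Decoded with the correct source encoding, it is the '›' character (U+203A).
as_cp1252 = raw.decode("cp1252")

# Decoded as UTF-8, 0x9B is an orphaned continuation byte, so a lenient
# reader replaces it with U+FFFD -- the "square box" JB is seeing.
as_utf8 = raw.decode("utf-8", errors="replace")

print(as_cp1252)  # ›
print(repr(as_utf8))  # '\ufffd'
```

A strict UTF-8 decode (`raw.decode("utf-8")`) would raise an error instead; .NET's `StreamReader` takes the lenient, replace-with-U+FFFD route.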
Hi,
Inverso1 wrote:
using (TextReader reader = new StreamReader(filename,true))
AFAIK a StreamReader constructed without an explicit encoding will: 1. look for a byte order mark (BOM) indicating the stream holds UTF-16/UTF-8 text; 2. lacking that, fall back to UTF-8 — not, as you might expect, your system's default "code page" (e.g. 1252 in Western Europe). As you said, 155 would be the "single right-pointing angle quotation mark" in extended ANSI (see e.g. here[^]), and it maps the same way in Windows code page 1252 (see here[^]). But a lone byte 155 is not valid UTF-8, so the reader replaces it with the Unicode replacement character, which renders as a square box. I suggest: 1. you have a look at the file using Notepad; 2. you check your code page (I don't know how by heart!); 3. you explicitly set the encoding when opening the stream, assuming you know it to be constant. And of course you'll risk getting in trouble again when dealing with 8-bit text files from different origins; that was, after all, the reason they invented Unicode. :)
Luc Pattyn [Forum Guidelines] [Why QA sucks] [My Articles] Nil Volentibus Arduum
Please use <PRE> tags for code snippets, they preserve indentation, and improve readability.
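Luc's step 1 (the BOM check a reader performs before falling back to a default) can be sketched as follows. This is a Python illustration of the general technique, not the .NET implementation; note that the UTF-32 little-endian BOM (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE), so UTF-32 must be tested first.

```python
import codecs

def sniff_bom(path):
    """Return the encoding implied by a leading byte order mark, or None."""
    with open(path, "rb") as f:
        head = f.read(4)  # the longest BOM (UTF-32) is 4 bytes
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # Check UTF-32 before UTF-16: the UTF-32 LE BOM starts with the UTF-16 LE BOM.
    if head.startswith(codecs.BOM_UTF32_LE) or head.startswith(codecs.BOM_UTF32_BE):
        return "utf-32"
    if head.startswith(codecs.BOM_UTF16_LE) or head.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"
    return None  # no BOM: the reader has to fall back on a guess
```

A file with no BOM — typical for 8-bit code-page text like JB's CSV — returns None, which is exactly the case where the fallback encoding matters.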
-
Hey Luc, thanks for your response; adding the encoding fixed the issue. I am a little worried about the possibility of other code pages, since the idea is that this system can take data from any source. I guess I'll have to configure it to allow different data sources (i.e. file locations) to have a custom code page specified. For people following this thread, you may be interested in the code below, which prints the system's default code page (as Luc mentioned, for me it's 1252).
Console.WriteLine(Encoding.Default.CodePage.ToString());
I've now modified one line of my code to read as follows, which seems to have fixed the issue:

using (TextReader reader = new StreamReader(filename, Encoding.Default, true))
It seems a bit strange that the reader doesn't use the system default by default, but hey, all's good. I've also tried putting UTF-8 and UTF-16 formatted text documents through, and these were also handled correctly. Thanks again, JB
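JB's final pipeline — decode the bytes with an explicit source encoding, then let the XML writer emit proper Unicode — can be sketched end to end. This is a Python illustration under assumed names ("root" and "row" stand in for XML_ROOT_ELEMENT and whatever ProcessCsvRow produces), not the actual .NET code:

```python
import xml.etree.ElementTree as ET

def csv_bytes_to_xml(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode CSV bytes with an explicit source encoding, then serialize
    as UTF-8 XML. Element names here are illustrative placeholders."""
    root = ET.Element("root")
    for line in raw.decode(source_encoding).splitlines():
        row = ET.SubElement(root, "row")
        row.text = line  # once decoded, the text is plain Unicode
    return ET.tostring(root, encoding="utf-8")

# The 0x9B byte from JB's file survives as a real '›' in the XML output,
# because it was decoded correctly before serialization.
xml = csv_bytes_to_xml(b"value\x9bmore")
print(xml)
```

The key point matches Luc's advice: once the bytes are decoded with the right code page, any output encoding (UTF-8, UTF-16) represents the character correctly, and the square box never appears.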