StreamWriter.WriteLine converting hex A0 to hex EF BF BD
-
In 2006, I wrote a C# app (using .NET 2.0 in VS 2005 Express) designed to add some PHP code to several HTML files. The HTML files are encoded as ISO 8859-1. The application worked fine until around Nov 2007, when some unknown change in my Windows XP installation resulted in hex A0 (&nbsp; in HTML) being converted to hex EF BF BD. I have since found info stating that an illegal character will be recoded as EF BF BD. However, A0 is a LEGAL character in ISO 8859-1 encoding. In trying to resolve this problem I have tried all of the StreamWriter class Encoding options. I have also completely uninstalled VS 2005 and all .NET SDKs and runtimes, and then installed the .NET 3.5 SDKs (which include the latest 2.0 SDK) and runtimes, as well as C# VS 2008. None of these actions resolved the problem. Any ideas on how to prevent this conversion from occurring? The code sequence which makes use of StreamWriter.WriteLine is as follows:
private void Process_Files()
{
    const bool OVERWRITE = true;
    const bool APPEND = true;
    ArrayList file_array = new ArrayList();
    string newString;
    bool found = false;

    foreach (string fileName in fileList)
    {
        Console.WriteLine("Processing: " + fileName + ".htm");

        // Add PHP authentication info to top of file
        FileInfo srcFile = new FileInfo("PHP-Auth.txt");
        srcFile.CopyTo(HTM_FILE_PATH + "temp.php", OVERWRITE);

        StreamReader htmFile = new StreamReader(HTM_FILE_PATH + fileName + ".htm");
        StreamWriter tmpFile = new StreamWriter(HTM_FILE_PATH + "temp.php", APPEND);

        // Read the entire .htm file into memory.
        while (!htmFile.EndOfStream)
        {
            file_array.Add(htmFile.ReadLine());
        }
        htmFile.Close();

        // Append a blank line to the end of the temp.php file.
        tmpFile.WriteLine();
        tmpFile.WriteLine();

        // Process each line of the .htm file.
        foreach (string line in file_array)
        {
            // Compare each name in the fileList to each line of this particular
            // .htm file.
            foreach (string name in fileList)
            {
                // If this line of the file contains one of the filenames in "fileList"
Have you tried simply specifying the encoding when you open the reader and writer?
Encoding isoWesternEuropean = Encoding.GetEncoding(28591);
StreamReader htmFile = new StreamReader(HTM_FILE_PATH + fileName + ".htm", isoWesternEuropean);
StreamWriter tmpFile = new StreamWriter(HTM_FILE_PATH + "temp.php", APPEND, isoWesternEuropean);

If that doesn't work, the file that you are reading probably contains a byte order mark (BOM), which overrides the encoding you specify. You can check this by adding the .bin extension to the file and opening it in Visual Studio to examine the actual binary data in the file. If it contains a BOM, the file is broken, because the BOM describes a decoding that doesn't match how the file was actually encoded. To fix this you would either have to use or write a program that removes the BOM, or open the file as a binary stream so that you can read past the BOM before you start reading the stream with the StreamReader.
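To see where EF BF BD comes from, here is a minimal in-memory sketch (not the original application). It relies on StreamReader and StreamWriter defaulting to UTF-8 when no encoding is given, so the default case is modelled as a UTF-8 decode followed by a UTF-8 encode. The names EncodingDemo, Transcode and StartsWithUtf8Bom are made up for illustration.

```csharp
using System;
using System.Text;

static class EncodingDemo
{
    // Decode bytes with one encoding and re-encode with another,
    // which is what a StreamReader/StreamWriter pair does to file content.
    public static byte[] Transcode(byte[] input, Encoding readAs, Encoding writeAs)
    {
        string text = readAs.GetString(input);
        return writeAs.GetBytes(text);
    }

    // True if the data starts with the UTF-8 byte order mark EF BB BF.
    public static bool StartsWithUtf8Bom(byte[] data)
    {
        return data.Length >= 3
            && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF;
    }

    public static void Main()
    {
        // "A", no-break space, "B" as stored in an ISO 8859-1 file.
        byte[] source = { 0x41, 0xA0, 0x42 };
        Encoding latin1 = Encoding.GetEncoding(28591);

        // No encoding specified: a lone A0 byte is not valid UTF-8, so it is
        // decoded as U+FFFD and then written out as EF BF BD.
        byte[] broken = Transcode(source, Encoding.UTF8, Encoding.UTF8);
        Console.WriteLine(BitConverter.ToString(broken));   // 41-EF-BF-BD-42

        // ISO 8859-1 on both sides: the byte survives unchanged.
        byte[] intact = Transcode(source, latin1, latin1);
        Console.WriteLine(BitConverter.ToString(intact));   // 41-A0-42

        // Quick BOM check without a hex viewer.
        Console.WriteLine(StartsWithUtf8Bom(source));       // False
    }
}
```

For a real file, File.ReadAllBytes would supply the bytes for the BOM check. StreamReader also has a constructor overload taking a third bool argument, detectEncodingFromByteOrderMarks, which can be set to false if the reader should not switch encodings when it sees a BOM.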
Despite everything, the person most likely to be fooling you next is yourself.
That did the trick. Thanks very much. I had tried all of the obvious encodings, but there is no obvious encoding for ISO Western.
1. How did you know that ISO 8859-1 was represented by 28591 (in .NET) (i.e., where would I have looked to find such information)?
2. The fact that you were able to respond with an answer to this type of question conveys to me that you know a fair amount about programming. What kinds of things did you do to get to the understanding of programming that you have today?
Thanks again and have a great day!!!
Mike Bluett wrote:
1. How did you know that ISO 8859-1 was represented by 28591 (in .NET) (i.e., where would I have looked to find such information)?
In the documentation, on the page about the Encoding class, there is a list of encodings: MSDN Library: Encoding class
Mike Bluett wrote:
2. The fact that you were able to respond with an answer to this type of question conveys to me that you know a fair amount about programming. What kinds of things did you do to get to the understanding of programming that you have today?
Well, I did a lot of programming. :) I have used many different programming languages on several different platforms. It helps to have done a bit of machine-level programming, so that you know what really happens below the surface. Also, in recent years I have been hanging out a lot in forums like this, helping people. You learn a lot from that. :)
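Coming back to the first question, the code page number can also be found from code rather than from the documentation. The following is only a sketch: the class name EncodingLookup and the helper CodePageOf are made up for illustration, while Encoding.GetEncoding and Encoding.GetEncodings are the framework calls doing the work.

```csharp
using System;
using System.Text;

static class EncodingLookup
{
    // Resolve an encoding by its registered name and return its code page.
    public static int CodePageOf(string name)
    {
        return Encoding.GetEncoding(name).CodePage;
    }

    public static void Main()
    {
        // Looking the encoding up by name avoids the magic number.
        Console.WriteLine(CodePageOf("iso-8859-1"));   // 28591

        // List every encoding the runtime reports, with code page and name.
        foreach (EncodingInfo info in Encoding.GetEncodings())
        {
            Console.WriteLine("{0,6}  {1,-20}  {2}",
                info.CodePage, info.Name, info.DisplayName);
        }
    }
}
```

The length of the list depends on the runtime and platform, so the loop is mainly useful for browsing what is available.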