Stream writing html

Mikke_x

Hi all! I have a weird problem and I can't understand why it happens. I have a couple of HTML pages that I read from and write to. When I read and write I use FileStream.ReadByte() and FileStream.WriteByte(). The problem is that when I view the file in IE after it is done, IE cannot recognise my Swedish characters (å, ä, ö). What really bothers me is that if I take the source of the page via IE, copy and paste it into a text file and give it the extension html, there are no problems at all. There are absolutely no visible differences between the files that work and the ones that do not; the content of the pages is the same, but still IE can read one and not the other. :doh: There has to be some difference in the non-visible content of the files, but how do I resolve the problem? I need to save the files, and in my current implementation I would like to use stream.WriteByte(), but I would also like to keep the Swedish characters. Please advise. / Mikke

Steven Campbell

It is usually inappropriate to write text (HTML) at the byte level. This is because modern Unicode strings do not have a simple 1 byte = 1 character mapping. What is probably happening is that the file you are reading from is in one text encoding (e.g. UTF-8), and the file you are writing to is in another (e.g. ASCII). Just read and write strings, and you should be fine. It may help to educate yourself on text encodings first, though. One of the better articles on this subject is from Joel on Software.
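To make the "1 byte is not 1 character" point concrete, here is a small sketch that counts the bytes the three Swedish characters occupy in a few encodings. The class and method names (`EncodingDemo`, `ByteCount`, `RoundTripAscii`) are made up for illustration; only the `System.Text.Encoding` calls come from the framework.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Number of bytes the given text occupies in the given encoding.
    public static int ByteCount(string text, Encoding encoding)
    {
        return encoding.GetByteCount(text);
    }

    // Encodes to ASCII and decodes again; characters ASCII cannot
    // represent are replaced with '?' by the default fallback.
    public static string RoundTripAscii(string text)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(text);
        return Encoding.ASCII.GetString(bytes);
    }

    public static void Main()
    {
        // "åäö" written with escapes so the source file's own
        // encoding cannot interfere with the demonstration.
        string text = "\u00E5\u00E4\u00F6";

        Console.WriteLine(ByteCount(text, Encoding.UTF8));     // 6
        Console.WriteLine(ByteCount(text, Encoding.Unicode));  // 6 (UTF-16)
        Console.WriteLine(RoundTripAscii(text));               // ???
    }
}
```

Three characters become six bytes in UTF-8, and they are lost entirely in ASCII, so byte-level code that assumes one byte per character can easily go wrong.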

Heath Stewart

Steven's right, but what you need to do is actually quite easy. Wrap your FileStream (or whatever Stream derivative you're using) in a StreamReader or StreamWriter, which makes reading and writing text much easier. You can specify the encoding, such as Encoding.Unicode (UTF-16) or Encoding.UTF8, the multi-byte encoding most common on the 'net, and with the `StreamReader` you can set a parameter in the constructor that attempts to detect the encoding by looking for a byte-order mark (BOM). For example:

```csharp
using (FileStream file = File.Open("somefile.html", FileMode.Open))
{
    string line = null;
    using (StreamReader reader = new StreamReader(file, Encoding.UTF8, true))
    {
        // It's bad to use reader.ReadToEnd - you might not have enough memory
        line = reader.ReadLine();
        // Do something with the 'line'
    }
}
```

The `using` blocks make sure the streams are closed (i.e., the native file handles are released - very important or your performance is hampered, or - worse yet - you have memory leaks). You should really read about the `Encoding` class in the .NET Framework SDK.

You could still read and write bytes, but you must understand that ANSI is a single-byte character set and may use different code pages. ASCII characters use only 7 bits in the byte (the first 128 values). The last 128 values can be different depending on the code page (local characters, or defined for particular systems). UTF-8 encodes ASCII characters as single bytes, but byte values in that upper range tell the decoder that the character spans more than one byte (two to four), thus supporting Unicode as well. This is a great encoding to use on legacy systems, especially if you're not sure what a text file will contain (although a legacy decoder might have problems). Unicode comes in many flavors, and you can read more on the web if you try a Google search or click "Search comments" above for previous discussions we've had about this.
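As a rough sketch of how the read and write sides might fit together, the following copies a text file line by line, decoding with one encoding and writing UTF-8 with a BOM. The `HtmlCopy` class, the `Copy` method, and the temp-file names are hypothetical, and ISO-8859-1 is only assumed as the source encoding for the demo; the real pages may be in something else.

```csharp
using System;
using System.IO;
using System.Text;

public static class HtmlCopy
{
    // Reads sourcePath as text (honoring a BOM if one is present,
    // otherwise falling back to sourceEncoding) and writes it to
    // targetPath as UTF-8 with a byte-order mark.
    public static void Copy(string sourcePath, string targetPath, Encoding sourceEncoding)
    {
        using (StreamReader reader = new StreamReader(sourcePath, sourceEncoding, true))
        using (StreamWriter writer = new StreamWriter(targetPath, false, new UTF8Encoding(true)))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                writer.WriteLine(line);
            }
        }
    }

    public static void Main()
    {
        // Assumed source encoding for this demo only.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        string original = "<p>\u00E5\u00E4\u00F6</p>";

        string source = Path.Combine(Path.GetTempPath(), "htmlcopy_source.html");
        string target = Path.Combine(Path.GetTempPath(), "htmlcopy_target.html");

        File.WriteAllText(source, original, latin1);
        Copy(source, target, latin1);

        // Reading the copy back as UTF-8 yields the same characters.
        string roundTrip = File.ReadAllText(target, Encoding.UTF8);
        Console.WriteLine(roundTrip.TrimEnd() == original); // True
    }
}
```

Writing the BOM is a deliberate choice here: it is one of the signals a reader (including `StreamReader` with detection enabled) can use to recognize the file as UTF-8. If the consuming tool dislikes BOMs, `new UTF8Encoding(false)` omits it.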
_This posting is provided "AS IS" with no warranties, and confers no rights._ Software Design Engineer Developer Division Sustained Engineering Microsoft [[My Articles](http://www.codeproject.com/script/articles/list_articles.asp?userid=46969)]

Mikke_x

Hi! Thanks a lot, both to Heath and Steven! Will try this first thing Monday morning! The Joel on Software text gave some interesting, and fun, reading. / Mikke