The Day I Learned to Deal With the BOM
-
Edited: The original Subject here was 'The Day I Learned to Hate the BOM'. After considering the first response, I have modified it to a more appropriate and less offensive title. :) I'm only explaining this to provide context for the first responder's grievance (and strike tags are not allowed in the subject).

I've been importing text files for over 20 years and I thought I'd seen it all: CSV lines enclosed in double quotes, CSV fields with commas (not enclosed in double quotes), implied decimal places, packed decimal format, subtotal lines, total lines, garbage lines, etc. What I hadn't seen before now are text files with a byte order mark (BOM) that doesn't seem to do anything significant except require dealing with. :omg: It has probably been an issue long before now, but 90% of the time the files have a header row, so it wouldn't have mattered.

For fun, or as an office joke, you can add a BOM to your own text files by changing the Encoding in Notepad.
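If you want to try that joke without opening Notepad, here is a minimal sketch in Python (my illustration, not part of the original post; the file name and sample row are made up). Python's 'utf-8-sig' codec writes the same three BOM bytes, 0xEF 0xBB 0xBF, that Notepad prepends when you save as UTF-8 with a BOM.

    # Sketch only: write a tiny CSV with a UTF-8 BOM in front of the data.
    # "prank.csv" is a placeholder name.
    with open("prank.csv", "w", encoding="utf-8-sig", newline="") as f:
        f.write("101,Widget,5.30\r\n")

    # Reading the raw bytes back shows the BOM sitting ahead of the first field.
    with open("prank.csv", "rb") as f:
        print(f.read(6))   # b'\xef\xbb\xbf101'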
"Go forth into the source" - Neal Morse "Hope is contagious"
-
Is that the same day you saw a BOM for the first time? What is wrong with the BOM? That you don't want to see it at all? That for this specific file it "isn't needed", because all it contains is 7-bit US-ASCII?

You may of course declare: "I handle 7-bit US-ASCII only, so don't disturb me with anything else - Unicode is not my business, even when the characters are limited to US-ASCII. Just bug off with anything else!" Fair enough, but then the software you produce is not for me, and most likely not for very many people outside the English-speaking world. You are most certainly expected to handle the BOM and Unicode/UTF-8 today. If you are handling CSV, you are supposed to understand that in major parts of the world it is five comma three, not five point three.

For the other stuff you mention: fair enough; that should be easy to handle. If you haven't yet discovered that there are lots of variations in CSV formats (e.g. they are certainly not always comma separated, as in cultures where the decimal separator is not the point), take a look at the import options in Excel: it has handled this stuff for at least 15 years - I'd guess a lot more. Accept reality as it is, not as if reality were the problem. And so on.

You are free to put up a sign at your door: "Warning: No non-American data is handled by our software!" Actually, I'd be happy to know that in advance, rather than discovering it after I have signed a contract.
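To make the "five comma three" point concrete, here is a minimal sketch (mine, not the poster's, assuming Python and made-up sample data): a semicolon-delimited CSV with comma decimals, the layout Excel produces in many locales where the comma is the decimal separator.

    import csv

    # Hypothetical sample: semicolon as the field separator, comma as the
    # decimal separator, point as the thousands separator.
    sample = ["Price;Quantity", "5,3;2", "1.234,56;10"]

    reader = csv.reader(sample, delimiter=";")
    next(reader)  # skip the header row
    for row in reader:
        # Drop the thousands separator and swap the decimal comma for a point
        # before converting; a fuller import might use locale.atof() instead.
        price = float(row[0].replace(".", "").replace(",", "."))
        quantity = int(row[1])
        print(price, quantity)   # 5.3 2, then 1234.56 10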
-
Thanks for the reply and for allowing me to realize that I had mis-titled the thread. I understand that the word 'Hate' was inappropriate for a number of reasons, firstly because I despise that word, and secondly because of your point about reality.

In my defense, the 'problem' had already been dealt with (code fixed, tested, compiled, tested, and deployed) for the two affected imports before I made that post. I was simply annoyed about spending an hour diagnosing and fixing something I'd never seen before... weird characters trashing the first element in a CSV. :confused:

Per your post, and now having had a crash course in the BOM, I had a few ways to deal with it:

0: Tell the customer that they would have to change the encoding (put up a sign). The customer gets these data files from other departments/systems, so changing the encoding would likely be a manual process. Not ideal at all.
1: Be aware that a text/CSV file might have those three weird bytes at the beginning. If so, disregard the BOM and move on.
2: Go into full research mode and discover all there is to know about the BOM and how it might be needed for all of my non-US customers. (current/future === none)

I chose option 1 (sketched at the end of this post). The reality is that I have to deal with it. The chance of getting a file from a US customer that is BOM-encoded, has no header row, and where the first element of the line matters has now been realized. I didn't account for the BOM and got 'bitten'. :laugh:

Anyhow, my customer is happy again. :) I've learned something new and improved my software. :) One of my favorite things about this profession is the constant learning and improving.
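For the curious, a minimal sketch of option 1, assuming Python and a placeholder file name (this is not the actual production code): the 'utf-8-sig' codec drops a leading UTF-8 BOM when one is present and is harmless when it isn't, so the first field of the first row comes through clean either way.

    import csv

    # "import.csv" is a placeholder; the real imports obviously differ.
    with open("import.csv", newline="", encoding="utf-8-sig") as f:
        for row in csv.reader(f):
            first_field = row[0]   # no stray '\ufeff' glued onto it
            print(first_field)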
"Go forth into the source" - Neal Morse "Hope is contagious"
-
"UTF-8 BOM considered harmful." :-D I'm assuming you're talking about UTF-8. Including a BOM with UTF-8 is generally unnecessary and frowned upon in most cases. I'll have to review my latest text file reader to see what I do with them -- one I wrote earlier this year would ignore a UTF-8 BOM if it was the first three bytes of a file. In general, a file should have only one encoding, but there is no requirement for that to be the case. If a file switches between encoding, then I believe a suitable BOM must be used to indicate that. But good luck finding a reader which will honor such an abomination. When this sort of thing is more likely to be a concern is when reading from a non-file-backed stream such as a socket -- but I would expect that other controls would be put in place to avoid confusion. Now I need to go create a CSV file which is UTF-8, but with a UTF-16 Word document as one of its values... ;P (muhahaha!)
-
The only thing to hate is hate itself.
-
I forget the exact code, but the byte order mark is easy to check for and skip programmatically. For UTF-8 it's a fixed three-byte sequence, 0xEF 0xBB 0xBF, at the start of the file (all three bytes happen to have the high bit set), so you can just check for that and, if you see it, skip those three bytes. It's half remembered from years ago, but the concept is simple enough: compare the leading bytes against the mark, and skip the mark if it matches.
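A minimal sketch of that check-and-skip, assuming Python (mine, not from the post). Matching the full three-byte signature matters: testing the high bit alone would also trip on ordinary multi-byte UTF-8 characters.

    UTF8_BOM = b"\xef\xbb\xbf"

    def skip_bom(data: bytes) -> bytes:
        # Skip the mark only when the exact signature is present.
        if data.startswith(UTF8_BOM):
            return data[len(UTF8_BOM):]
        return data

    # Example: the marked bytes decode cleanly once the BOM is gone.
    raw = b"\xef\xbb\xbfid,name\r\n1,widget\r\n"
    print(skip_bom(raw).decode("utf-8"))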
Check out my IoT graphics library here: https://honeythecodewitch.com/gfx And my IoT UI/User Experience library here: https://honeythecodewitch.com/uix