The Unicode BOM (or, what are these funny characters at the beginning of my file and how did they get there)
Nicolas Galler | January 27, 2008
Alright, this isn’t much, and is pretty old news, but it was pretty aggravating to look for it this week-end so I might as well jot it down for later.
Somehow last week I started getting some “Invalid Character” errors all over the place (or maybe I just started noticing them, I don’t know). Some came up in msbuild scripts, and some in JavaScript files. They just looked like 1 or 3 gibberish characters at the very beginning of the file. I kind of dismissed it because the errors went away after I opened the file in vim and re-saved it, but they came back with a vengeance last week-end when I found out they were the ultimate cause of my Django templates messing up. I really thought I had a bona fide Django bug there for a moment, but it was just copying the “BOM” (byte order mark) from my source files.
So what is this BOM anyway? Simply put, it is a few bytes inserted at the beginning of a (Unicode) file to help the computer determine how to read it.
It doesn’t really make sense for UTF-8 (the most common default encoding) because the byte order there is fixed! But for an encoding with two-byte units like UTF-16 you have to know whether the high byte of each pair comes first or last (there is a longer story to this “endianness” as they call it but let’s cut short here). ANYWAY, some programs will still put a BOM in a UTF-8 file, consisting of the 3 bytes 0xEF, 0xBB, 0xBF. And some programs will manage to choke on it. OR, it will have some strange effect when you try to do things to the files, like concatenating them, where the BOM of the second file ends up in the middle of the result.
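If you want to check whether a file starts with one of these marks, here is a minimal sketch in Python. The byte constants come from the standard library's codecs module; the `detect_bom` function and the `BOMS` table are names I made up for illustration, and the check deliberately ignores UTF-32 to keep things short.

```python
import codecs

# BOM byte sequences defined by the standard library.
# codecs.BOM_UTF8 is the 3 bytes 0xEF, 0xBB, 0xBF mentioned above.
# Note: UTF-32 is not handled here (its little-endian BOM starts with
# the same two bytes as the UTF-16 little-endian one).
BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 little-endian"),
    (codecs.BOM_UTF16_BE, "UTF-16 big-endian"),
]


def detect_bom(data: bytes):
    """Return the name of the encoding whose BOM starts `data`, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

Calling `detect_bom` on the first few bytes of a file opened in binary mode (for example `open(path, "rb").read(4)`) should be enough to tell whether one of the “gibberish characters” is really a BOM.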
Where do they come from? It seems like Notepad (not Notepad2) will insert them automatically depending on your encoding settings.
To get rid of them, use “:set nobomb” in vim and save, or in Notepad2 change the encoding to “UTF-8” (instead of “UTF-8 with signature”).
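If you have a whole pile of files to clean up, doing it by hand in an editor gets old fast. Here is a rough Python sketch that does the same thing as the editor fix for a UTF-8 file. The function names `strip_utf8_bom` and `strip_bom_in_place` are my own, and it assumes the file is small enough to read into memory in one go.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return `data` without a leading UTF-8 BOM, if there is one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_bom_in_place(path: str) -> bool:
    """Rewrite the file at `path` without its UTF-8 BOM.

    Returns True if a BOM was found and removed, False if the file
    was left untouched.
    """
    with open(path, "rb") as f:
        data = f.read()
    cleaned = strip_utf8_bom(data)
    if cleaned == data:
        return False
    with open(path, "wb") as f:
        f.write(cleaned)
    return True
```

If you only need to read such a file rather than fix it, Python's built-in “utf-8-sig” codec skips a leading BOM when decoding, so `open(path, encoding="utf-8-sig")` is probably the simpler route.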
The Wikipedia article on the byte order mark explains this in much more detail.





