Programming, technology, and CRM – from a Belgian programmer exiled to Missouri

The Unicode BOM (or, what are these funny characters at the beginning of my file and how did they get there)

Nicolas Galler | January 27, 2008

Alright, this isn’t much, and it is pretty old news, but it was aggravating enough to track down this weekend that I might as well jot it down for later.

Somehow last week I started getting “Invalid Character” errors all over the place (or maybe I just started noticing them, I don’t know). Some came up in msbuild scripts, and some in JavaScript files. They just looked like 1 or 3 gibberish characters at the very beginning of the file. I more or less dismissed them because the errors went away after I opened the file in vim and re-saved it, but they came back with a vengeance last weekend when I found out they were the ultimate cause of my Django templates messing up. For a moment I really thought I had a bona fide Django bug there, but Django was just copying the “BOM” (byte order mark) from my source files.

So what is this BOM anyway? Simply put, it is a few bytes inserted at the beginning of a (Unicode) file to help the computer determine how to read it.
It doesn’t really make sense for UTF-8 (the most common default encoding) because the byte order there is fixed! But for an encoding built on 16-bit units like UTF-16 you have to know whether the high byte of each unit comes first or last (there is a longer story to this “endianness”, as they call it, but let’s cut it short here). ANYWAY, some programs will still put a BOM in a UTF-8 file, consisting of the 3 bytes 0xEF, 0xBB, 0xBF. And some programs will manage to choke on it. OR, it will have some strange effect when you try to do things to the files, like concatenate them.
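To make the byte sequences above concrete, here is a minimal Python sketch that checks whether some bytes start with a BOM, and shows what happens when two BOM-prefixed files are glued together. The helper names (detect_bom, naive_concat) are made up for illustration; only the BOM constants come from the standard codecs module.

```python
import codecs

# BOM signatures as defined in the standard library's codecs module.
KNOWN_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # b'\xef\xbb\xbf'
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # b'\xff\xfe'
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # b'\xfe\xff'
]


def detect_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none.

    Hypothetical helper: it only looks at the first few bytes and does not
    try to guess the encoding of BOM-less data.
    """
    for bom, name in KNOWN_BOMS:
        if data.startswith(bom):
            return name
    return None


def naive_concat(parts):
    """Join byte strings as-is, the way a plain file concatenation would."""
    return b"".join(parts)


if __name__ == "__main__":
    first = codecs.BOM_UTF8 + b"var a = 1;\n"
    second = codecs.BOM_UTF8 + b"var b = 2;\n"
    print(detect_bom(first))
    # The second file's BOM ends up in the middle of the joined text,
    # where it decodes to the character U+FEFF.
    print(repr(naive_concat([first, second]).decode("utf-8")))
```

A tool that only skips a BOM at the very start of its input would still trip over the one left in the middle, which is probably the kind of strange effect concatenation produces.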

Where do they come from? It seems like Notepad (not Notepad2) will insert them automatically depending on your encoding settings.

To get rid of them, use “:set nobomb” in vim and save, or in Notepad2 change the encoding to “UTF-8” (instead of “UTF-8 with signature”).
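If there are too many files to fix by hand in an editor, the same cleanup can be scripted. This is only a rough sketch, assuming the files are UTF-8 and small enough to read into memory; strip_utf8_bom and strip_bom_in_file are hypothetical helper names, not part of any library.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (0xEF, 0xBB, 0xBF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_bom_in_file(path: str) -> bool:
    """Rewrite the file at path without its UTF-8 BOM.

    Returns True if a BOM was found and removed, False if the file was
    left untouched. Works on raw bytes, so nothing else in the file changes.
    """
    with open(path, "rb") as f:
        data = f.read()
    cleaned = strip_utf8_bom(data)
    if cleaned == data:
        return False
    with open(path, "wb") as f:
        f.write(cleaned)
    return True
```

When the goal is only to read such a file, Python's built-in “utf-8-sig” codec skips a leading BOM while decoding, so open(path, encoding="utf-8-sig") avoids rewriting anything.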

The Wikipedia article on the byte order mark explains this in much more detail.

Categories
Tricks