Annoying Unicode Byte Order Marks in UTF-8 files

A few times now I have run into an annoying problem: some files pick up extra invisible characters at the very beginning of the file.

It happened a few times when I was working with people from Germany who used different tools and different encodings, so I assumed it was an editor-related issue. The strange part was that the characters only showed up in the diff view of the Eclipse SVN plugin, not in the regular edit window.

Some time later I found out that there is a special way of hinting at the Unicode encoding: a few special bytes, the byte order mark (BOM), are added at the beginning of the file. Unicode text can be stored in many flavors, way too many: UTF-8, UTF-16 and UTF-32, the last two in both big-endian and little-endian variants. To figure out which is which and read the file properly, some editors write this invisible sequence at the top of the file.
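
If you want to see the marker for yourself, dumping the first few bytes of a file as hex is enough. This is only a minimal sketch: the helper name `leadingBytesHex` is something I made up for illustration, it is not a PHP built-in.

```php
<?php
// Return the first $count bytes of a file as a lowercase hex string.
// Hypothetical helper for illustration, not part of PHP itself.
function leadingBytesHex(string $path, int $count = 4): string
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    $bytes = fread($handle, $count);
    fclose($handle);

    // fread() returns false on failure; treat that as "no bytes read".
    return bin2hex($bytes === false ? '' : $bytes);
}
```

For a file saved as UTF-8 with a BOM, the hex string starts with `efbbbf`; a plain PHP file starts with `3c3f` (the `<?` of the opening tag).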

It's almost a good idea, except that it's not necessary for UTF-8. UTF-8 is read one byte at a time, so neither the processor architecture nor the operating system can mess up the byte order of a UTF-8 encoded file, and adding a BOM buys you nothing. But once you start working with people from different companies, copying XML files back and forth, exporting data from an MSSQL database or editing files with Windows tools, you will end up with some marked files for sure.

The problem I had was that PHP validation does not understand these marks and keeps throwing warnings, so we have to remove them manually.
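
Removing the mark does not have to be done by hand in an editor. Below is a rough sketch that strips only the UTF-8 BOM (the bytes EF BB BF); the function names `stripUtf8Bom` and `removeBomFromFile` are my own invention, and the code assumes the files are small enough to load into memory.

```php
<?php
// Remove a leading UTF-8 BOM (EF BB BF) from a string, if present.
// Hypothetical helper name, chosen for illustration.
function stripUtf8Bom(string $contents): string
{
    $bom = "\xEF\xBB\xBF";
    if (strncmp($contents, $bom, strlen($bom)) === 0) {
        return substr($contents, strlen($bom));
    }
    return $contents;
}

// Rewrite the file in place, but only when a BOM was actually found.
// Returns true if the file was changed, false if it was left alone.
function removeBomFromFile(string $path): bool
{
    $contents = file_get_contents($path);
    if ($contents === false) {
        throw new RuntimeException("Cannot read $path");
    }
    $clean = stripUtf8Bom($contents);
    if ($clean === $contents) {
        return false;
    }
    return file_put_contents($path, $clean) !== false;
}
```

Files without a BOM are never rewritten, so running this over a whole checkout should not produce spurious changes in version control.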

To find out if your files have BOMs, iterate through them and check the first few bytes. If they match one of the items in the array below, you have probably got yourself a BOM.

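    $sets = array(
        "\xFE\xFF",         // UTF-16, big-endian
        "\xFF\xFE",         // UTF-16, little-endian
        "\xEF\xBB\xBF",     // UTF-8
        "\x2B\x2F\x76",     // UTF-7
        "\xF7\x64\x4C",     // UTF-1
        "\x0E\xFE\xFF",     // SCSU
        "\xFB\xEE\x28",     // BOCU-1
        "\x00\x00\xFE\xFF", // UTF-32, big-endian
        "\xFF\xFE\x00\x00", // UTF-32, little-endian
        "\xDD\x73\x66\x73", // UTF-EBCDIC
    );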
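
The check itself could look roughly like this. It is a sketch, not a complete tool: the function name `detectBom` and the label strings are made up, and I only list the UTF-8, UTF-16 and UTF-32 signatures here to keep it short.

```php
<?php
// Return a label for the BOM a file starts with, or null if none matches.
// Hypothetical helper; the labels are only for display.
function detectBom(string $path): ?string
{
    // Longer signatures go first, so that UTF-32 LE (FF FE 00 00)
    // is not mistaken for UTF-16 LE (FF FE).
    $signatures = array(
        'UTF-32 BE' => "\x00\x00\xFE\xFF",
        'UTF-32 LE' => "\xFF\xFE\x00\x00",
        'UTF-8'     => "\xEF\xBB\xBF",
        'UTF-16 BE' => "\xFE\xFF",
        'UTF-16 LE' => "\xFF\xFE",
    );

    // Read at most the first four bytes; that covers every signature above.
    $head = file_get_contents($path, false, null, 0, 4);
    if ($head === false) {
        throw new RuntimeException("Cannot read $path");
    }

    foreach ($signatures as $name => $bytes) {
        if (strncmp($head, $bytes, strlen($bytes)) === 0) {
            return $name;
        }
    }
    return null;
}
```

To scan a directory, call `detectBom()` for every path returned by something like `glob('*.php')` and print the ones where the result is not `null`.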

You can read more here: Byte order mark in Unicode.

It's really annoying how many things you miss until you start working in a multinational environment ;-)

About the author

Artur Ejsmont

Hi, my name is Artur Ejsmont,
welcome to my blog.

I am a passionate software engineer living in Sydney and working for Yahoo! Drop me a line or leave a comment.