Removing the Byte Order Mark (BOM) using sed

When working with UTF-8-encoded PHP files in web applications, a common and hard-to-track-down error is “Headers already sent” or “Cannot modify header information”. It usually occurs during a call to the function header(), which manipulates the HTTP header and therefore has to run before any output is sent to the client.

One reason for this is that the UTF-8 file starts with an invisible(!) byte order mark (BOM) consisting of the three bytes 0xEF, 0xBB, 0xBF. Because these bytes precede the opening <?php tag, PHP sends them to the client as output before header() is called. The BOM can be removed by opening the file in a suitable text editor and unticking the “Add Byte Order Mark (BOM)” option (or similar).
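Before removing anything, it can help to check which files actually start with these three bytes. The following is a minimal sketch, not part of the original recipe: has_bom is a hypothetical helper name, and it assumes a head that supports -c (GNU and BSD head both do).

```shell
# Hypothetical helper (not from the original post): exit status 0 if the
# file named in $1 starts with the UTF-8 BOM bytes EF BB BF, non-zero otherwise.
has_bom() {
    # Dump the first three bytes as hex and strip od's spacing and newline.
    [ "$(head -c 3 "$1" | od -An -t x1 | tr -d ' \n')" = "efbbbf" ]
}

# Usage: report every PHP file in the current directory that carries a BOM.
for f in ./*.php; do
    [ -f "$f" ] || continue      # skip the literal pattern if nothing matched
    if has_bom "$f"; then
        printf '%s starts with a BOM\n' "$f"
    fi
done
```

The loop only prints file names and does not modify any file.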

A more convenient way using sed is the following:

sed -i '1 s/^\xef\xbb\xbf//' utf8_file.txt

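(-i enables in-place operation of sed; the address 1 restricts the substitution to the first line of the file; ^ anchors the match at the start of that line. Both -i without a backup suffix and the \xHH byte escapes are GNU sed features, so other sed implementations, such as the BSD sed shipped with macOS, may need a different invocation.)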
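In a web application there are usually many PHP files to check, so the same substitution can be applied to a whole directory tree. This is a sketch under stated assumptions: strip_bom_tree is a hypothetical helper name, GNU sed is assumed, and LC_ALL=C is set only as a precaution so that sed matches raw bytes regardless of the locale.

```shell
# Hypothetical helper: strip a leading UTF-8 BOM from every *.php file below
# the directory given in $1 (default: the current directory).
# Assumes GNU sed, as -i without a suffix and \xHH escapes are GNU features.
strip_bom_tree() {
    # With -i, GNU sed treats each file separately, so the address 1
    # refers to the first line of every file, not only of the first one.
    LC_ALL=C find "${1:-.}" -type f -name '*.php' \
        -exec sed -i '1 s/^\xef\xbb\xbf//' {} +
}
```

Calling strip_bom_tree with a project directory as its argument processes all PHP files below it. Note that sed -i rewrites every file it is given, including files without a BOM, so modification times change for all of them.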

Example

Let’s consider a file consisting of two lines (‘A’, ‘B’) stored with the BOM:

<BOM>A
B
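Such a file can be created without a special editor. A minimal sketch, assuming a printf that understands octal escapes (as specified by POSIX); the file name testfile.txt matches the one used in this example:

```shell
# Write the BOM (octal 357 273 277, i.e. hex EF BB BF) followed by the
# two lines "A" and "B", each terminated by a newline.
printf '\357\273\277A\nB\n' > testfile.txt
```

The resulting file is 7 bytes long: three BOM bytes, two letters and two newlines.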

Inspecting this file with the dump tool od:

$ od -t c -t x1 testfile.txt

we obtain the following output:

0000000 357 273 277   A  \n   B  \n
         ef  bb  bf  41  0a  42  0a
0000007

The three BOM bytes are clearly visible.

After running

sed -i '1 s/^\xef\xbb\xbf//' testfile.txt

the same od command produces the following output, showing that the BOM is gone:

0000000   A  \n   B  \n
         41  0a  42  0a
0000004
