Byte-Order Mark found in UTF-8 File.
CarlSeiler
08-18-2007, 06:31 AM
I'm converting one of my pages from handwritten HTML 4.01 to PHP that generates XHTML 1.0 Strict. When I went to validate it at W3C, after cleaning up some closing slashes I'd missed on my img tags, the page validated, but I also got this warning message with a little yellow bang:
Byte-Order Mark found in UTF-8 File. The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to cause problems for some text editors and older browsers. You may want to consider avoiding its use until it is better supported.
I have no idea what that means, but indeed the file appears to have three little extra high-bit characters at the start. I don't have them in my PHP files, so I guess the server is adding them. In fact, I noticed that before I cleaned up my pages so they would validate correctly, Firefox would even display the characters. Now that the page validates, Firefox doesn't display them, but I do get the warning message in the validator. Dillo 0.8 under Linux displays the characters.
How do I stop the BOM? Should I even worry about it?
Thanks,
Carl
ktinkel
08-18-2007, 07:16 AM
I have no idea what that means, but indeed the file appears to have three little extra high-bit characters at the start. I don't have them in my php files, so I guess the server is adding them.
How do I stop the BOM? Should I even worry about it?
We had a discussion about a related topic after we shifted to a new host a couple of months ago. Turned out I had created a different error from yours by using “UTF-8 no BOM” so that it was picked up by header.php. That is a no-no in that context, although it is a good way to encode HTML files.
Marjolein figured it all out. Not sure it is relevant for you, but look at the “RSS feeds are broken!!! (http://www.desktoppublishingforum.com/bb/showthread.php?t=4440)” thread — maybe it will help.
iamback
08-18-2007, 07:28 AM
Byte-Order Mark found in UTF-8 File. The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to cause problems for some text editors and older browsers. You may want to consider avoiding its use until it is better supported.
I have no idea what that means, but indeed the file appears to have three little extra high-bit characters at the start.
(...)
How do I stop the BOM? Should I even worry about it?
Yes, you should worry about it as some browsers cannot handle it properly. You may see completely broken styling in IE for instance.
Strictly speaking, a file that's UTF-8 encoded does not need a byte-order mark (an indication for machines how to interpret the bytes in the file, as there's only one way to do that for UTF-8 encoded files) but some editors add one anyway when you specify UTF-8 encoding, or simply do so by default. It's quite possible your PHP files have it too, but browsers never see that and PHP doesn't care.
How to stop it depends on your editor: most editors have a way of telling them to save a file encoded as UTF-8 but not to write a BOM. Consult your editor's documentation (or user support, forum, or whatever); if it can't do it, there are free (and non-free) editors that can do it for you. (For instance, I use both UltraEdit and Crimson Editor, and both are capable of writing (and editing) UTF-8 encoded files and not writing a BOM.)
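As a rough illustration of the above, here is a minimal Python sketch that checks whether a file starts with the three BOM bytes (EF BB BF) and rewrites it without them. The function names and the temporary demo file are assumptions made for illustration; they are not something any poster in this thread used, and an editor setting such as "UTF-8 without BOM" remains the simpler fix.

```python
import codecs
import os
import tempfile


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the bytes start with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(path: str) -> bool:
    """Rewrite the file at `path` without a leading UTF-8 BOM.

    Returns True if a BOM was found and removed, False otherwise.
    """
    with open(path, "rb") as f:
        data = f.read()
    if not has_utf8_bom(data):
        return False
    with open(path, "wb") as f:
        f.write(data[len(codecs.BOM_UTF8):])
    return True


# Demo on a throwaway file (hypothetical content, not from the thread).
fd, demo_path = tempfile.mkstemp(suffix=".php")
with os.fdopen(fd, "wb") as f:
    f.write(codecs.BOM_UTF8 + b"<?php echo 'hola'; ?>")

removed = strip_utf8_bom(demo_path)
print("BOM removed:", removed)
os.remove(demo_path)

# A program that wrongly reads the BOM bytes as Latin-1 sees three
# high-bit characters, which matches the "extra characters" symptom.
print(codecs.BOM_UTF8.decode("latin-1"))
```

The files are opened in binary mode so that Python does not decode or alter any other byte while removing the mark.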
CarlSeiler
08-18-2007, 10:27 AM
Thanks for the posts, guys. I didn't think it was my editor, but after reading here and also this post at Stanford (http://www.stanford.edu/%7Elaurik/fsmbook/errata/BOM.html), I found that it does seem to be something my editor (Notepad++) is doing.
It turns out that I'm a whole version of Notepad++ behind, and the newer versions allow you to pick UTF-8 without BOM. I'll be upgrading in a few minutes.
Carl
Michael Rowley
08-18-2007, 01:42 PM
Marjolein:
Strictly speaking, a file that's UTF-8 encoded does not need a byte-order mark
It doesn't need one in any manner of speech! But what happens when a file is UTF-16 encoded? Do some browsers give up, and if so, which?
iamback
08-19-2007, 10:36 AM
But what happens when a file is UTF-16 encoded? Do some browsers give up, and if so, which?
A UTF-16 encoded file must have a byte-order mark. In the first place, to tell any program that tries to open it that it is UTF-16 encoded. If there is no BOM, there's no way to tell even whether it is supposed to be a text file or a binary file.
I have no idea of browser support for UTF-16, let alone UTF-16 without a BOM (which could be said not to be UTF-16 as a result). Is that even interesting? Are there any UTF-16 documents on the web? How many as compared to UTF-8 encoded documents? How many of those do not have a BOM? Should a browser care?
Michael Rowley
08-19-2007, 03:12 PM
Marjolein:
A UTF-16 encoded file must have a byte-order mark
Of course! But if UTF-8 doesn't use BOMs, and no browser supports BOMs, it follows that a BOM is not any use at all in the language browsers use. I thought you might have some views on the matter.
dthomsen8
08-19-2007, 06:03 PM
... It turns out that I'm a whole version of Notepad++ behind, and the newer versions allow you to pick UTF-8 without BOM. ...
With the newest Notepad++, you can pick UTF-8 without BOM. If you are working on XHTML or HTML, why are you picking UTF-8?
Shane Stanley
08-19-2007, 11:53 PM
A UTF-16 encoded file must have a byte-order mark.
A UTF-16 encoded file should have a byte-order mark, but sadly it is not a requirement.
Shane
iamback
08-20-2007, 12:02 AM
Of course! But if UTF-8 doesn't use BOMs, and no browser supports BOMs, it follows that a BOM is not any use at all in the language browsers use. I thought you might have some views on the matter.
Well, my "view" (experience, rather) is that UTF-8 is widely used on the web and doesn't need but does allow a BOM (which merely states its encoding), so browsers should support it (or at least ignore it and not be confused by it). It is not true that "no browser supports BOMs", and UTF-16 is rarely (if ever) used on the web. Finally, there is no "the language" that browsers use: most can use multiple languages, not limited to (X)HTML, CSS and JavaScript. But the languages browsers use have nothing to do with support (or lack of support) for BOMs or different encodings.
iamback
08-20-2007, 12:05 AM
With the newest Notepad++, you can pick UTF-8 without BOM. If you are working on XHTML or HTML, why are you picking UTF-8?
One has nothing to do with the other: XHTML or HTML is the markup, UTF-8 is the encoding, and required for multi-lingual support. The characters that make up (X)HTML markup themselves mostly are 7-bit ASCII (itself valid UTF-8) (apart from some allowed attribute values where allowed characters are determined by the document's encoding). So UTF-8 refers mostly to content while (X)HTML refers mostly to markup.
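The point that ASCII markup is already valid UTF-8 can be checked with a short Python sketch. The sample tag and the Spanish word are made-up illustrations, not taken from anyone's pages.

```python
# Markup made only of 7-bit ASCII characters encodes to the same bytes
# in ASCII and in UTF-8.
markup = '<p lang="es">'
ascii_bytes = markup.encode("ascii")
utf8_bytes = markup.encode("utf-8")
print(ascii_bytes == utf8_bytes)

# Content outside ASCII is where the encoding matters: in UTF-8 the
# letter "ñ" takes two bytes (C3 B1), so 5 characters become 6 bytes.
content = "señor"
content_utf8 = content.encode("utf-8")
print(len(content), len(content_utf8))
```

So switching a page to UTF-8 leaves the tags byte-for-byte unchanged and only affects how accented or non-Latin content is stored.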
iamback
08-20-2007, 12:08 AM
A UTF-16 encoded file should have a byte-order mark, but sadly it is not a requirement.
Oh, lovely - endless cross-platform problems. I assume programs will then assume their own platform's BO, but they must clearly guess at encoding (or regard such a file as binary)...
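To make the byte-order problem concrete, here is a small Python sketch. The helper name `sniff_utf16_bom` is a hypothetical one chosen for illustration, and the sketch deliberately ignores UTF-32, whose little-endian BOM begins with the same two bytes.

```python
import codecs

text = "Hi"

# The same text as UTF-16 in both byte orders, without any BOM.
le = text.encode("utf-16-le")  # bytes 48 00 69 00
be = text.encode("utf-16-be")  # bytes 00 48 00 69


def sniff_utf16_bom(data: bytes):
    """Return the UTF-16 byte order indicated by a leading BOM, or None."""
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be"
    return None  # no BOM: the reader has to guess


# With a BOM, the generic "utf-16" codec picks the right byte order
# and drops the mark from the decoded text.
with_bom = codecs.BOM_UTF16_BE + be
print(sniff_utf16_bom(with_bom), with_bom.decode("utf-16") == text)

# Without a BOM, guessing the wrong order still decodes without error
# but yields different characters.
print(sniff_utf16_bom(be), be.decode("utf-16-le") == text)
```

The second case is the cross-platform trap: nothing fails loudly, the text is simply wrong.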
CarlSeiler
08-27-2007, 11:29 AM
With the newest Notepad++, you can pick UTF-8 without BOM. If you are working on XHTML or HTML, why are you picking UTF-8?
Several of my pages are in Spanish or Spanish and English. UTF-8 makes it easier for me to handle these.