For the nth time I've been jabbed in the head by the Apache default config file turning on the AddDefaultCharset setting. The httpd.conf that Debian distributes has the same bug.
The problem is caused by a line like this in the Apache config:
AddDefaultCharset on
With this turned on, Apache adds a character encoding declaration to the HTTP headers of every HTML document it sends out:
Content-Type: text/html; charset=iso-8859-1
Which sounds reasonable, but it's not the ‘default
character set’, it's an override of the character set.
So whatever character encodings your documents use, this tells
the browser to treat them as ISO-8859-1, even if they have
a <meta> element in the document head saying otherwise.
This is not helpful.
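To make the override concrete, here is a rough Python sketch of the precedence rule described above: a charset in the HTTP header wins, and the <meta> element only gets a say if the header is silent. The function names are made up for illustration, the regular expression is a crude stand-in for a real HTML parser, and it ignores byte order marks and any other guessing a browser might do.

```python
import re
from email.message import Message

# Crude pattern (an assumption, not a real HTML parser): finds "charset=..."
# inside a <meta> tag, covering both the http-equiv form and <meta charset="...">.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def header_charset(content_type):
    """Return the charset parameter of a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def meta_charset(html_bytes):
    """Return the charset declared in a <meta> element near the start, or None."""
    match = META_CHARSET.search(html_bytes[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def effective_charset(content_type, html_bytes):
    """The HTTP header wins; <meta> is only consulted if the header has no charset."""
    return header_charset(content_type) or meta_charset(html_bytes)
```

Fed a UTF-8 page together with the header Apache adds, effective_charset() reports iso-8859-1; with a bare "text/html" header the <meta> declaration is used instead.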
By the way, I'm not the only one who thinks this is wrong (see AddDefaultCharset considered harmful and the Bugzilla records linked from there).
This stuff is most likely to bite if your web pages are in UTF-8 (and mine usually are because it tends to make things simpler) or if you're using some national character set other than latin1 (so if you've got pages in Greek or Japanese or whatever).
You can test for this by looking at the HTTP headers returned when
you run the HEAD command (usually available on Linux boxen
with Perl installed), or by telnetting to the server on port 80.
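If neither tool is to hand, the telnet approach can be approximated in a few lines of Python. This is only a sketch: the function names are my own, it speaks plain HTTP/1.0 so the server closes the connection when it's done, and it does no error handling.

```python
import socket


def build_head_request(host, path="/"):
    """Build the bytes you would otherwise type into telnet by hand."""
    return f"HEAD {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")


def content_type_from_response(raw):
    """Pull the Content-Type header value out of a raw HTTP response, or None."""
    head = raw.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    for line in head.split("\r\n")[1:]:  # skip the status line
        name, _, value = line.partition(":")
        if name.strip().lower() == "content-type":
            return value.strip()
    return None


def head_request(host, port=80, path="/"):
    """Send a HEAD request and return the raw response bytes."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(build_head_request(host, path))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # server closed the connection
                break
            chunks.append(chunk)
    return b"".join(chunks)
```

Something like content_type_from_response(head_request("localhost")) then gives you the header to inspect. If it ends in a charset you never asked for, AddDefaultCharset is the likely culprit.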
Apache as a proxy
Note that even if your web pages are served by some backend webserver and proxied through Apache, it will still add one of these things if it feels like it, so you definitely want to turn it off in the proxy. If Apache is merely proxying a document through from another webserver it's even less likely to know what encoding it's in than if it's serving it off the disk.
What to do about it
My advice, for what it's worth:
- Turn off the AddDefaultCharset option by commenting it out in httpd.conf.
- Try to make sure every HTML file you create (whether static or dynamic) has a <meta> element in to specify the correct encoding. For UTF-8 it looks like this in XHTML:
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
- If your pages are in XHTML then put an XML declaration in too. It will mostly be ignored by browsers, but will allow general XML processors to work out what encoding the document uses. For UTF-8 it should look like this:
  <?xml version="1.0" encoding="UTF-8"?>
- If your web pages are in English then life might just be simpler if you stick to ASCII. You can still use character references (&#8212;) or entity references (&mdash;) to get non-ASCII characters.
- If you've got an existing website which doesn't do any of the above, and if you're really sure that all the HTML will always be in the same encoding, then use AddDefaultCharset on the appropriate virtual host, not globally.
Security concerns
There have been security concerns voiced about documents which don't specify a character encoding, which may be why Apache has the AddDefaultCharset thing turned on by default. Apparently the confusion about the encoding could be exploited as a cross-site scripting vulnerability. I'm not too worried about this myself: the CERT warning about the issue only mentions character encodings in passing, concentrating on lack of validation in scripts as the principal problem.
Just for reference here are the relevant documents: