AddDefaultCharset is bad, bad, bad

For the nth time I've been jabbed in the head by the Apache default config file turning on the AddDefaultCharset setting. The httpd.conf that Debian distributes has the same bug.

The problem is caused by a line like this in the Apache config:

AddDefaultCharset on

With this turned on, Apache declares a character encoding in the HTTP headers of every HTML document it sends out:

Content-Type: text/html; charset=iso-8859-1

Which sounds reasonable, but it's not the ‘default character set’, it's an override of the character set. So whatever character encodings your documents actually use, this tells the browser to treat them as ISO-8859-1, even if they have a <meta> element in the <head> saying otherwise. This is not helpful.
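To make the override concrete, here is a rough Python sketch of the precedence a browser applies: a charset in the HTTP header wins, and the <meta> element is only consulted when the header names none. The function name effective_charset and the regular expression are my own illustration, not anything Apache or a browser exposes, and real browsers do considerably more sniffing than this.

```python
import re

# Hypothetical, simplified pattern: finds charset=... inside a <meta> tag.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([A-Za-z0-9_-]+)', re.IGNORECASE)


def effective_charset(header_charset, body):
    """Return the charset a browser would use: HTTP header first, then <meta>."""
    if header_charset:
        # A charset in the Content-Type header overrides anything in the document.
        return header_charset.lower()
    match = META_CHARSET.search(body)
    return match.group(1).decode("ascii").lower() if match else None


page = b'<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />'

# With AddDefaultCharset on, the header wins and the <meta> element is ignored.
with_override = effective_charset("iso-8859-1", page)

# With no charset in the header, the document's own declaration is used.
without_override = effective_charset(None, page)

# What a UTF-8 "é" turns into when it is decoded as ISO-8859-1.
garbled = "é".encode("utf-8").decode("iso-8859-1")
```

The last line shows the symptom: every non-ASCII character in a UTF-8 page comes out as two wrong characters.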

By the way, I'm not the only one who thinks this is wrong (see AddDefaultCharset considered harmful and the Bugzilla records linked from there).

This stuff is most likely to bite if your web pages are in UTF-8 (and mine usually are because it tends to make things simpler) or if you're using some national character set other than latin1 (so if you've got pages in Greek or Japanese or whatever).

You can test for this by looking at the HTTP headers returned when you run the HEAD command (usually available on Linux boxen with Perl installed), or by telnetting to the server on port 80.
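If you'd rather check the result mechanically, something like the following Python sketch will pull the charset out of a header dump pasted from HEAD or a telnet session. The function name declared_charset is made up for this example, and I'm assuming the dump may start with a status line, which gets skipped.

```python
from email.parser import Parser


def declared_charset(raw_headers):
    """Return the charset named in the Content-Type header, or None if there is none."""
    lines = raw_headers.strip().splitlines()
    # Drop a leading status line (e.g. "HTTP/1.1 200 OK") if the dump has one;
    # unlike a real header line, it contains no colon.
    if lines and ":" not in lines[0]:
        lines = lines[1:]
    message = Parser().parsestr("\n".join(lines), headersonly=True)
    return message.get_content_charset()


sample = (
    "HTTP/1.1 200 OK\r\n"
    "Server: Apache\r\n"
    "Content-Type: text/html; charset=iso-8859-1\r\n"
)
found = declared_charset(sample)
```

If this returns a charset you never asked for, AddDefaultCharset (or something like it) is probably switched on somewhere.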

Apache as a proxy

Note that even if your web pages are served by some backend webserver and proxied through Apache, it will still add one of these things if it feels like it, so you definitely want to turn it off in the proxy. If Apache is merely proxying a document through from another webserver it's even less likely to know what encoding it's in than if it's serving it off the disk.

What to do about it

My advice, for what it's worth:

  • Turn off the AddDefaultCharset option by commenting it out in httpd.conf.
  • Try to make sure every HTML file you create (whether static or dynamic) has a <meta> element in to specify the correct encoding. For UTF-8 it looks like this in XHTML:

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
  • If your pages are in XHTML then put an XML declaration in too. It will mostly be ignored by browsers, but will allow general XML processors to work out what encoding the document uses. For UTF-8 it should look like this:

    <?xml version="1.0" encoding="UTF-8"?>
  • If your web pages are in English then life might just be simpler if you stick to ASCII. You can still use character references (&#8212;) or entity references (&mdash;) to get non-ASCII characters.
  • If you've got an existing website which doesn't do any of the above, and if you're really sure that all the HTML will always be in the same encoding, then use AddDefaultCharset on the appropriate virtual host, not globally.
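For the stick-to-ASCII option, Python can do the conversion to character references for you. This is only a sketch: to_ascii_html is a name I've invented, and it does nothing about escaping markup characters such as < or &, which you'd still need to handle separately.

```python
def to_ascii_html(text):
    """Encode text as ASCII bytes, replacing other characters with numeric character references."""
    return text.encode("ascii", "xmlcharrefreplace")


# An em dash (U+2014) becomes the character reference &#8212;
converted = to_ascii_html("left\u2014right")
```

The output is plain ASCII, so it reads the same whether the browser is told ISO-8859-1, UTF-8, or nothing at all.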

Security concerns

There have been security concerns voiced about documents which don't specify a character encoding, which may be why Apache has the AddDefaultCharset thing turned on by default. Apparently the confusion about the encoding could be exploited as a cross-site scripting vulnerability. I'm not too worried about this myself: the CERT warning about the issue only mentions character encodings in passing, concentrating on lack of validation in scripts as the principal problem.

Just for reference here are the relevant documents:
