#!/bin/blog --author=p4bl0

the blog where all numbers are written in base 10

PHP and MySQL: going fully UTF-8 for real

by p4bl0, on
last update by p4bl0, on

I know, PHP ain't pretty. But I use it, and many people do. Here is a quick description of how to use UTF-8 everywhere for real and stop worrying about encoding problems. This post is mostly a reminder for my future self, but since it could also be useful to other people I thought I would blog it.

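Let's start with MySQL. First, tell MySQL to use UTF-8 internally by adding these two lines to the [mysqld] section of your "my.cnf" configuration file (which is in /etc/mysql under Debian). Note that MySQL's utf8 charset stores at most three bytes per character, so it cannot hold characters outside the Basic Multilingual Plane (emoji, for instance). Since MySQL 5.5.3 the utf8mb4 charset (with collations such as utf8mb4_unicode_ci) covers the full range, and it can be substituted for utf8 everywhere in this post.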

collation_server = utf8_unicode_ci
character_set_server = utf8

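Then restart your MySQL server. When you start interacting with it, begin with these two statements (put them at the beginning of your SQL scripts). The second one is a command of the mysql command-line client rather than an SQL query, so it only makes sense in scripts fed to that client: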

SET NAMES 'utf8';
CHARSET 'utf8';

Now when you create your database also tell MySQL to use UTF-8 as the default charset for the database:

CREATE DATABASE `my_db` DEFAULT CHARACTER SET 'utf8';

Also when you create tables, don't forget to specify UTF-8 as the default charset:

CREATE TABLE `my_table` (
  -- ...
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Now on the PHP side, each time you open a connection to your MySQL database, tell MySQL to use UTF-8 for the connection. This can be done by executing the same SET NAMES query as before, or by using a function or method which does this, depending on what you use to connect to MySQL from PHP. I personally have a wrapper class for PHP's mysql_* functions (deprecated in PHP 5.5 and removed in PHP 7, where mysqli or PDO take their place), so I do something like this:

mysql_set_charset('utf8', $connection);

Where $connection has been returned by a call to mysql_connect or mysql_pconnect.
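If you are on a PHP version where the mysql_* functions are gone, the same idea carries over to PDO. The following is only a sketch under assumptions: the helper names utf8_dsn, connect_utf8 and charset_variables are made up for this example, and the host, database name and credentials are placeholders. With mysqli, the equivalent of the call above is mysqli_set_charset($connection, 'utf8').

```php
<?php
// Hypothetical helper: builds a PDO DSN that requests UTF-8 for the connection.
function utf8_dsn(string $host, string $dbname): string
{
    // pdo_mysql honours the "charset" DSN option since PHP 5.3.6.
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8', $host, $dbname);
}

// Hypothetical helper: opens the connection (credentials are placeholders).
function connect_utf8(string $host, string $dbname, string $user, string $password): PDO
{
    return new PDO(utf8_dsn($host, $dbname), $user, $password, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION, // fail loudly on errors
    ]);
}

// Returns the character_set_* variables as seen by this connection,
// as an array of variable name => value, so the setup can be checked.
function charset_variables(PDO $pdo): array
{
    $stmt = $pdo->query("SHOW VARIABLES LIKE 'character_set%'");
    return $stmt->fetchAll(PDO::FETCH_KEY_PAIR);
}
```

Calling charset_variables(connect_utf8('localhost', 'my_db', 'user', 'secret')) and looking at character_set_client, character_set_connection and character_set_results is a quick way to see whether the connection really speaks UTF-8.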

Here is a simple function which tries to encode a string as UTF-8 if it is not already encoded that way, to use on strings submitted by your users through forms:

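function u ($str)
{
  // Ask mbstring to guess the encoding; it returns false when it cannot tell.
  $encoding = mb_detect_encoding($str);
  if ($encoding === false)
    return mb_convert_encoding($str, 'UTF-8', 'auto');
  // Already UTF-8: nothing to do.
  if ($encoding === 'UTF-8')
    return $str;
  // Otherwise convert from the detected encoding.
  return mb_convert_encoding($str, 'UTF-8', $encoding);
}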
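Keep in mind that mb_detect_encoding only makes a guess. A stricter variant, which is not the function above but a different technique, is to validate the bytes with mb_check_encoding and to fall back on one assumed legacy encoding only when they are not valid UTF-8. In this sketch the name to_utf8 and the ISO-8859-1 default are assumptions made for the example.

```php
<?php
// Hypothetical variant of u(): validate instead of guessing.
// $fallback is an assumption about what non-UTF-8 clients send.
function to_utf8(string $str, string $fallback = 'ISO-8859-1'): string
{
    // Byte sequences that are already valid UTF-8 are returned untouched.
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str;
    }
    // Anything else is reinterpreted as the fallback encoding.
    return mb_convert_encoding($str, 'UTF-8', $fallback);
}
```

The trade-off is that a string in some other legacy encoding gets converted as if it were the fallback one, but valid UTF-8 input is never mangled by a wrong guess.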

Now to serve the webpages as UTF-8 we need to tell the browsers about the charset using HTTP headers:

header('Content-type: text/html; charset=utf-8');

Saying it again in the HTML <head> section can't do any harm:

<meta http-equiv="Content-type" content="text/html; charset=utf-8" />

Then don't forget to tell the browser to also submit forms using UTF-8 (it should do so already because it's the page encoding, but once again, it can't do any harm to specify it one more time):

<form accept-charset="UTF-8" ...>
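Put together, the HTTP header, the meta tag and the form attribute could look like the following sketch. The function name render_page, the form field and the "/submit.php" target are made up for the example. Passing 'UTF-8' explicitly to htmlspecialchars keeps the escaping consistent with the page encoding.

```php
<?php
// Hypothetical helper: returns a minimal UTF-8 page containing one form.
function render_page(string $title, string $formAction): string
{
    // Escape with an explicit charset so the input is treated as UTF-8.
    $t = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');
    $a = htmlspecialchars($formAction, ENT_QUOTES, 'UTF-8');
    return <<<HTML
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<title>$t</title>
</head>
<body>
<form method="post" action="$a" accept-charset="UTF-8">
<input type="text" name="comment" />
<input type="submit" />
</form>
</body>
</html>
HTML;
}

// The HTTP header has to go out before any markup.
if (!headers_sent()) {
    header('Content-type: text/html; charset=utf-8');
}
echo render_page('Guestbook', '/submit.php');
```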

And that's it! Of course you have to edit your files with a Unicode-aware editor (Emacs!), use a Unicode terminal (urxvt!), set a UTF-8 locale in your shell's LANG environment variable, and also tell screen (if you use it) to use Unicode by adding the line "encoding UTF-8" to your ".screenrc".

If you still have encoding problems now, quit using those extra-terrestrial alphabets.

If you have any remark about this blog or if you want to react to this article feel free to send me an email at "pablo <r@uzy dot me>".