PHP and MySQL: going fully UTF-8 for real
I know, PHP ain't pretty. But I use it, and so do many other people. Here is a quick description of how to use UTF-8 everywhere for real and stop worrying about encoding problems. This post is mostly a reminder for my future self, but since it could also be useful to other people I thought I would blog it.
Let's start with MySQL. First, tell MySQL to use UTF-8 internally by adding these two lines to the [mysqld] section of your "my.cnf" configuration file (which is in /etc/mysql under Debian):
collation_server = utf8_unicode_ci
character_set_server = utf8
Be aware that MySQL's utf8 charset stores at most three bytes per character, so it cannot hold characters outside the Basic Multilingual Plane (emoji, for instance). From MySQL 5.5.3 on, utf8mb4 (with the utf8mb4_unicode_ci collation) covers all of Unicode and can be substituted wherever utf8 appears in this post.
Then restart your MySQL server. When you start interacting with it, begin with these two statements (put them at the beginning of your SQL scripts); the second one is a command of the mysql command-line client rather than SQL proper:
SET NAMES 'utf8';
CHARSET 'utf8';
Now when you create your database also tell MySQL to use UTF-8 as the default charset for the database:
CREATE DATABASE `my_db` DEFAULT CHARACTER SET 'utf8';
Also when you create tables, don't forget to specify UTF-8 as the default charset:
CREATE TABLE `my_table` (
    -- ...
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Now on the PHP side, each time you open a connection to your MySQL database, tell MySQL to use UTF-8 for the connection. This can be done by executing the same SET NAMES query as before, or by calling the function or method that does it in whatever you use to connect to MySQL from PHP. I personally have a wrapper class for PHP's mysql_* functions, so I do something like this:
mysql_set_charset('utf8', $connection);
Where $connection has been returned by a call to mysql_connect or mysql_pconnect.
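The mysql_* functions were removed in PHP 7.0, so on a current PHP the same step has to go through mysqli or PDO. Here is a minimal sketch of the PDO variant, where the charset is requested in the DSN (honoured since PHP 5.3.6). The helper names utf8_dsn and open_utf8_pdo are made up for this example, and the host, database and credentials are placeholders.

```php
<?php
// Hypothetical helper: builds a PDO DSN that asks MySQL to use the given
// charset for the connection (the equivalent of SET NAMES).
function utf8_dsn($host, $database, $charset = 'utf8')
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $database, $charset);
}

// Hypothetical helper: opens a PDO connection with that DSN and makes
// errors throw exceptions instead of failing silently.
function open_utf8_pdo($host, $database, $user, $password)
{
    return new PDO(utf8_dsn($host, $database), $user, $password, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);
}
```

With mysqli, the closest match to the call above is $mysqli->set_charset('utf8') right after connecting.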
Here is a simple function which tries to encode a string as UTF-8 if it is not already encoded that way, to use on strings submitted by your users through forms:
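function u($str)
{
    // Detection failed: let mbstring guess using its default detection order.
    if (($encoding = mb_detect_encoding($str)) === false) {
        return mb_convert_encoding($str, 'UTF-8', 'auto');
    }
    // Already UTF-8: return the string untouched.
    if ($encoding == 'UTF-8') {
        return $str;
    }
    // Anything else: convert from the detected encoding.
    return mb_convert_encoding($str, 'UTF-8', $encoding);
}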
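Encoding detection is a guess, so a stricter variant is to validate instead of detect: keep the string if it is already valid UTF-8, and otherwise convert it from one fallback encoding you choose yourself. This is only a sketch of that idea, not a replacement for the function above. The name ensure_utf8 and the ISO-8859-1 default are assumptions; pick the fallback that matches what your users are likely to send. It needs the mbstring extension.

```php
<?php
// Hypothetical alternative to u(): validate rather than detect.
function ensure_utf8($str, $fallback = 'ISO-8859-1')
{
    // Valid UTF-8 byte sequences are returned untouched.
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str;
    }
    // Everything else is assumed to be in the fallback encoding.
    return mb_convert_encoding($str, 'UTF-8', $fallback);
}
```

For example, ensure_utf8("caf\xE9") converts the lone Latin-1 byte into the two-byte UTF-8 sequence "\xC3\xA9", while a string that is already valid UTF-8 comes back unchanged.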
Now to serve the webpages as UTF-8 we need to tell the browsers about the charset using HTTP headers:
header('Content-type: text/html; charset=utf-8');
Saying it again in the HTML <head> section can't do any harm:
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
Then don't forget to tell the browser to also submit forms using UTF-8 (it should do so already because it's the page encoding, but once again, it can't do any harm to specify it one more time):
<form accept-charset="UTF-8" ...>
And that's it! Of course you have to edit your files with a Unicode-compliant editor (Emacs!), use a Unicode terminal (urxvt!), tell your shell to use UTF-8 in your LANG environment variable and also tell screen (if you use it) to use Unicode by adding the line encoding UTF-8 in your ".screenrc".
If you still have encoding problems now, quit using those extra-terrestrial alphabets.