Using Multi-Byte Character Sets in PHP (Unicode, UTF-8, etc)

Wednesday, 15 October 2008, 1:00 pm
The following list details the PHP string functions that can cause problems when handling multi-byte strings. Where one exists, the multi-byte-safe alternative is given:

mail()
Try mb_send_mail() instead.


strlen()
Try mb_strlen() instead.
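
As a minimal sketch of the difference, assuming UTF-8 input and PHP 7 or later (for the "\u{...}" escape), the following compares the byte count from strlen() with the character count from mb_strlen(). The variable names are illustrative only.

```php
<?php
// "café", written with a Unicode escape so the example does not depend
// on the encoding of the source file.
$word = "caf\u{00E9}";

// strlen() counts bytes: "é" occupies two bytes in UTF-8.
$byteCount = strlen($word);

// mb_strlen() counts characters when told the encoding.
$charCount = mb_strlen($word, 'UTF-8');

echo $byteCount, ' bytes, ', $charCount, " characters\n";
```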


strpos()
Try mb_strpos() instead.


strrpos()
Try mb_strrpos() instead.


substr()
Try mb_substr() instead.
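
The sketch below, assuming UTF-8 input and PHP 7 or later, shows how substr() can cut a character in half while mb_substr() does not. The sample string and variable names are illustrative only.

```php
<?php
// "Zoë Smith"; "ë" is two bytes in UTF-8.
$name = "Zo\u{00EB} Smith";

// substr() cuts at byte offsets, so a length of 3 ends in the middle of "ë".
$byteCut = substr($name, 0, 3);

// mb_substr() cuts at character offsets.
$charCut = mb_substr($name, 0, 3, 'UTF-8');

// The byte-based cut is no longer valid UTF-8; the character-based cut is.
var_dump(mb_check_encoding($byteCut, 'UTF-8'));
var_dump(mb_check_encoding($charCut, 'UTF-8'));
```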


strtolower()
Try mb_strtolower() instead.


strtoupper()
Try mb_strtoupper() instead.
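
A short sketch of the case-conversion problem, assuming UTF-8 input, PHP 7 or later, and the default "C" locale (before PHP 8, strtoupper() follows the current locale, so results there may differ if setlocale() has been called):

```php
<?php
// "münchen"; "ü" is two bytes in UTF-8.
$city = "m\u{00FC}nchen";

// strtoupper() only converts the ASCII letters and leaves the bytes of "ü" alone.
$asciiOnly = strtoupper($city);

// mb_strtoupper() converts "ü" to "Ü" as well.
$multiByte = mb_strtoupper($city, 'UTF-8');

echo $asciiOnly, "\n", $multiByte, "\n";
```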


substr_count()
Try mb_substr_count() instead.


ereg()
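Try mb_ereg() instead. Note that ereg(), eregi(), ereg_replace(), eregi_replace() and split() were deprecated in PHP 5.3 and removed in PHP 7.0.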


eregi()
Try mb_eregi() instead.


ereg_replace()
Try mb_ereg_replace() instead.


eregi_replace()
Try mb_eregi_replace() instead.


preg_match()
To avoid having to recompile PHP with the PCRE UTF-8 flag enabled, you can add the sequence (*UTF8) at the start of your pattern. For example, '/(*UTF8)[[:alnum:]]/' is reported to match 'é' where '/[[:alnum:]]/' does not. The /u pattern modifier also provides UTF-8 awareness. The preg_* functions are contentious because careful use can be safe. If you are unsure what to do, see mb_eregi() as a possible replacement.


preg_replace()
Investigate the /u pattern modifier, which provides UTF-8 awareness. The preg_* functions are contentious because careful use can be safe. If you are unsure what to do, see mb_ereg_replace() as a possible replacement.
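
As a minimal sketch of the /u modifier mentioned for preg_match() and preg_replace(), assuming UTF-8 input and PHP 7 or later, the following shows the same pattern with and without it. The patterns and variable names are illustrative only.

```php
<?php
// "é", two bytes in UTF-8.
$char = "\u{00E9}";

// Without /u the subject is treated as bytes: "." matches one byte, so the
// anchored pattern cannot cover the two-byte character.
$withoutU = preg_match('/^.$/', $char);

// With /u the pattern and subject are treated as UTF-8 and "." matches one
// whole character.
$withU = preg_match('/^.$/u', $char);

// preg_replace() with /u: mask every letter (Unicode property \p{L}) in "café".
$masked = preg_replace('/\p{L}/u', '*', "caf\u{00E9}");

var_dump($withoutU, $withU, $masked);
```

Note that preg_match() returns 1 or 0 rather than true or false, and that the preg_* functions return an error value if the subject is not valid UTF-8 when /u is used.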


split()
Try mb_split() instead.


explode()
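Try mb_split() instead. Note that mb_split() treats its first argument as a regular expression pattern, so any regular expression metacharacters in the delimiter need escaping.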


stripos()
Try mb_stripos() instead.


stristr()
Try mb_stristr() instead.


strrchr()
Try mb_strrchr() instead.


strripos()
Try mb_strripos() instead.


strstr()
Try mb_strstr() instead.


strrev()
View comments for possible workarounds.
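
The comments referred to are not reproduced here. One possible workaround, sketched under the assumption of UTF-8 input and PHP 7 or later, is to split the string into characters with preg_split() and the /u modifier, then reverse the pieces. The helper name utf8_strrev() is hypothetical, not a built-in function, and the sketch reverses code points, so combining marks will end up detached from their base characters.

```php
<?php
// Hypothetical helper, not part of PHP: reverses a UTF-8 string by code point.
function utf8_strrev(string $text): string
{
    // An empty pattern with /u splits between UTF-8 characters;
    // PREG_SPLIT_NO_EMPTY drops the empty leading and trailing pieces.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false when the input is not valid UTF-8.
    if ($chars === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }

    return implode('', array_reverse($chars));
}

// "café" reversed character by character.
$reversed = utf8_strrev("caf\u{00E9}");

// strrev() reverses bytes instead, which breaks the two-byte "é".
$byteReversed = strrev("caf\u{00E9}");

var_dump(mb_check_encoding($reversed, 'UTF-8'));
var_dump(mb_check_encoding($byteReversed, 'UTF-8'));
```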


wordwrap()
View comments for possible workarounds.


chunk_split()
No known workarounds yet.


ucfirst()
View the comment posted on "11-Feb-2008 04:31" for a possible workaround.
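
The comment referred to is not reproduced here, so the following is a separate sketch rather than that workaround. It assumes UTF-8 input by default and PHP 7 or later, and the helper name utf8_ucfirst() is hypothetical.

```php
<?php
// Hypothetical helper, not part of PHP: upper-cases the first character only.
function utf8_ucfirst(string $text, string $encoding = 'UTF-8'): string
{
    // Take the first character and the remainder by character offset.
    $first = mb_substr($text, 0, 1, $encoding);
    $rest  = mb_substr($text, 1, null, $encoding);

    return mb_strtoupper($first, $encoding) . $rest;
}

// "école" becomes "École".
$capitalised = utf8_ucfirst("\u{00E9}cole");

echo $capitalised, "\n";
```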


lcfirst()
This function is flagged because its companion function (ucfirst) is not safe. However, this function is untested.


trim(), rtrim() and ltrim()
These may be multi-byte safe if you use UTF-8 only (multi-byte UTF-8 characters contain no bytes that resemble white space). Avoid UTF-16 and UTF-32, among others.


strip_tags()
It may be multi-byte safe if you use UTF-8 only (multi-byte UTF-8 characters contain no bytes that resemble less-than or greater-than symbols). Avoid UTF-16 and UTF-32, among others.


ucwords()
Try this code instead:
$str = mb_convert_case($str, MB_CASE_TITLE, "UTF-8");
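
A brief usage sketch of the line above, assuming UTF-8 input and PHP 7 or later. One difference worth knowing: MB_CASE_TITLE also lower-cases the remaining letters of each word, whereas ucwords() leaves them untouched.

```php
<?php
// "élan vital" becomes "Élan Vital".
$title = mb_convert_case("\u{00E9}lan vital", MB_CASE_TITLE, 'UTF-8');

// Letters after the first in each word are lower-cased by MB_CASE_TITLE...
$normalised = mb_convert_case('hELLO wORLD', MB_CASE_TITLE, 'UTF-8');

// ...while ucwords() only touches the first letter of each word.
$firstOnly = ucwords('hELLO wORLD');

echo $title, "\n", $normalised, "\n", $firstOnly, "\n";
```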
