Using Multi-Byte Character Sets in PHP (Unicode, UTF-8, etc)

Wednesday, 15 October 2008, 1:00 pm

The following list covers the PHP string functions that can cause problems when handling multi-byte strings. Where a multi-byte safe alternative exists, it is given:

mail()

Try mb_send_mail() instead.

strlen()

Try mb_strlen() instead.
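
As a minimal sketch of the difference, assuming the mbstring extension is loaded and the string is UTF-8 (the variable names are illustrative only):

```php
<?php
// "café" encoded as UTF-8: the final character é occupies two bytes (0xC3 0xA9).
$word = "caf\xC3\xA9";

$bytes = strlen($word);             // counts bytes
$chars = mb_strlen($word, 'UTF-8'); // counts characters

echo $bytes, "\n"; // 5
echo $chars, "\n"; // 4
```

Passing the encoding explicitly avoids depending on the configured internal encoding.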

strpos()

Try mb_strpos() instead.
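
The offsets returned by the two functions can disagree, which matters as soon as the result is fed into another character-based call. A small sketch, assuming UTF-8 input and the mbstring extension:

```php
<?php
// "café au lait" in UTF-8; é is two bytes (0xC3 0xA9).
$text = "caf\xC3\xA9 au lait";

$byteOffset = strpos($text, 'au');             // offset counted in bytes
$charOffset = mb_strpos($text, 'au', 0, 'UTF-8'); // offset counted in characters

echo $byteOffset, "\n"; // 6
echo $charOffset, "\n"; // 5
```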

strrpos()

Try mb_strrpos() instead.

substr()

Try mb_substr() instead.
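
The risk with substr() is that it can cut a character in half. The following sketch, which assumes UTF-8 input and the mbstring extension, uses mb_check_encoding() to show that the byte-based cut is no longer valid UTF-8:

```php
<?php
// "naïve" in UTF-8; ï is two bytes (0xC3 0xAF).
$word = "na\xC3\xAFve";

$byteCut = substr($word, 0, 3);             // "na" plus the first byte of ï
$charCut = mb_substr($word, 0, 3, 'UTF-8'); // "naï"

var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($charCut, 'UTF-8')); // bool(true)
```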

strtolower()

Try mb_strtolower() instead.

strtoupper()

Try mb_strtoupper() instead.

substr_count()

Try mb_substr_count() instead.

ereg()

Try mb_ereg() instead.

eregi()

Try mb_eregi() instead.

ereg_replace()

Try mb_ereg_replace() instead.

eregi_replace()

Try mb_eregi_replace() instead.

preg_match()

To avoid having to recompile PHP with the PCRE UTF-8 flag enabled, you can add the sequence (*UTF8) at the start of your pattern. For example, '/(*UTF8)[[:alnum:]]/' will match 'é' where '/[[:alnum:]]/' will not. The /u pattern modifier also provides UTF-8 awareness. The preg_* functions are contentious, because they can be safe when used carefully. If you are unsure what to do, see mb_eregi() as a possible replacement.
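
A short sketch of what the /u modifier changes, assuming a PCRE build with UTF-8 and Unicode property support. The \p{L} class is used here as an illustration and is not part of the original example:

```php
<?php
// "é" encoded as UTF-8 is two bytes: 0xC3 0xA9.
$char = "\xC3\xA9";

// Without /u, "." matches a single byte, so a two-byte string cannot match ^.$
$byteMode = preg_match('/^.$/', $char);

// With /u, "." matches one whole UTF-8 character.
$utf8Mode = preg_match('/^.$/u', $char);

// In UTF-8 mode, \p{L} (any Unicode letter) matches é.
$isLetter = preg_match('/^\p{L}$/u', $char);

var_dump($byteMode, $utf8Mode, $isLetter); // int(0), int(1), int(1)
```

Note that preg_match() returns 1 or 0 rather than true or false, and false only on error, for instance when /u is given a subject that is not valid UTF-8.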

preg_replace()

Investigate the /u pattern modifier, which provides UTF-8 awareness. The preg_* functions are contentious, because they can be safe when used carefully. If you are unsure what to do, see mb_ereg_replace() as a possible replacement.
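
To illustrate, here is a sketch that replaces every character with an asterisk, with and without /u (UTF-8 input assumed):

```php
<?php
// "déjà" in UTF-8: é and à are two bytes each, so the string is 6 bytes long.
$text = "d\xC3\xA9j\xC3\xA0";

$byteWise = preg_replace('/./', '*', $text);  // one replacement per byte
$charWise = preg_replace('/./u', '*', $text); // one replacement per character

echo $byteWise, "\n"; // ******
echo $charWise, "\n"; // ****
```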

split()

Try mb_split() instead.

explode()

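Try mb_split() instead. Note that mb_split() treats its first argument as a regular expression, so any regex metacharacters in the delimiter must be escaped.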
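
A minimal sketch of mb_split() with a multi-byte delimiter, assuming mbstring was built with its regex support and that UTF-8 is the intended encoding:

```php
<?php
// mb_split() uses the encoding set by mb_regex_encoding().
mb_regex_encoding('UTF-8');

// Three words separated by the ideographic comma U+3001 (0xE3 0x80 0x81).
$list  = "un\xE3\x80\x81deux\xE3\x80\x81trois";
$parts = mb_split("\xE3\x80\x81", $list);

// The pattern is a regular expression, so a literal dot has to be escaped.
$dotted = mb_split('\.', 'a.b');

print_r($parts);  // un, deux, trois
print_r($dotted); // a, b
```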

stripos()

Try mb_stripos() instead.

stristr()

Try mb_stristr() instead.

strrchr()

Try mb_strrchr() instead.

strripos()

Try mb_strripos() instead.

strstr()

Try mb_strstr() instead.

strrev()

View comments for possible workarounds.
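
One common approach, sketched here under the assumption of valid UTF-8 input, is to split the string into characters with preg_split() and the /u modifier and then reverse the resulting array. The function name utf8_strrev is hypothetical and this is not necessarily the workaround the comments describe:

```php
<?php
// Hypothetical helper: reverse a UTF-8 string character by character.
function utf8_strrev(string $str): string
{
    // An empty pattern with /u splits between every UTF-8 character.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return implode('', array_reverse($chars));
}

echo utf8_strrev("caf\xC3\xA9"), "\n"; // éfac
```

This reverses code points, so a base letter followed by a separate combining mark will end up with the mark in front of it.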

wordwrap()

View comments for possible workarounds.

chunk_split()

No known workarounds yet.
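
In the absence of a built-in alternative, a character-based version can be sketched with mb_strlen() and mb_substr(). The function name mb_chunk_split_sketch is hypothetical, UTF-8 is assumed as the default encoding, and edge cases such as the empty string are not guaranteed to match chunk_split() exactly:

```php
<?php
// Hypothetical helper: like chunk_split(), but counts characters, not bytes.
function mb_chunk_split_sketch(
    string $str,
    int $length = 76,
    string $end = "\r\n",
    string $encoding = 'UTF-8'
): string {
    if ($length < 1) {
        throw new InvalidArgumentException('Chunk length must be at least 1');
    }
    $out   = '';
    $total = mb_strlen($str, $encoding);
    // Append $end after every chunk, including the last one.
    for ($i = 0; $i < $total; $i += $length) {
        $out .= mb_substr($str, $i, $length, $encoding) . $end;
    }
    return $out;
}

// "éèê" split into chunks of two characters, separated by "|".
echo mb_chunk_split_sketch("\xC3\xA9\xC3\xA8\xC3\xAA", 2, '|'), "\n"; // éè|ê|
```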

ucfirst()

View the comment posted on "11-Feb-2008 04:31" for a possible workaround.
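
A typical workaround, which may or may not match the referenced comment, upper-cases the first character with mb_strtoupper() and appends the rest unchanged. The function name mb_ucfirst_sketch is hypothetical and UTF-8 is assumed as the default encoding:

```php
<?php
// Hypothetical helper: multi-byte aware ucfirst().
function mb_ucfirst_sketch(string $str, string $encoding = 'UTF-8'): string
{
    $first = mb_substr($str, 0, 1, $encoding);    // first character
    $rest  = mb_substr($str, 1, null, $encoding); // everything after it
    return mb_strtoupper($first, $encoding) . $rest;
}

echo mb_ucfirst_sketch("\xC3\xA9cole"), "\n"; // École
```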

lcfirst()

This function is flagged because its companion function, ucfirst(), is not safe. However, lcfirst() itself is untested.

trim(), rtrim() and ltrim()

These may be multi-byte safe if you use UTF-8 only (multi-byte UTF-8 characters contain no bytes that resemble white space). Avoid UTF-16 and UTF-32, among others.
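
The following sketch shows both halves of that statement, assuming the mbstring extension is available for the conversion. The character U+0120 was picked for illustration because its UTF-16BE encoding ends in the byte 0x20, which is an ASCII space:

```php
<?php
// UTF-8: trim() removes the surrounding white space and leaves é intact.
$utf8    = "  caf\xC3\xA9 \n";
$trimmed = trim($utf8);

// UTF-16BE: U+0120 is encoded as the bytes 0x01 0x20.
$utf16  = mb_convert_encoding("\xC4\xA0", 'UTF-16BE', 'UTF-8');
$broken = trim($utf16); // strips the trailing 0x20 and corrupts the character

var_dump($trimmed === "caf\xC3\xA9"); // bool(true)
var_dump(strlen($utf16));             // int(2)
var_dump(strlen($broken));            // int(1)
```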

strip_tags()

It may be multi-byte safe if you use UTF-8 only (multi-byte UTF-8 characters contain no bytes that resemble less-than or greater-than symbols). Avoid UTF-16 and UTF-32, among others.

ucwords()

Try this code instead:

$str = mb_convert_case($str, MB_CASE_TITLE, "UTF-8");
