
Using Multi-Byte Character Sets in PHP (Unicode, UTF-8, etc)

Wednesday, 15 October 08, 9:15 am
The following list details the PHP string functions that can cause problems when handling multi-byte strings. A multi-byte safe alternative is given where one is available:

mail(): Try mb_send_mail() instead.


strlen(): Try mb_strlen() instead.


strpos(): Try mb_strpos() instead.


strrpos(): Try mb_strrpos() instead.


substr(): Try mb_substr() instead.


strtolower(): Try mb_strtolower() instead.


strtoupper(): Try mb_strtoupper() instead.


substr_count(): Try mb_substr_count() instead.


ereg(): Try mb_ereg() instead.


eregi(): Try mb_eregi() instead.


ereg_replace(): Try mb_ereg_replace() instead.


eregi_replace(): Try mb_eregi_replace() instead.


To avoid having to recompile PHP with the PCRE UTF-8 flag enabled, you can add the sequence (*UTF8) at the start of your pattern. For example, '/(*UTF8)[[:alnum:]]/' will match 'é' where '/[[:alnum:]]/' will not. The /u pattern modifier also provides UTF-8 awareness. The preg_* functions are contentious, because careful use can be safe. If you are unsure what to do, see mb_eregi() as a possible replacement.


Investigate the /u pattern modifier, which provides UTF-8 awareness. The preg_* functions are contentious, because careful use can be safe. If you are unsure what to do, see mb_ereg_replace() as a possible replacement.
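
As a rough illustration of what the /u modifier changes for the preg_* functions, here is a minimal sketch. The sample string and variable names are made up for this example, and the character is written as escaped bytes so the result does not depend on how the source file is saved.

```php
<?php
// Hypothetical sample: "é" written as its two UTF-8 bytes (0xC3 0xA9).
$word = "\xC3\xA9";

// Without /u, PCRE works byte by byte, so "." matches twice.
$byteMatches = preg_match_all('/./', $word, $m1);

// With /u, the subject is treated as UTF-8, so "." matches one character.
$charMatches = preg_match_all('/./u', $word, $m2);

// With /u, \p{L} (any Unicode letter) recognises the accented character.
$isLetter = preg_match('/^\p{L}+$/u', $word) === 1;

echo $byteMatches, ' ', $charMatches, ' ', var_export($isLetter, true), "\n";
```

The same pattern behaves differently depending on the modifier, which is why the preg_* functions are only safe when /u is applied consistently.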


Try mb_split() instead.


Try mb_split() instead.


stripos(): Try mb_stripos() instead.


stristr(): Try mb_stristr() instead.


strrchr(): Try mb_strrchr() instead.


strripos(): Try mb_strripos() instead.


strstr(): Try mb_strstr() instead.


View comments for possible workarounds.


View comments for possible workarounds.


No known workarounds yet.


View the comment posted on "11-Feb-2008 04:31" for a possible workaround.


This function is flagged because its companion function (ucfirst) is not safe. However, this function is untested.


It may be multi-byte safe if you use UTF-8 only (multi-byte UTF-8 characters contain no byte sequences that resemble white space). Avoid UTF-16 & UTF-32, among others.


It may be multi-byte safe if you use UTF-8 only (multi-byte UTF-8 characters contain no byte sequences that resemble less-than or greater-than symbols). Avoid UTF-16 & UTF-32, among others.


Try this code instead:
$str = mb_convert_case($str, MB_CASE_TITLE, "UTF-8");
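
To see why the byte-oriented functions listed above go wrong, a short comparison may help. This is only a sketch: it assumes the mbstring extension is loaded and uses a made-up sample string, written as escaped bytes so it is UTF-8 regardless of the file encoding.

```php
<?php
// Hypothetical sample: "héllo wörld" in UTF-8 (é and ö take two bytes each).
$str = "h\xC3\xA9llo w\xC3\xB6rld";
$enc = 'UTF-8';

// strlen() counts bytes; mb_strlen() counts characters.
$byteLen = strlen($str);
$charLen = mb_strlen($str, $enc);

// strpos() returns a byte offset; mb_strpos() returns a character offset.
$bytePos = strpos($str, 'w');
$charPos = mb_strpos($str, 'w', 0, $enc);

// substr() can cut a character in half; mb_substr() cuts on character boundaries.
$broken = substr($str, 0, 2);
$intact = mb_substr($str, 0, 2, $enc);

// Case conversion that understands accented letters.
$upper = mb_strtoupper($str, $enc);
$title = mb_convert_case($str, MB_CASE_TITLE, $enc);

echo "$byteLen bytes, $charLen characters\n";
echo "$title\n";
```

Passing the encoding explicitly, as done here, avoids relying on whatever mb_internal_encoding() happens to be set to on a given server.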
