
Using Multi-Byte Character Sets in PHP (Unicode, UTF-8, etc)

Wednesday, 15 October 08, 9:15 am
The following list details the PHP string functions that can cause problems when handling multi-byte strings. A multi-byte-safe alternative is given where one is available:

mail()
Try mb_send_mail() instead.

strlen()
Try mb_strlen() instead.

strpos()
Try mb_strpos() instead.

strrpos()
Try mb_strrpos() instead.

substr()
Try mb_substr() instead.

strtolower()
Try mb_strtolower() instead.

strtoupper()
Try mb_strtoupper() instead.

substr_count()
Try mb_substr_count() instead.

ereg()
Try mb_ereg() instead.

eregi()
Try mb_eregi() instead.

ereg_replace()
Try mb_ereg_replace() instead.

eregi_replace()
Try mb_eregi_replace() instead.


To avoid having to recompile PHP with the PCRE UTF-8 flag enabled, you can add the sequence (*UTF8) at the start of your pattern; e.g. '/(*UTF8)[[:alnum:]]/' will match 'é' where '/[[:alnum:]]/' will not. The /u pattern modifier also provides UTF-8 awareness. The preg_* functions are contentious, because careful use can be safe. If you are unsure what to do, see mb_eregi() as a possible replacement.


Investigate the /u pattern modifier, which provides UTF-8 awareness. The preg_* functions are contentious, because careful use can be safe. If you are unsure what to do, see mb_ereg_replace() as a possible replacement.
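
As a minimal sketch of what the /u modifier changes, the snippet below counts matches of "." against a single "é" with and without it. The variable names are only illustrative, and the string is written with byte escapes so the result does not depend on how the source file is saved.

```php
<?php
// "é" encoded as UTF-8 is the two bytes 0xC3 0xA9.
$s = "\xC3\xA9";

// Without /u, "." matches one byte at a time, so one character gives two matches.
$byteMatches = preg_match_all('/./', $s, $m1);

// With /u, the pattern and subject are treated as UTF-8 and "." matches the whole character.
$charMatches = preg_match_all('/./u', $s, $m2);

var_dump($byteMatches); // int(2)
var_dump($charMatches); // int(1)
```

Part of the "careful use" is making sure the subject really is valid UTF-8 before relying on /u.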


Try mb_split() instead.


Try mb_split() instead.


stripos()
Try mb_stripos() instead.

stristr()
Try mb_stristr() instead.

strrchr()
Try mb_strrchr() instead.

strripos()
Try mb_strripos() instead.

strstr()
Try mb_strstr() instead.


View comments for possible workarounds.


View comments for possible workarounds.


No known workarounds yet.


View the comment posted on "11-Feb-2008 04:31" for a possible workaround.


This function is flagged because its companion function (ucfirst) is not safe. However, this function is untested.


It may be multi-byte safe if you use UTF-8 only (multi-byte UTF-8 characters contain no byte sequences that resemble white space). Avoid UTF-16 & UTF-32, among others.


It may be multi-byte safe if you use UTF-8 only (multi-byte UTF-8 characters contain no byte sequences that resemble less-than or greater-than symbols). Avoid UTF-16 & UTF-32, among others.
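
The encoding caveat can be illustrated with a small sketch. It assumes trim() as the example of a byte-oriented function that looks for white space, and picks U+2020 (the dagger) as a sample character because its UTF-16BE encoding happens to consist of two 0x20 bytes; it also assumes the mbstring extension is loaded.

```php
<?php
// U+2020 (dagger) as UTF-8: three bytes, none of which is an ASCII byte.
$utf8 = "\xE2\x80\xA0";

// The same character in UTF-16BE is the bytes 0x20 0x20, which trim() sees as two spaces.
$utf16 = mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8');

var_dump(trim(" $utf8 ") === $utf8); // bool(true): only the ASCII padding is removed
var_dump(trim($utf16) === '');       // bool(true): the whole character is stripped
```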


Try this code instead:
$str = mb_convert_case($str, MB_CASE_TITLE, "UTF-8");
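
To show the kind of difference the list is about, here is a short sketch comparing a few byte-oriented functions with their mb_* counterparts. The sample strings are hypothetical and written with byte escapes ("héllo wörld" and "élan vital" in UTF-8); the mbstring extension is assumed to be available.

```php
<?php
// "héllo wörld": é and ö each take two bytes in UTF-8.
$str = "h\xC3\xA9llo w\xC3\xB6rld";

var_dump(strlen($str));                     // int(13): counts bytes
var_dump(mb_strlen($str, 'UTF-8'));         // int(11): counts characters

var_dump(strpos($str, 'w'));                // int(7): byte offset
var_dump(mb_strpos($str, 'w', 0, 'UTF-8')); // int(6): character offset

// Upper-casing that also handles the accented letters.
$upper = mb_strtoupper($str, 'UTF-8');      // "HÉLLO WÖRLD"

// Title-casing a word that starts with a multi-byte character.
$title = mb_convert_case("\xC3\xA9lan vital", MB_CASE_TITLE, 'UTF-8'); // "Élan Vital"
```

Passing the encoding explicitly, as above, avoids depending on the configured internal encoding.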

 