UTF8 Multibyte behavior
A simple way to make sure all your persisted data is stored in UTF-8 (or any other single encoding)
While I was developing my CMS, we began to get foreign customers, and I quickly realised that storing all data in ISO-8859-15 (the charset we used for Danish) was a rather bad idea.
But just the thought of adding UTF-8 handling to each of my CRUD methods gave me the chills :)
As the bleeding-edge guy I am, I was of course developing in CakePHP 1.2.x.x, so why not take full advantage of this and develop a behavior that integrates seamlessly into any application and automatically handles all the trivial work of encoding / decoding data?
** Updated 23/03/2007 **
A few bug fixes, and a new parameter array with separate 'save' and 'read' settings
A few examples:
Example 1
This uses the default settings:
Encode to UTF-8 on save
Decode to ISO-8859-15 on read
Model Class:
<?php
class Page extends Model {
var $name = "Page";
var $actsAs = array('Utf8');
}
?>
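To make these defaults concrete, here is a minimal standalone sketch of the round trip, written outside CakePHP. The helper names `toStorage` and `fromStorage` are hypothetical and not part of the behavior, and the sketch assumes the mbstring extension is available (the behavior itself falls back to utf8_encode / utf8_decode when 'useMbstring' is false).

```php
<?php
// Hypothetical helpers, not part of the behavior: they mimic what happens
// around save and find with the default settings.

// Convert a single-byte string to UTF-8 before it is persisted.
function toStorage($value, $from = 'ISO-8859-15') {
    return mb_convert_encoding($value, 'UTF-8', $from);
}

// Convert a stored UTF-8 string back to the application's encoding.
function fromStorage($value, $to = 'ISO-8859-15') {
    return mb_convert_encoding($value, $to, 'UTF-8');
}

// A Danish word written as raw ISO-8859-15 bytes (E5, E6 and F8 are the
// single-byte codes for the three Danish letters).
$latin = "Bl\xE5b\xE6rgr\xF8d";

$stored   = toStorage($latin);    // what would end up in the database
$readBack = fromStorage($stored); // what the application would see again
```

Each of the three non-ASCII letters takes two bytes in UTF-8, so the stored string is three bytes longer than the original, and converting it back yields the original bytes.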
Example 2
Model Class:
<?php
class Page extends Model {
var $name = "Page";
var $actsAs = array('Utf8' => array('save' => array('convertTo' => 'UTF-8') ) );
}
?>
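A partial per-mode array like the one in Example 2 has to be combined with the defaults somehow. The sketch below, in plain PHP with a hypothetical helper name `mergeModeSettings`, compares two ways of doing that: a flat array_merge() at the top level replaces the whole 'save' array, so keys left out of the partial config are lost, while merging one mode at a time keeps them. The defaults shown are a trimmed-down copy used only for illustration.

```php
<?php
// Hypothetical helper: merge user settings into defaults one mode at a time,
// so a partial 'save' or 'read' array keeps the remaining default keys.
function mergeModeSettings(array $defaults, array $config) {
    $settings = $defaults;
    foreach ($defaults as $mode => $modeDefaults) {
        if (isset($config[$mode]) && is_array($config[$mode])) {
            $settings[$mode] = array_merge($modeDefaults, $config[$mode]);
        }
    }
    return $settings;
}

// Trimmed-down copy of the behavior's defaults (illustration only).
$defaults = array(
    'save' => array('useMbstring' => false, 'convertTo' => 'UTF-8'),
    'read' => array('useMbstring' => false, 'convertTo' => 'ISO-8859-15'),
);

// The same override as in Example 2.
$config = array('save' => array('convertTo' => 'UTF-8'));

$shallow = array_merge($defaults, $config);       // 'save' loses 'useMbstring'
$perMode = mergeModeSettings($defaults, $config); // 'save' keeps 'useMbstring'
```

If your settings come out with missing keys such as 'useMbstring', passing the full 'save' or 'read' array in `$actsAs` is a simple workaround.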
Behavior Class:
<?php
/**
* @copyright Copyright (c) 2007 Enovo
* @author Christian Winther
* @link http://www.enovo.dk
* @filesource
* @since 1.1
* @package sw.model.behaviors
* @modifiedby $LastChangedBy:$
* @lastmodified $Date:$
* @svn $Id:$
*/
class Utf8Behavior extends ModelBehavior {
/**
* Default settings for our model, one set for 'save' and one for 'read'
*
* 'convertTo' is the target output encoding
*
* 'primaryOnly' controls whether afterFind converts only when the model is the primary model
*
* 'useMbstring' enables / disables the use of mbstring to convert strings
*
* 'convertFrom' can be either
* - 'auto' :
* Attempts to auto-detect the source encoding
* - array('UTF-8', ...)
* A list of possible encodings to try
*
* @var array
*/
var $defaultSettings = array(
'save' => array(
'useMbstring' => false,
'convert' => true,
'convertTo' => 'UTF-8',
'convertFrom' => array('ISO-8859-15','UTF-8')
),
'read' => array(
'useMbstring' => false,
'convert' => true,
'primaryOnly' => true,
'convertTo' => 'ISO-8859-15',
'convertFrom' => array('UTF-8')
)
);
/**
* List of valid encodings
*
* @var array
*/
var $validEncodings;
/**
* List of model settings
*
* @var array
*/
var $settings = array();
/**
* Setup callback
*
* @param AppModel $model
* @param array $config
*/
function setup(&$model, $config = array() )
{
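if( true === empty( $config ) ) { $config = array(); }
// Merge user settings with the defaults one mode at a time, so that a
// partial 'save' or 'read' array keeps the remaining default keys
$settings = $this->defaultSettings;
foreach ( $settings AS $name => $defaults )
{
if( true === isset( $config[ $name ] ) && true === is_array( $config[ $name ] ) )
{
$settings[ $name ] = am( $defaults, $config[ $name ] );
}
}
foreach ( $settings AS $mode )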
{
if( true === $mode['useMbstring'] && false !== $mode['convertTo'] )
{
if( false === function_exists('mb_convert_encoding') )
{
trigger_error('Sorry, your PHP version does not support mbstring functions. Please read notes at http://php.net/mbstring',E_USER_ERROR);
}
// Check if we have a list of all valid encodings supported by PHP
if( true === empty( $this->validEncodings ) )
{
// Build the list of valid encodings
$this->validEncodings = mb_list_encodings();
}
// Check if we have valid encodings in our list
if( false === array_search( $mode['convertTo'], $this->validEncodings ) )
{
trigger_error('Invalid target encoding for "'.$model->name.'::find" - '. $mode['convertTo'] .' is not valid!', E_USER_ERROR );
}
}
}
$this->settings[ $model->name ] = $settings;
}
/**
* Callback for when model is saving
*
* @param AppModel $model
*/
function beforeSave(&$model)
{
$settings = $this->settings[ $model->name ]['save'];
if( false === $settings['convert'] || false === $settings['convertTo'] ) {
return true;
}
// Should we encode using mbstring ?
if( true === $settings['useMbstring'] )
{
$model->data = $this->doMultibyte( $model->data, $settings );
}
else
{
$model->data = $this->doEncode( $model->data, $settings );
}
return true;
}
/**
* Callback for when model is reading
*
* @param AppModel $model
* @param array $results
* @param boolean $primary
*/
function afterFind(&$model, $results, $primary)
{
$settings = $this->settings[ $model->name ]['read'];
if( false === $settings['convert'] )
{
return $results;
}
// Check if we should only handle primary model data
if( true === $settings['primaryOnly'] && true !== $primary ) {
return $results;
}
// Should we decode using mbstring ?
if( true === $settings['useMbstring'] ) {
return $this->doMultibyte( $results, $settings );
}
// Plain utf8_decode(); note that it always produces ISO-8859-1, whatever 'convertTo' is set to
return $this->doDecode( $results, $settings );
}
/**
* Decode UTF-8 to another encoding, with multibyte support
*
* @param mixed $data
* @param array $settings
* @return mixed
*/
function doMultibyte( $data, $settings ) {
if( true === is_array( $data ) ) {
if( 0 === count( $data ) ) {
return $data;
}
foreach ( $data AS $key => $name ) {
$data[ $key ] = $this->doMultibyte( $name, $settings );
}
return $data;
}
return mb_convert_encoding( $data, $settings['convertTo'], $settings['convertFrom'] );
}
/**
* Decode UTF-8 back to ISO-8859-1 single-byte encoding
*
* @param mixed $data
* @param array $settings
* @return mixed
*/
function doDecode( $data, $settings ) {
if( true === is_array( $data ) ) {
if( 0 === count( $data ) ) {
return $data;
}
foreach ( $data AS $key => $name ) {
$data[ $key ] = $this->doDecode( $name, $settings );
}
return $data;
}
return utf8_decode($data);
}
/**
* Do the converting of data to UTF-8, recursive
*
* @param array $data
* @param array $settings
* @return array
*/
function doEncode( $data, $settings ) {
if( true === is_array( $data ) ) {
if( 0 === count( $data ) ) {
return $data;
}
foreach ( $data AS $key => $name ) {
$data[ $key ] = $this->doEncode( $name, $settings );
}
return $data;
}
if( true === $this->isUTF8( $data ) ) {
return $data;
}
return utf8_encode($data);
}
/**
* Method to check if a string is UTF-8
*
* @param string $string
* @return boolean
*/
function isUTF8($string)
{
// from http://w3.org/International/questions/qa-forms-utf-8.html
return 0 != preg_match('%^(?:
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$%xs', $string);
}
}
?>
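The recursive "encode unless it is already UTF-8" idea behind doEncode() and isUTF8() can be tried outside CakePHP with a short sketch. This is not the behavior's own code: the function name `encodeToUtf8` is hypothetical, mb_check_encoding() stands in for the regular expression, and mb_convert_encoding() stands in for utf8_encode(), so it assumes the mbstring extension is available.

```php
<?php
// Hypothetical standalone version of the doEncode() idea.
function encodeToUtf8($data, $from = 'ISO-8859-1') {
    if (is_array($data)) {
        // Recurse into nested result arrays, keeping the keys.
        foreach ($data as $key => $value) {
            $data[$key] = encodeToUtf8($value, $from);
        }
        return $data;
    }
    // Leave non-strings (ids, null, booleans) untouched.
    if (!is_string($data)) {
        return $data;
    }
    // Already valid UTF-8 (this includes plain ASCII): do not encode twice.
    if (mb_check_encoding($data, 'UTF-8')) {
        return $data;
    }
    return mb_convert_encoding($data, 'UTF-8', $from);
}

$row = array('Page' => array(
    'id'    => 7,
    'title' => "Sm\xF8rrebr\xF8d",         // ISO-8859-1 bytes
    'body'  => "Sm\xC3\xB8rrebr\xC3\xB8d", // already UTF-8
));
$encoded = encodeToUtf8($row);
```

After the call, both 'title' and 'body' hold the same UTF-8 bytes and 'id' is unchanged. The validity check is a heuristic: a single-byte string can happen to be valid UTF-8, in which case it is passed through unconverted.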
Comments
Question
1 UTF8
class AppModel extends Model{
static $utf8IsSet = false;
function __construct(){
if(!self::$utf8IsSet) {
$this->execute("SET NAMES 'utf8'");
self::$utf8IsSet = true;
}
parent::__construct();
}
}
code by Andreas Waidelich.
http://groups.google.com/group/cake-php/browse_thread/thread/902d931ff87eb8ac/c4ca2c14891df179
You also have to convert the database and its tables to the UTF-8 charset;
see http://drupal.org/node/105151 for an example.
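As a rough sketch of that conversion step, assuming MySQL 4.1 or later, the statements could be generated like this. The helper name `buildUtf8ConversionSql`, the database name and the table names are all placeholders. The helper only builds strings; `ALTER TABLE ... CONVERT TO CHARACTER SET` rewrites column data, so try the statements on a copy of the database first.

```php
<?php
// Hypothetical helper: build the SQL needed to switch a database and its
// tables to utf8. It only returns strings; running them is left to you.
function buildUtf8ConversionSql($database, array $tables) {
    $sql = array("ALTER DATABASE `$database` DEFAULT CHARACTER SET utf8");
    foreach ($tables as $table) {
        $sql[] = "ALTER TABLE `$table` CONVERT TO CHARACTER SET utf8";
    }
    return $sql;
}

// 'my_cms', 'pages' and 'users' are placeholder names.
$statements = buildUtf8ConversionSql('my_cms', array('pages', 'users'));
foreach ($statements as $statement) {
    echo $statement, ";\n";
}
```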
You may also need to modify
/app/views/layouts/default.thtml
to include:
echo $html->charset('UTF-8');
I'm not sure whether there are other ways to change the HTML page encoding for the application.
So far everything looks fine: my first tiny app can read Unicode text input from a browser and render it back to the user.
Comment
2 utf8 all over
function __construct($id = null, $table = null, $ds = null)
{
parent::__construct($id, $table, $ds);
if (!defined('MYSQL_SET_NAMES_UTF8') && $this->useTable!==false)
{
$this->execute("SET NAMES 'UTF8'");
define('MYSQL_SET_NAMES_UTF8', true);
}
}
Comment
3 Take a look at DBOs
As I keep all my data in UTF-8 (MySQL 4.1.xxx), I used to do almost the same thing before I dug deeper into the code.
Now in my database.php I add a key 'encoding' like this:
var $default = array(
'driver' => 'mysqli',
'persistent' => false,
'host' => 'localhost',
'login' => 'root',
'password' => 'pass',
'database' => 'baza',
'encoding' => 'UTF8',
'prefix' => ''
);
See the setEncoding method in the mysql(i) and postgres DBOs.