UTF8 Multibyte behavior

By Christian Winther aka "Jippi"
A simple way to make sure all your persisted data is stored in UTF-8 (or at least in one consistent encoding).
While I was developing my CMS we began to get foreign customers, and I quickly realised that storing all data in ISO-8859-15 (the charset we used for Danish) was a rather bad idea.

But just the thought of adding UTF-8 handling to each of my CRUD methods gave me the chills :)

Being the bleeding-edge guy I am, I was of course developing on 1.2.x.x, so why not take full advantage of that and write a behavior that integrates seamlessly into any application and automatically handles all the trivial work of encoding / decoding data?

** Updated 23/03/2007 **
A few bugfixes, and new param array (read/write)


A few examples:

Example 1


This uses the default settings:
Encode to UTF-8 on save
Decode to ISO-8859-15 on read

Model Class:

<?php
class Page extends Model {
    var $name = "Page";
    var $actsAs = array('Utf8');
}
?>
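To make the defaults concrete, here is a minimal sketch of what "encode to UTF-8 on save, decode to ISO-8859-15 on read" amounts to for a single string. It is not the behavior's own code: it assumes the mbstring extension is loaded, and the variable names are purely illustrative.

```php
<?php
// Sketch only: a single-string round trip, assuming ext-mbstring is available.

// "æøå€" as ISO-8859-15 bytes (the euro sign is 0xA4 in ISO-8859-15).
$latin9 = "\xE6\xF8\xE5\xA4";

// Roughly what "encode to UTF-8 on save" does to each field.
$stored = mb_convert_encoding($latin9, 'UTF-8', 'ISO-8859-15');

// Roughly what "decode to ISO-8859-15 on read" does to each field.
$readBack = mb_convert_encoding($stored, 'ISO-8859-15', 'UTF-8');

echo bin2hex($stored), "\n";   // c3a6c3b8c3a5e282ac
echo bin2hex($readBack), "\n"; // e6f8e5a4
```

The application keeps seeing single-byte ISO-8859-15 strings, while the database only ever stores the multi-byte UTF-8 form.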


Example 2


Passes the 'save' settings explicitly (here only the target encoding, 'convertTo'):

Model Class:

<?php
class Page extends Model {
    var $name = "Page";
    var $actsAs = array('Utf8' => array('save' => array('convertTo' => 'UTF-8') ) );
}
?>
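A note on nested settings like these: PHP's array_merge() is shallow, so if user settings are merged into the defaults at the top level only, passing a partial 'save' array replaces the whole 'save' block and keys such as 'useMbstring' disappear. The sketch below contrasts that with a per-mode merge. The trimmed-down $defaults array and the mergeModeSettings() helper are hypothetical and exist only for illustration; they are not part of CakePHP.

```php
<?php
// Sketch: shallow vs. per-mode merging of nested behavior settings.
// $defaults is a trimmed-down stand-in for the behavior's defaults.
$defaults = array(
    'save' => array('useMbstring' => false, 'convertTo' => 'UTF-8'),
    'read' => array('useMbstring' => false, 'convertTo' => 'ISO-8859-15'),
);

// The user only overrides one key of the 'save' mode.
$config = array('save' => array('convertTo' => 'UTF-8'));

// Shallow merge: the whole 'save' block is replaced, 'useMbstring' is lost.
$shallow = array_merge($defaults, $config);

// Hypothetical helper: merge each mode's settings separately.
function mergeModeSettings(array $defaults, array $config) {
    $merged = $defaults;
    foreach ($defaults as $mode => $modeDefaults) {
        if (isset($config[$mode]) && is_array($config[$mode])) {
            $merged[$mode] = array_merge($modeDefaults, $config[$mode]);
        }
    }
    return $merged;
}

$perMode = mergeModeSettings($defaults, $config);

var_dump(array_key_exists('useMbstring', $shallow['save'])); // bool(false)
var_dump(array_key_exists('useMbstring', $perMode['save'])); // bool(true)
```

If in doubt, spelling out every key of a mode in $actsAs avoids the question entirely.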


Behavior Class:

<?php
/**
 * @copyright       Copyright (c) 2007 Enovo
 * @author          Christian Winther
 * @link            http://www.enovo.dk
 * @filesource
 * @since           1.1
 * @package         sw.model.behaviors
 * @modifiedby      $LastChangedBy:$
 * @lastmodified    $Date:$
 * @svn             $Id:$
 */
class Utf8Behavior extends ModelBehavior {
    
    /**
     * Default settings for our model, one block per mode ('save' and 'read')
     *
     * 'convert' enables / disables conversion for that mode
     *
     * 'convertTo' is the target output encoding
     *
     * 'primaryOnly' is if 'finder' should only convert if it's the primary model
     *
     * 'useMbstring' enables / disables the use of mbstring to convert strings
     *
     * 'convertFrom' can either be
     *      - auto :
     *          Attempts to auto detect the source encoding
     *      - array('UTF-8', ...)
     *          A list of possible encodings to try
     *
     * @var array
     */
    var $defaultSettings = array(
        'save'  => array(
            'useMbstring'   => false,
            'convert'       => true,
            'convertTo'     => 'UTF-8',
            'convertFrom'   => array('ISO-8859-15','UTF-8')
        ),
        'read'  => array(
            'useMbstring'   => false,
            'convert'       => true,
            'primaryOnly'   => true,
            'convertTo'     => 'ISO-8859-15',
            'convertFrom'   => array('UTF-8')
        )
    );

    
    /**
     * List of valid encodings
     *
     * @var array
     */
    var $validEncodings;

    /**
     * List of model settings
     *
     * @var array
     */
    var $settings = array();

    
    /**
     * Setup callback
     *
     * @param AppModel $model
     * @param array $config
     */
    function setup(&$model, $config = array() )
    {
        if( true === empty( $config ) ) { $config = array(); }

        // Merge user settings with default, one mode at a time.
        // am() is a shallow merge, so merging only the top level would drop
        // every default of a mode that the user configures partially.
        $settings = $this->defaultSettings;
        foreach ( $settings AS $modeName => $defaults )
        {
            if( true === isset( $config[ $modeName ] ) && true === is_array( $config[ $modeName ] ) )
            {
                $settings[ $modeName ] = am( $defaults, $config[ $modeName ] );
            }
        }

        foreach ( $settings AS $modeName => $mode )
        {
            if( true === $mode['useMbstring'] && false !== $mode['convertTo'] )
            {
                if( false === function_exists('mb_convert_encoding') )
                {
                    trigger_error('Sorry, your PHP installation does not support mbstring functions. Please read notes at http://php.net/mbstring', E_USER_ERROR);
                }

                // Build the list of valid encodings supported by PHP (once)
                if( true === empty( $this->validEncodings ) )
                {
                    $this->validEncodings = mb_list_encodings();
                }

                // Check that the target encoding is in our list
                if( false === array_search( $mode['convertTo'], $this->validEncodings ) )
                {
                    trigger_error('Invalid target encoding for "'.$model->name.'" ('.$modeName.') - '. $mode['convertTo'] .' is not valid!', E_USER_ERROR );
                }
            }
        }
        $this->settings[ $model->name ] = $settings;
    }

    
    /**
     * Callback for when model is saving
     *
     * @param AppModel $model
     * @return boolean Always true, so the save continues
     */
    function beforeSave(&$model)
    {
        $settings = $this->settings[ $model->name ]['save'];

        // Skip conversion if it is switched off or no target encoding is set
        if( false === $settings['convert'] || false === $settings['convertTo'] ) {
            return true;
        }

        // Should we encode using mbstring ?
        if( true === $settings['useMbstring'] )
        {
            $model->data = $this->doMultibyte( $model->data, $settings );
        }
        else
        {
            $model->data = $this->doEncode( $model->data, $settings );
        }
        return true;
    }

    
    /**
     * Callback for when model is reading
     *
     * @param AppModel $model
     * @param array $results
     * @param boolean $primary
     * @return array
     */
    function afterFind(&$model, $results, $primary)
    {
        $settings = $this->settings[ $model->name ]['read'];

        if( false === $settings['convert'] )
        {
            return $results;
        }

        // Check if we should only handle primary model data
        if( true === $settings['primaryOnly'] && true !== $primary ) {
            return $results;
        }

        // Should we decode using mbstring ?
        if( true === $settings['useMbstring'] ) {
            return $this->doMultibyte( $results, $settings );
        }

        // Normal utf8 decode to ISO-8859-1
        return $this->doDecode( $results, $settings );
    }

    
    /**
     * Convert data from 'convertFrom' to 'convertTo' using mbstring, recursive
     *
     * @param mixed $data
     * @param array $settings
     * @return mixed
     */
    function doMultibyte( $data, $settings ) {
        if( true === is_array( $data ) ) {
            if( 0 === count( $data ) ) {
               return $data;
            }
            foreach ( $data AS $key => $name ) {
                // Recurse with mbstring as well, so nested arrays are
                // converted with the same settings as top-level values
                $data[ $key ] = $this->doMultibyte( $name, $settings );
            }
            return $data;
        }
        return mb_convert_encoding( $data, $settings['convertTo'], $settings['convertFrom'] );
    }

    
    /**
     * Decode UTF-8 back to ISO-8859-1 single-byte encoding, recursive
     *
     * Note: utf8_decode() always targets ISO-8859-1; 'convertTo' is not used here
     *
     * @param mixed $data
     * @param array $settings
     * @return mixed
     */
    function doDecode( $data, $settings ) {
        if( true === is_array( $data ) ) {
            if( 0 === count( $data ) ) {
               return $data;
            }
            foreach ( $data AS $key => $name ) {
                $data[ $key ] = $this->doDecode( $name, $settings );
            }
            return $data;
        }
        return utf8_decode($data);
    }

    
    /**
     * Do the converting of data to UTF-8, recursive
     *
     * @param array $data
     * @param array $settings
     * @return array
     */
    function doEncode( $data, $settings ) {
        if( true === is_array( $data ) ) {
            if( 0 === count( $data ) ) {
               return $data;
            }
            foreach ( $data AS $key => $name ) {
                $data[ $key ] = $this->doEncode( $name, $settings );
            }
            return $data;
        }
        // Leave strings alone that already are valid UTF-8
        if( true === $this->isUTF8( $data ) ) {
            return $data;
        }
        return utf8_encode($data);
    }

    
    /**
     * Method to check if a string is UTF-8
     *
     * @param string $string
     * @return boolean
     */
    function isUTF8($string)
    {
        // from http://w3.org/International/questions/qa-forms-utf-8.html
        return 0 != preg_match('%^(?:
                 [\x09\x0A\x0D\x20-\x7E]            # ASCII
               | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
               |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
               | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
               |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
               |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
               | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
               |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
           )*$%xs', $string);
    }
}
?>
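One caveat when mbstring is switched off: doDecode() calls utf8_decode(), which converts to ISO-8859-1 rather than ISO-8859-15, so 'convertTo' has no effect on that path. The two charsets differ in a handful of positions, the euro sign being the best-known one. The sketch below illustrates the difference. It uses mb_convert_encoding() with an explicit ISO-8859-1 target as a stand-in for utf8_decode() (which is deprecated as of PHP 8.2), assumes the mbstring extension is loaded, and uses illustrative variable names.

```php
<?php
// Sketch only: the euro sign exists in ISO-8859-15 but not in ISO-8859-1.

$euroUtf8 = "\xE2\x82\xAC"; // "€" encoded as UTF-8

// Stand-in for utf8_decode(): UTF-8 to ISO-8859-1. There is no euro sign in
// ISO-8859-1, so a substitute character comes back instead of a real euro.
$asLatin1 = mb_convert_encoding($euroUtf8, 'ISO-8859-1', 'UTF-8');

// UTF-8 to ISO-8859-15, where the euro sign lives at byte 0xA4.
$asLatin9 = mb_convert_encoding($euroUtf8, 'ISO-8859-15', 'UTF-8');

var_dump($asLatin1 === "\xA4"); // bool(false)
var_dump($asLatin9 === "\xA4"); // bool(true)
```

If your data can contain such characters, setting 'useMbstring' => true for the 'read' mode is probably the safer choice.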

Comments 287


Question

1 UTF8

Has anybody tried this patch for app/app_model.php?

class AppModel extends Model {
    static $utf8IsSet = false;

    function __construct() {
        if (!self::$utf8IsSet) {
            $this->execute("SET NAMES 'utf8'");
            self::$utf8IsSet = true;
        }
        parent::__construct();
    }
}
code by Andreas Waidelich.
http://groups.google.com/group/cake-php/browse_thread/thread/902d931ff87eb8ac/c4ca2c14891df179

You also have to convert the database and its tables to the UTF-8 charset;
see http://drupal.org/node/105151 for an example.

It is possible that some modification of
/app/views/layouts/default.thtml
is needed:
echo $html->charset('UTF-8');
I'm not sure whether there are any other ways to change the HTML page encoding for the application.

So far everything looks fine: my first tiny app is able to read Unicode text input from a browser and render it back to the user.
posted Mon, Apr 2nd 2007, 17:31 by frps

Comment

2 utf8 all over

I've been using this in my app_model:

function __construct($id = null, $table = null, $ds = null)
{
    parent::__construct($id, $table, $ds);

    if (!defined('MYSQL_SET_NAMES_UTF8') && $this->useTable !== false)
    {
        $this->execute("SET NAMES 'UTF8'");
        define('MYSQL_SET_NAMES_UTF8', true);
    }
}
posted Sat, May 5th 2007, 09:36 by ryan morris

Comment

3 Take a look at DBOs

(Quoting comment 2 above, which runs "SET NAMES 'UTF8'" from the app_model constructor.)


As I keep all my data in UTF-8 (MySQL 4.1.xxx), I used to do almost the same before I took a deeper dive into the code.

Now I add an 'encoding' key in my database.php, like this:


    var $default = array(
        'driver' => 'mysqli',
        'persistent' => false,
        'host' => 'localhost',
        'login' => 'root',
        'password' => 'pass',
        'database' => 'baza',
        'encoding' => 'UTF8',
        'prefix' => ''
    );


See the setEncoding method of the mysql(i) and postgres DBOs.
posted Thu, May 10th 2007, 09:29 by zeRUS
