[PHP-DEV] PDO/Unicode Migration Strategies

Sara Golemon Mon, 09 Oct 2006 09:28:19 -0700

PDO Devs, et al.;

It's that time, time to start looking at PDO's plans for the future, specifically how it'll integrate with the wild world of unicode. After working with the sqlite2 native driver and reading up on some of the other RDBMSes, I've come up with a few scenarios of varying merit for PDO that I'd like to bounce against y'all and the world at large.

(A) PDO downcodes all inbound unicode data (SQL statements, bound params, etc...) to UTF8, and upconverts return data (results) from UTF8 to UTF16 (UChar type) on return (when UG(unicode) is enabled).


  Pros: No changes to the dbh/stmt handler APIs.

  Cons: Changes to assumptions made by many (most?) drivers.
        Anywhere non-utf8 data (e.g. latin1) is expected, the data will have to be re-converted.
        Doesn't cleanly account for binary strings passed in which are not already utf-8 encoded which could easily lead to wtf when in non-unicode semantics mode (normal case for many/most users). Moreso when the driver is trying to decide if it can use the data it received as-is, or if it has to transcode to get to the right charset.
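
To make the downcoding step in (A) concrete, here is a minimal sketch in C of a UTF-16 to UTF-8 encoder. It is an illustration only: sketch_utf16_to_utf8 is a hypothetical name that exists in neither PDO nor ICU, uint16_t stands in for ICU's UChar, and a real implementation would presumably lean on ICU's converters rather than hand-roll this.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (assumption, not a PDO or ICU function): encode
 * UTF-16 code units as UTF-8. Returns the number of bytes written, or
 * (size_t)-1 on an unpaired surrogate or when dst is too small. */
size_t sketch_utf16_to_utf8(const uint16_t *src, size_t src_len,
                            unsigned char *dst, size_t dst_cap)
{
    size_t i = 0, out = 0;

    while (i < src_len) {
        uint32_t cp = src[i++];
        size_t need;

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: must be followed by a low surrogate. */
            if (i >= src_len || src[i] < 0xDC00 || src[i] > 0xDFFF) {
                return (size_t)-1;
            }
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(src[i++] - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return (size_t)-1; /* low surrogate with no high one before it */
        }

        need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (out + need > dst_cap) {
            return (size_t)-1;
        }

        if (need == 1) {
            dst[out++] = (unsigned char)cp;
        } else if (need == 2) {
            dst[out++] = (unsigned char)(0xC0 | (cp >> 6));
            dst[out++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (need == 3) {
            dst[out++] = (unsigned char)(0xE0 | (cp >> 12));
            dst[out++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            dst[out++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            dst[out++] = (unsigned char)(0xF0 | (cp >> 18));
            dst[out++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            dst[out++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            dst[out++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Note that this only covers the unicode-string half of (A). A binary string that arrives in latin1 never passes through a step like this, which is exactly the case the cons above worry about.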

(B) Change all string handling APIs (e.g. do/execute/fetch) to include a type field (zend_uchar str_type, zstr str, int str_len) so that drivers get unicode as UChar*, and non-unicode as char*.

  Pros: Leaves character set handling to the driver which is best equipped to make decisions about its quirks.
        Binary (most likely localized) data is recognized as such and can be handled appropriately.

  Cons: Puts more work on the actual driver to handle unicode conversion.
        Leads to lots of #ifdef macrory since drivers live in PECL and must still be compilable on PHP5.
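
As a rough picture of what (B) hands to a driver, the sketch below passes the (str_type, str, str_len) triple to a driver-side function that branches on the type tag. Everything prefixed sketch_ or SKETCH_ is a stand-in invented for this example: the real zend_uchar, zstr, UChar and type constants come from the Zend and ICU headers, and the real handler prototypes may well look different.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-ins (assumptions) for the engine types named in option B. */
typedef unsigned char sketch_zend_uchar;
typedef uint16_t sketch_UChar;
typedef union {
    char *s;            /* binary / native-charset string */
    sketch_UChar *u;    /* unicode string, UTF-16 code units */
    void *v;
} sketch_zstr;

enum { SKETCH_IS_STRING = 0, SKETCH_IS_UNICODE = 1 }; /* hypothetical tags */

typedef enum {
    SKETCH_PASS_THROUGH, /* binary data: hand to the database untouched */
    SKETCH_TRANSCODE     /* unicode data: driver converts to its charset */
} sketch_action;

/* Hypothetical driver-side decision: the type tag tells the driver
 * whether it received UChar* or char*, and therefore how many bytes
 * the buffer occupies and whether a conversion is required. */
sketch_action sketch_driver_classify(sketch_zend_uchar str_type,
                                     sketch_zstr str, int str_len,
                                     size_t *byte_len)
{
    (void)str; /* contents are not inspected in this sketch */

    if (str_type == SKETCH_IS_UNICODE) {
        *byte_len = (size_t)str_len * sizeof(sketch_UChar);
        return SKETCH_TRANSCODE;
    }
    *byte_len = (size_t)str_len;
    return SKETCH_PASS_THROUGH;
}
```

The point of the tag is that the driver never has to guess: a char* buffer is known to be user-supplied binary data, and a UChar* buffer is known to be unicode.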

(C) Add a UConverter *encoding_conv; element to pdo_dbh and pdo_stmt objects, and an INI setting: pdo.default_encoding. When passing data to/from a stmt object, the statement object's encoder is used if available (set during prepare), if not available the driver's converter is used (set by factory), otherwise pdo.default_encoding is used as a fallback. Data exchanges with the dbh object are similarly handled though (obviously) skipping the stmt step.


  Pros: Keeps character set conversion work out of the driver layer.
        Reduces the amount of #ifdef work for multiple version support.
        Recognizes that some drivers (SQLITE) use a single encoding universally, while others allow different tables to use different encodings.

  Cons: Doesn't solve the "do()" problem of encoding to different charsets when inserting to tables of a driver which allows different charsets per table.
        Doesn't provide an indicator which says "This came from a unicode string and was converted by ICU so is reliably in the correct encoding" versus "This was handed to me by the user as a binary string and may contain anything". Though this is also "fixable" by either changing the handler proto or by burying a state flag in the dbh/stmt objects.
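
The lookup order described in (C) can be sketched as below. This is a simplified model under stated assumptions: sketch_converter is an opaque stand-in for ICU's UConverter, the two structs carry only the proposed encoding_conv member rather than the real pdo_dbh/pdo_stmt layouts, and the converter for pdo.default_encoding is assumed to have been opened elsewhere and passed in.

```c
#include <stddef.h>

/* Opaque stand-in (assumption) for ICU's UConverter. */
typedef struct sketch_converter {
    const char *name;
} sketch_converter;

/* Reduced models of the handles: only the proposed new member is shown. */
typedef struct { sketch_converter *encoding_conv; } sketch_pdo_dbh;
typedef struct { sketch_converter *encoding_conv; } sketch_pdo_stmt;

/* Hypothetical resolver following the order in option C:
 *   1. the statement's converter, if one was set during prepare
 *   2. the driver's converter on the dbh, if the factory set one
 *   3. the converter built from pdo.default_encoding
 * Pass stmt == NULL for dbh-level calls such as do(). */
sketch_converter *sketch_pick_converter(const sketch_pdo_stmt *stmt,
                                        const sketch_pdo_dbh *dbh,
                                        sketch_converter *ini_default)
{
    if (stmt != NULL && stmt->encoding_conv != NULL) {
        return stmt->encoding_conv;
    }
    if (dbh != NULL && dbh->encoding_conv != NULL) {
        return dbh->encoding_conv;
    }
    return ini_default;
}
```

With stmt set to NULL the first step is skipped, which matches the dbh-level exchanges described above and also shows why do() cannot pick a per-table charset: there is no statement to carry one.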

Personally I like option C the best as it presents the least amount of work for individual drivers, costs the least in terms of version/ifdefs, and provides a reasonable degree of flexibility.

As mentioned however, only B provides information to the driver on the reliability of the encoding "Is this *really* utf8? Or am I going to find a stray \xA0 in here somewhere?" Of course, we currently have no such assurance, the user is simply expected to give the driver well formed data, if they don't they're SOL already.
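
For a sense of what answering "is this *really* utf8?" would cost a driver, here is a minimal well-formedness scan in C. The function name is hypothetical and nothing like it is proposed in this thread; it simply reports where the first ill-formed sequence starts, so a stray \xA0 shows up as an offset rather than as corrupted data later on.

```c
#include <stddef.h>

/* Hypothetical helper (assumption): return the byte offset of the first
 * ill-formed UTF-8 sequence in s, or len if the whole buffer is well
 * formed. Rejects stray continuation bytes, truncated sequences,
 * overlong forms, surrogates and values above U+10FFFF. */
size_t sketch_utf8_first_bad(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t n, k;           /* n = continuation bytes expected */
        unsigned long cp, min; /* min = smallest value this length may encode */

        if (c < 0x80) {
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            n = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            n = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            n = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return i; /* stray continuation byte (e.g. \xA0) or invalid lead */
        }

        if (len - i <= n) {
            return i; /* sequence runs past the end of the buffer */
        }
        for (k = 1; k <= n; k++) {
            if ((s[i + k] & 0xC0) != 0x80) {
                return i; /* expected a continuation byte */
            }
            cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return i; /* overlong, out of range, or a surrogate */
        }
        i += n + 1;
    }
    return len;
}
```

A scan like this is linear in the input, so it is cheap next to a round trip to the database, but it can only say that the bytes are well formed UTF-8, not that the user meant them as UTF-8.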

I generally don't like A as it's the most wasteful and really doesn't solve the difficult problems.


Any rate, share your thoughts..

-Sara

P.S. - Where is primary PDO development happening? Last I heard PECL releases were coming out of the 5.1 branch and that was the place to be. Has HEAD been kept in sync?


--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
