PDO Devs, et al.,

It's that time: time to start looking at PDO's plans for the future, specifically how it'll integrate with the wild world of Unicode. After working with the sqlite2 native driver and reading up on some of the other RDBMSs, I've come up with a few scenarios of varying merit for PDO that I'd like to bounce off y'all and the world at large.

(A) PDO downcodes all inbound Unicode data (SQL statements, bound params, etc.) to UTF-8, and upconverts return data (results) from UTF-8 to UTF-16 (UChar type) on return (when UG(unicode) is enabled).

  Pros: No changes to the dbh/stmt handler APIs.
  Cons: Changes assumptions made by many (most?) drivers.
        Anywhere non-UTF-8 data (e.g. latin1) is expected, the data will have to be re-converted.
        Doesn't cleanly account for binary strings passed in that are not already UTF-8 encoded, which could easily lead to WTF moments in non-unicode semantics mode (the normal case for many/most users). More so when the driver is trying to decide whether it can use the data it received as-is or has to transcode to get to the right charset.
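To make the downcoding step in (A) a bit more concrete, here is a rough sketch of turning UTF-16 code units into UTF-8 bytes. It is hand-rolled purely for illustration: sketch_utf16_to_utf8 and SKETCH_CONV_ERROR are made-up names, not PDO or ICU API, and a real implementation would presumably lean on ICU rather than do this itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CONV_ERROR ((size_t)-1)   /* hypothetical error marker */

/* Encode src_len UTF-16 code units into dst as UTF-8.
 * Returns the number of bytes written, or SKETCH_CONV_ERROR on an
 * unpaired surrogate or when dst is too small. */
static size_t sketch_utf16_to_utf8(const uint16_t *src, size_t src_len,
                                   unsigned char *dst, size_t dst_cap)
{
    size_t in = 0, out = 0;

    assert(src != NULL && dst != NULL);
    while (in < src_len) {
        uint32_t cp = src[in++];
        size_t need;

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: must be followed by a low surrogate. */
            if (in >= src_len || src[in] < 0xDC00 || src[in] > 0xDFFF) {
                return SKETCH_CONV_ERROR;
            }
            cp = 0x10000 + ((cp - 0xD800) << 10) + (src[in++] - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return SKETCH_CONV_ERROR;    /* low surrogate with no partner */
        }

        need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (dst_cap - out < need) {
            return SKETCH_CONV_ERROR;    /* output buffer too small */
        }

        if (need == 1) {
            dst[out++] = (unsigned char)cp;
        } else if (need == 2) {
            dst[out++] = (unsigned char)(0xC0 | (cp >> 6));
            dst[out++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (need == 3) {
            dst[out++] = (unsigned char)(0xE0 | (cp >> 12));
            dst[out++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            dst[out++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            dst[out++] = (unsigned char)(0xF0 | (cp >> 18));
            dst[out++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            dst[out++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            dst[out++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Every statement, bound param and result column would take a pass like this (or its reverse) under (A), and a latin1 driver would then have to convert the UTF-8 again on its own.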


(B) Change all string handling APIs (e.g. do/execute/fetch) to include a type field (zend_uchar str_type, zstr str, int str_len) so that drivers get Unicode as UChar*, and non-Unicode as char*.

  Pros: Leaves character set handling to the driver, which is best equipped to make decisions about its quirks.
        Binary (most likely localized) data is recognized as such and can be handled appropriately.
  Cons: Puts more work on the actual driver to handle Unicode conversion.
        Leads to lots of #ifdef macro work since drivers live in PECL and must still be compilable on PHP5.
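As a rough sketch of what (B) asks of a driver, the handler below receives the (str_type, str, str_len) triple and decides what to do with it. All sketch_* names are stand-ins so the example compiles on its own; the real zend_uchar and zstr definitions live in the engine headers, and the tag values plus the assumption that str_len counts UTF-16 code units for Unicode strings are mine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the engine types named in (B); approximations only. */
typedef unsigned char sketch_uchar_t;      /* plays the role of zend_uchar */
typedef union {
    char *s;                               /* binary / non-unicode data */
    uint16_t *u;                           /* UTF-16 code units (UChar) */
} sketch_zstr;

enum { SKETCH_IS_STRING = 0, SKETCH_IS_UNICODE = 1 };   /* hypothetical tags */

typedef enum {
    SKETCH_PASS_THROUGH,   /* hand the bytes to the database untouched */
    SKETCH_TRANSCODE       /* convert UTF-16 to the connection charset */
} sketch_action;

/* The driver, not PDO, decides what to do based on the type tag. */
static sketch_action sketch_driver_plan(sketch_uchar_t str_type,
                                        sketch_zstr str, int str_len)
{
    assert(str_len >= 0);
    (void)str;   /* carried only to mirror the proposed handler signature */
    return str_type == SKETCH_IS_UNICODE ? SKETCH_TRANSCODE
                                         : SKETCH_PASS_THROUGH;
}

/* Size of the payload in bytes, assuming str_len counts code units
 * for unicode strings and bytes for binary strings. */
static size_t sketch_byte_len(sketch_uchar_t str_type, int str_len)
{
    assert(str_len >= 0);
    return str_type == SKETCH_IS_UNICODE
        ? (size_t)str_len * sizeof(uint16_t)
        : (size_t)str_len;
}
```

The branch on str_type is the information the other two options don't hand to the driver; the price is that every driver grows code like this, wrapped in version checks for PHP5.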


(C) Add a UConverter *encoding_conv; element to the pdo_dbh and pdo_stmt objects, and an INI setting: pdo.default_encoding. When passing data to/from a stmt object, the statement object's converter is used if available (set during prepare); if not, the driver's converter is used (set by the factory); otherwise pdo.default_encoding is used as a fallback. Data exchanges with the dbh object are handled similarly, though (obviously) skipping the stmt step.

  Pros: Keeps character set conversion work out of the driver layer.
        Reduces the amount of #ifdef work for multiple version support.
        Recognizes that some drivers (SQLite) use a single encoding universally, while others allow different tables to use different encodings.
  Cons: Doesn't solve the "do()" problem of encoding to different charsets when inserting into tables of a driver which allows different charsets per table.
        Doesn't provide an indicator which says "This came from a unicode string and was converted by ICU so is reliably in the correct encoding" versus "This was handed to me by the user as a binary string and may contain anything". Though this is also "fixable" by either changing the handler proto or by burying a state flag in the dbh/stmt objects.
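The lookup order described in (C) could look roughly like the sketch below. The structs are trimmed-down, hypothetical stand-ins for the dbh and stmt objects, and sketch_converter replaces ICU's UConverter so the example builds without ICU; only the encoding_conv member name comes from the proposal.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for ICU's UConverter so the sketch compiles without ICU. */
typedef struct sketch_converter {
    const char *charset;
} sketch_converter;

/* Hypothetical, trimmed-down versions of the dbh and stmt objects. */
typedef struct sketch_dbh {
    sketch_converter *encoding_conv;   /* set by the driver factory, may be NULL */
} sketch_dbh;

typedef struct sketch_stmt {
    sketch_dbh *dbh;
    sketch_converter *encoding_conv;   /* set during prepare, may be NULL */
} sketch_stmt;

/* Statement converter first, then the handle's converter, then the
 * converter built from pdo.default_encoding. Pass stmt == NULL for
 * dbh-level calls such as do(), which skip the stmt step. */
static sketch_converter *sketch_pick_converter(const sketch_stmt *stmt,
                                               const sketch_dbh *dbh,
                                               sketch_converter *ini_default)
{
    assert(ini_default != NULL);
    if (stmt != NULL && stmt->encoding_conv != NULL) {
        return stmt->encoding_conv;
    }
    if (dbh != NULL && dbh->encoding_conv != NULL) {
        return dbh->encoding_conv;
    }
    return ini_default;
}
```

A driver with one universal encoding would set only the dbh converter in its factory; one with per-table charsets could override it per statement during prepare.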


Personally I like option C the best as it presents the least amount of work for individual drivers, costs the least in terms of version/ifdefs, and provides a reasonable degree of flexibility.

As mentioned, however, only B tells the driver how reliable the encoding is: "Is this *really* utf8? Or am I going to find a stray \xA0 in here somewhere?" Of course, we currently have no such assurance; the user is simply expected to give the driver well-formed data, and if they don't they're SOL already.
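For what it's worth, the "is this really utf8?" question can be answered by brute force. Below is a minimal well-formedness check; sketch_is_valid_utf8 is a hypothetical helper, not something PDO or any driver provides today, and it costs a full scan of the string.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if buf[0..len) is well-formed UTF-8, 0 otherwise. */
static int sketch_is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    assert(buf != NULL || len == 0);
    while (i < len) {
        unsigned char c = buf[i];
        unsigned long cp, min;
        size_t extra, k;

        if (c < 0x80) {                    /* plain ASCII */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {   /* 2-byte sequence */
            extra = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {   /* 3-byte sequence */
            extra = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {   /* 4-byte sequence */
            extra = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;   /* stray continuation byte (e.g. 0xA0) or bad lead */
        }

        if (len - i <= extra) {
            return 0;   /* sequence runs off the end of the buffer */
        }
        for (k = 1; k <= extra; k++) {
            unsigned char cc = buf[i + k];
            if ((cc & 0xC0) != 0x80) {
                return 0;   /* expected a continuation byte */
            }
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return 0;   /* overlong form, out of range, or surrogate */
        }
        i += extra + 1;
    }
    return 1;
}
```

Note that passing this check only says the bytes are well-formed, not that the user meant them as UTF-8, which is why a type tag as in B carries more information than any after-the-fact scan.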

I generally don't like A as it's the most wasteful and really doesn't solve the difficult problems.

Anyway, share your thoughts.

-Sara

P.S. - Where is primary PDO development happening? Last I heard PECL releases were coming out of the 5.1 branch and that was the place to be. Has HEAD been kept in sync?

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
