On Sun, Jul 22, 2012 at 11:50 PM, Ferenc Kovacs <tyr...@gmail.com> wrote:
>
> On Wed, Mar 21, 2012 at 7:23 PM, Umberto Salsi <sa...@icosaedro.it> wrote:
>
>> Although I never contributed to the code of the PHP project, I hope the
>> ideas that follow may provide some suggestions for the future development
>> of the PHP language. The basic idea is described in the abstract, and those
>> who are not interested may stop there :-)
>>
>>
>> Abstract
>> ========
>>
>> My modest proposal is to provide Unicode support to PHP without the need to
>> rewrite the whole engine and its libraries, by introducing the UString
>> abstraction layer as a regular class, and with minimal support from the
>> core engine. Basically, the UString class hides the internal implementation
>> of the Unicode strings and allows experimenting with several solutions
>> (UTF-8, UCS-2, UCS-4, ...). Several different implementations may be
>> attempted and may also be made available, leaving the user free to choose
>> the best compromise between performance and memory footprint. Two useful
>> functions, u() and uecho(), are also discussed; they may help in writing
>> Unicode-interoperable source programs and libraries. These latter functions
>> require some support from the PHP engine.
>>
>>
>> The UString class
>> =================
>>
>> 1. The UString class holds an immutable array of Unicode characters. It is
>> final and hides the internal representation of the Unicode string, which
>> may be UTF-8 or UCS-2 or anything else:
>>
>>     final class UString
>>         implements Hashable, Comparable, Sortable, UPrintable,
>>             Printable, Serializable
>>     { ... }
>>
>> (Besides the well-known Serializable interface, the other implemented
>> interfaces will be discussed later.)
>>
>> All the PHP programs and the external libraries based on this class are
>> completely unaware of the actual internal encoding used. Several
>> implementations may be provided to fit different needs.
>> In western European countries UTF-8 works well, and some optimizations
>> allow performance very close to that of an ordinary "string" of bytes. For
>> example, in my PHP implementation an internal byte index allows scanning
>> the UTF-8 sequences quickly forward and backward.
>>
>> 2. The UString class has no constructor, but several factory methods that
>> take arrays of bytes (aka string):
>>
>>     static UString function fromASCII(string $s)
>>     static UString function fromUTF8(string $s)
>>     static UString function fromISO88591(string $s)
>>     static UString function fromUTF16LE(string $s)
>>     static UString function fromUTF16BE(string $s)
>>     ...
>>
>> These factory methods silently skip invalid bytes and invalid sequences,
>> possibly replacing them with '?'. No warnings, no exceptions. Other utility
>> functions may be provided that check an array of bytes for a specific
>> encoding.
>>
>> Several corresponding instance methods perform the reverse translation
>> into an array of bytes:
>>
>>     string function toASCII()
>>     string function toUTF8()
>>     string function toISO88591()
>>     string function toUTF16LE()
>>     string function toUTF16BE()
>>     ...
>>
>> So, for example,
>>
>>     UString::fromUTF8( $u->toUTF8() )->equals($u)
>>
>> is always TRUE for any Unicode string $u (the equals() method is described
>> below).
>>
>> 3. The UString class provides the usual string manipulation routines:
>>
>>     int function length()
>>     UString function substring($from, $to)
>>     UString function charAt($index)
>>     UString function append(UString $u)
>>     bool function startsWith(UString $u)
>>     bool function endsWith(UString $u)
>>     int function indexOf(UString $u)
>>     UString function trim($blacklist = u("\n\r\t"))
>>     bool function equalsIgnoreCase(UString $u)
>>     UString function toUpperCase()
>>     UString function toLowerCase()
>>     UString[] function explode($separator = u(" "))
>>     UString function implode($separator = u(""))
>>     ...
>>
>> (More about the magic u() function later.)
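To make the factory methods and the round-trip property from items 2 and 3 concrete, here is a minimal sketch. It is not the code from the proposal: the internal representation (an array with one UTF-8 character per element), the private constructor and the byte-level regular expression are my own assumptions, and only a handful of the listed methods are covered.

```php
<?php
// Hypothetical sketch, not the implementation from the proposal.
// Assumption: the string is stored as an array with one UTF-8 encoded
// character per element; callers never see this representation.
final class UString
{
    private $chars;

    // Private: instances are only created through the factory methods.
    private function __construct(array $chars)
    {
        $this->chars = $chars;
    }

    // Builds a UString from UTF-8 bytes; every invalid byte becomes '?'.
    public static function fromUTF8($bytes)
    {
        // Well-formed UTF-8 sequences first, then any stray byte (group 1).
        $re = '/[\x00-\x7F]|[\xC2-\xDF][\x80-\xBF]'
            . '|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
            . '|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}'
            . '|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2}|(.)/s';
        preg_match_all($re, $bytes, $matches, PREG_SET_ORDER);
        $chars = array();
        foreach ($matches as $m) {
            $chars[] = (isset($m[1]) && $m[1] !== '') ? '?' : $m[0];
        }
        return new self($chars);
    }

    public function toUTF8()
    {
        return implode('', $this->chars);
    }

    // Length in characters, not in bytes.
    public function length()
    {
        return count($this->chars);
    }

    public function append(UString $u)
    {
        return new self(array_merge($this->chars, $u->chars));
    }

    // TRUE only for a UString holding the same character sequence.
    public function equals($other)
    {
        return $other instanceof UString && $this->chars === $other->chars;
    }
}

$u = UString::fromUTF8("caf\xC3\xA9");
echo $u->length(), "\n";                                  // 4
echo UString::fromUTF8("a\xFFb")->toUTF8(), "\n";         // a?b
var_dump(UString::fromUTF8($u->toUTF8())->equals($u));    // bool(true)
```

Because the representation is private, it could be swapped for UCS-2 or UCS-4 without touching any caller, which is the point of the abstraction.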
>>
>> 4. The UString class implements the UPrintable interface that returns "the
>> best human-readable representation of the object as a UString string", that
>> is, the string itself:
>>
>>     UString function __toUString(){ return $this; }
>>
>> 5. The UString class implements the Printable interface that returns "the
>> best human-readable representation of the object as a string, possibly
>> composed of ASCII characters only":
>>
>>     string function __toString(){ return $this->toASCII(); }
>>
>> 6. The UString class implements the Hashable interface, useful to implement
>> hashing algorithms (HashMap, HashSet, ...):
>>
>>     int function getHash(){ ... }
>>
>> Since UString is immutable, this function may compute the hash once and for
>> all. In my current PHP implementation I have used crc32(), but the PHP
>> engine hides a more efficient hashing function that might be used instead
>> (what about making it available in userland code?).
>>
>> 7. The UString class implements the Comparable interface:
>>
>>     bool function equals(object $u)
>>
>> that returns TRUE only if the object is a UString and contains the same
>> sequence of Unicode characters, and returns FALSE in any other case.
>>
>> 8. The UString class implements the Sortable interface:
>>
>>     int function compareTo(object $u)
>>
>> that returns -1, 0 or +1 if $u is a UString, or raises E_WARNING if $u is
>> not a UString.
>>
>>
>>
>> The u() and the uecho() magic functions
>> =======================================
>>
>> Two "magic" functions help in writing PHP programs. Basically, u() is a
>> factory function that translates an array of bytes into a UString, for
>> example:
>>
>>     UString function u(string $s){
>>         return UString::fromXxx($s);
>>     }
>>
>>     $hello = u("hello");
>>
>> where Xxx is the encoding of the source.
>> But the u() function may do much more than this, and the implementation I
>> have made provides several other features:
>>
>> - Literal strings are cached, so if the u("hello") statement is executed
>> several times, only one single UString object is created and this same
>> object is returned each time the u("hello") function is evaluated. Since in
>> a source program the number of literal strings is finite, the string cache
>> will be finite as well.
>>
>> - Automatically converts any type of data into UString, so that u(123)
>> yields the same as UString::fromASCII( (string) 123 ). Here too, small
>> numbers, the most common ones, can be cached.
>>
>> - If the argument is an object that implements the UPrintable interface,
>> its __toUString() method is called; if it implements Printable, the
>> __toString() method is called. Boolean values generate "FALSE" or "TRUE".
>> The NULL value generates "NULL", etc.
>>
>> - If the argument is a UString, it is returned as is.
>>
>> - If several arguments are provided, each argument is converted into
>> UString and concatenated with UString::append().
>>
>> All this can be implemented in PHP source, and does not require changes to
>> the engine. The uecho() function does just the same, but also sends the
>> result to stdout using the chosen encoding:
>>
>>     uecho(...)  ====>  echo u(...)->toXxx();
>>
>> where Xxx is the encoding corresponding to the one used in the u() function
>> to translate literal strings.
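The caching and conversion rules listed above can be sketched in plain PHP as follows. This is only an illustration under my own assumptions: the UString class here is a reduced stand-in that wraps UTF-8 bytes, the source encoding is taken to be UTF-8, and the cache key scheme is made up. A userland function cannot tell a literal from a dynamically built string, so this sketch caches every string and integer it sees.

```php
<?php
// Reduced stand-in for the proposed UString class (assumption: it simply
// wraps UTF-8 bytes; the real class would hide its representation).
final class UString
{
    private $utf8;

    public function __construct($utf8)
    {
        $this->utf8 = $utf8;
    }

    public function append(UString $u)
    {
        return new UString($this->utf8 . $u->utf8);
    }

    public function toUTF8()
    {
        return $this->utf8;
    }
}

// Converts every argument to a UString and concatenates the results.
function u()
{
    static $cache = array();   // one shared object per distinct string/int
    $result = null;
    foreach (func_get_args() as $arg) {
        if ($arg instanceof UString) {
            $piece = $arg;                      // a UString is returned as is
        } elseif (is_string($arg) || is_int($arg)) {
            $key = gettype($arg) . ':' . $arg;  // hypothetical cache key
            if (!isset($cache[$key])) {
                $cache[$key] = new UString((string) $arg);
            }
            $piece = $cache[$key];
        } elseif (is_bool($arg)) {
            $piece = new UString($arg ? 'TRUE' : 'FALSE');
        } elseif ($arg === null) {
            $piece = new UString('NULL');
        } else {
            // Floats and objects with __toString(); other types not handled.
            $piece = new UString((string) $arg);
        }
        $result = ($result === null) ? $piece : $result->append($piece);
    }
    return ($result === null) ? new UString('') : $result;
}

// Same as u(), but writes the result to stdout as UTF-8.
function uecho()
{
    echo call_user_func_array('u', func_get_args())->toUTF8();
}

var_dump(u("hello") === u("hello"));   // bool(true): the same cached object
uecho(1, " - ", u("hello"), ", ", true, "\n");   // 1 - hello, TRUE
```

Since UString is immutable, handing out the same cached object to several callers is safe.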
>>
>> Programmers may then write something like this:
>>
>>     function TenHelloWorld() {
>>         $hello = u("hello");
>>         $world = u("world!");
>>         for($i = 0; $i < 10; $i++)
>>             uecho($i, " - ", $hello, ", ", $world);
>>     }
>>
>> This function generates 2 objects for the $hello and $world vars (cached),
>> 2 objects for the " - " and the ", " literal strings (cached), 10 objects
>> for the $i numbers (possibly cached), and another 10 objects for the
>> resulting concatenations of the strings. If this function is called again,
>> the cached values are reused.
>>
>>
>> Support from the PHP engine
>> ===========================
>>
>> R1. First of all, the current implementation of the UString class, being
>> bare PHP code, isn't very efficient, and a C implementation of all or at
>> least some of the most critical sections of code could greatly improve
>> performance. Since the PHP code developed around the UString class does not
>> depend on the internal representation of the Unicode characters, several
>> implementations may be tested and the final decision about the "standard"
>> one can be postponed. Or the choice can be left to the users, who may
>> choose the implementation that best fits their needs.
>>
>> R2. Second, the u() function requires some support from the PHP engine
>> because it cannot be used in static expressions. The following code, for
>> example, is not valid:
>>
>>     function f( UString $s = u("xxx") ){ ... }
>>
>>     class MyClass {
>>         const
>>             C = u("xxx"),
>>             A = array( u("zero"), u("one"), u("two") );
>>     }
>>
>> If the PHP restriction imposed on static expressions could be relaxed a
>> bit, at least for some magic functions, the code above would be possible.
>>
>> R3. Third, only the engine can establish whether a string that enters the
>> u() function is really a literal string and not a dynamically generated
>> string.
>> For example, in my current PHP implementation I can only warn the
>> programmer in the documentation against doing things like
>>
>>     for($i = 0; $i < 10000; $i++)
>>         uecho( "cycle no. $i" );
>>
>> that would pollute the cache of u() with thousands of useless dynamically
>> generated strings. Or, even better, the PHP engine itself might split the
>> string and rewrite it automatically in a cache-aware way as:
>>
>>     uecho("cycle no. ", $i);
>>
>> R4. Another area where some support from the PHP engine would be useful is
>> the detection of the encoding used in the source, so that the Xxx encoding
>> to be used in the u() and uecho() functions can be automatically
>> determined. In my current PHP implementation I stick with UTF-8, but a more
>> general approach may take advantage of the new declare(encoding="Xxx")
>> statement. For example, the engine might instantiate a "translator" object
>> to be used for the current source, and this translator object might be made
>> available to the program as a global variable that translates from array of
>> bytes to UString and vice versa:
>>
>>     interface EncodingTranslator {
>>         # Encoder call-back:
>>         UString function encode(string $s);
>>         # Decoder call-back:
>>         string function decode(UString $u);
>>     }
>>
>>     # UTF-8 specific encoder/decoder functions pair:
>>     class UTF8EncodingTranslator implements EncodingTranslator {
>>         static EncodingTranslator function getInstance(){...}
>>         UString function encode(string $s){ return UString::fromUTF8($s); }
>>         string function decode(UString $u){ return $u->toUTF8(); }
>>     }
>>
>>     # Other encoder/decoder functions pairs:
>>     class ISO88591EncodingTranslator implements EncodingTranslator { ... }
>>     class UTF16LEEncodingTranslator implements EncodingTranslator { ... }
>>     ...
>>     class ASCIIEncodingTranslator implements EncodingTranslator { ... }
>>
>>
>>     # Here the PHP engine creates the per-source file translator object,
>>     # setting the $curr_encoding_translator variable:
>>     if( the engine detected this src is UTF8 encoded ){
>>         $curr_encoding_translator = UTF8EncodingTranslator::getInstance();
>>     } else if( the engine detected this src is ISO-8859-1 encoded ){
>>         $curr_encoding_translator = ISO88591EncodingTranslator::getInstance();
>>     ...
>>     } else {
>>         $curr_encoding_translator = ASCIIEncodingTranslator::getInstance();
>>     }
>>
>>
>> The u() function may then use the global variable $curr_encoding_translator
>> to encode and decode every string in every specific source program; that
>> is, $curr_encoding_translator changes its value according to the source
>> which is currently under execution. In this way libraries can be developed
>> separately with different source encodings without affecting the
>> interoperability with past and future programs.
>>
>>
>> Further developments
>> ====================
>>
>> - I/O functions that support Unicode file names through modern dedicated
>> classes/functions: FileOutputStream, FileInputStream, etc. This is
>> particularly required under Windows, whose file system uses the UCS-2
>> encoding (in brief: replace fopen() with _wfopen() etc. under Windows, or
>> provide another "hook" that exposes these functions to PHP sources so that
>> these classes can be implemented in PHP code).
>> - String pattern matching, aka regex, but fully Unicode aware.
>> - An encoding-independent database abstraction layer.
>> - A new generation of portable libraries.
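Going back to the per-source translator described under R4, here is a small runnable sketch of two translators and of the selection step. Everything beyond the encode()/decode() pair is my own assumption: the UString stand-in wraps UTF-8 bytes, translatorFor() is a hypothetical helper that plays the role of the engine, and unknown encodings throw instead of falling back to ASCII.

```php
<?php
// Reduced stand-in for the proposed UString (assumption: wraps UTF-8 bytes).
final class UString
{
    private $utf8;
    public function __construct($utf8) { $this->utf8 = $utf8; }
    public function toUTF8() { return $this->utf8; }
}

interface EncodingTranslator
{
    public function encode($bytes);       // source-encoded bytes -> UString
    public function decode(UString $u);   // UString -> source-encoded bytes
}

final class UTF8EncodingTranslator implements EncodingTranslator
{
    public function encode($bytes) { return new UString($bytes); }
    public function decode(UString $u) { return $u->toUTF8(); }
}

final class ISO88591EncodingTranslator implements EncodingTranslator
{
    public function encode($bytes)
    {
        $utf8 = '';
        for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
            $b = ord($bytes[$i]);
            // Bytes 0x80..0xFF map to U+0080..U+00FF: two UTF-8 bytes each.
            $utf8 .= ($b < 0x80)
                ? $bytes[$i]
                : (chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F)));
        }
        return new UString($utf8);
    }

    public function decode(UString $u)
    {
        // Sequences starting with C2 or C3 fit in ISO-8859-1; any other
        // non-ASCII character is replaced with '?'.
        return preg_replace_callback(
            '/[\xC2\xC3][\x80-\xBF]|[\xC0-\xFF][\x80-\xBF]*|[\x80-\xBF]/',
            function ($m) {
                $seq = $m[0];
                if (strlen($seq) === 2 && ($seq[0] === "\xC2" || $seq[0] === "\xC3")) {
                    return chr(((ord($seq[0]) & 0x03) << 6) | (ord($seq[1]) & 0x3F));
                }
                return '?';
            },
            $u->toUTF8()
        );
    }
}

// Hypothetical helper standing in for the engine's per-source detection.
function translatorFor($declaredEncoding)
{
    switch (strtoupper($declaredEncoding)) {
        case 'UTF-8':
            return new UTF8EncodingTranslator();
        case 'ISO-8859-1':
            return new ISO88591EncodingTranslator();
        default:
            throw new InvalidArgumentException("No translator for $declaredEncoding");
    }
}

// A library written in ISO-8859-1 hands a string to a UTF-8 program.
$fromLibrary = translatorFor('ISO-8859-1')->encode("caf\xE9");
echo bin2hex(translatorFor('UTF-8')->decode($fromLibrary)), "\n";   // 636166c3a9
```

The two sides only ever exchange UString objects, so neither needs to know the other's source encoding.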
>>
>>
>> Proof of concept - The actual PHP implementation
>> ================================================
>>
>> The PHP implementation of all this is available both as documentation and
>> as PHP source at this address:
>>
>>     http://www.icosaedro.it/phplint/libraries.cgi
>>
>> Almost all the classes listed above are available, in particular:
>>
>>     UString
>>     UPattern (regex with UString)
>>     utf8.php (provides u() and uecho() for UTF-8 only)
>>     FileName (attempt to support Unicode file names on Linux and Win)
>>
>>
>>
>> Regards,
>>  ___
>> /_|_\  Umberto Salsi
>> \/_\/  www.icosaedro.it
>>
>>
>> --
>> PHP Unicode & I18N Mailing List (http://www.php.net/)
>> To unsubscribe, visit: http://www.php.net/unsub.php
>>
>>
> Hi,
>
> I think that there aren't that many people subscribed to this list, so I'm
> ccing the internals list, as your suggestion is to implement/bundle this in
> the core.
>
>
For the record, there is another userland library targeting Unicode support
without external dependencies:
https://github.com/nicolas-grekas/Patchwork-UTF8

It is currently being considered for inclusion in Symfony2, so that they can
leverage the PHP extension dependencies. See
https://groups.google.com/forum/#!topic/symfony-devs/FtODyLi8OYk

--
Ferenc Kovács
@Tyr43l - http://tyrael.hu