Hello Brion,

Thank you for your feedback.

First of all, README.UNICODE is a bit out of date, as you probably noticed. I need to update it once we finalize this conversion/casting discussion.

Your point about writing portable Unicode-friendly code is well taken. Rasmus and I have chatted a bit here, and we think we can propose some changes that may make it easier.

With unicode_semantics=off:
* (unicode) cast converts binary strings to Unicode strings using runtime_encoding setting
* (string) converts Unicode strings to binary strings using runtime_encoding again
* Binary and Unicode strings cannot be concatenated. You have to cast all operands to the same type.

With unicode_semantics=on:
* (unicode) cast converts binary strings to Unicode strings. The issue here is whether to use script_encoding (in case you do (unicode)b"blah") or runtime_encoding (in case it's a binary string that came from elsewhere)
* (string) converts Unicode strings to binary strings using runtime_encoding setting
* Binary and Unicode strings cannot be concatenated. You have to cast all operands to the same type.

I think this will make it easier to write code, because you can always depend on the behavior of the cast operators. The (unicode) and (string) casts are basically shortcuts for unicode_decode() and unicode_encode(), respectively, used with the runtime_encoding setting (excepting the issue I mentioned above).
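
As a rough illustration of the rules above, here is a small sketch in Python rather than PHP, since Python's bytes and str types happen to behave much like the binary and Unicode strings described here. It is only an analogy under stated assumptions, not the actual implementation: RUNTIME_ENCODING, to_unicode() and to_binary() are hypothetical names standing in for the runtime_encoding setting and the (unicode) and (string) casts.

```python
# Illustrative model only. Python bytes/str stand in for the proposed
# binary/Unicode string types; all names below are hypothetical.
RUNTIME_ENCODING = "utf-8"  # stand-in for the runtime_encoding setting


def to_unicode(value):
    """Model of the (unicode) cast: decode binary data with RUNTIME_ENCODING."""
    if isinstance(value, bytes):
        return value.decode(RUNTIME_ENCODING)
    return value  # already a Unicode string


def to_binary(value):
    """Model of the (string) cast: encode Unicode text with RUNTIME_ENCODING."""
    if isinstance(value, str):
        return value.encode(RUNTIME_ENCODING)
    return value  # already a binary string


binary = "caf\u00e9".encode("utf-8")  # binary string that came from elsewhere
text = "na\u00efve "                  # Unicode string

# Mixing the two types directly is rejected, as in the third bullet.
try:
    text + binary
except TypeError:
    mixed_ok = False
else:
    mixed_ok = True

# Casting all operands to the same type makes concatenation work.
as_text = text + to_unicode(binary)
as_bytes = to_binary(text) + binary
```

The point of the sketch is that the casts behave the same way regardless of any global switch, so code that always casts before concatenating does not need to know which mode it runs under.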

The unicode_semantics switch will not be per-request, for a variety of reasons we have covered before.

Your suggestion about treating all string literals as Unicode if an encoding pragma is used is an interesting one and merits more discussion, I think. Do you think it should affect only literals or also identifiers?

-Andrei


Both the implicit coercions and the explicit casts seem to have vanished, and
behavior is worryingly inconsistent:

With unicode_semantics off:
* (unicode) cast fails on binary strings
* (string) converts things, including Unicode strings, to binary strings
* Binary and Unicode strings can't be concatenated.
* There's no available cast from string literals and variables to Unicode strings.

With unicode_semantics on:
* (unicode) fails on binary strings
* (string) behaves as (unicode), converting things to unicode strings
* Binary and Unicode strings can't be concatenated.
* There is no available cast from Unicode string variables to binary strings.
(For literals you can use b"blah".)


This looks like a pretty painful place to be as far as writing portable Unicode-friendly code, because there is no way to write Unicode literals that will reliably work. Even if your in-code literals are all ASCII, you can't mix them with runtime Unicode strings because it throws a fatal error with unicode_semantics off.

This is particularly bad if unicode_semantics can't be changed on a per-request basis; this virtually guarantees that many hosting providers will turn it off "for compatibility" or "for speed", and individual users won't be able to do a darn thing about it.


Wrapping every string literal in a conditional call to unicode_decode() sounds less than ideal; if (unicode) casts worked they would still be pretty ugly too.

I would *love* a pragma setting like the declare(encoding="UTF-8") to say "I'm going to use Unicode string literals in this file, whatever unicode_semantics may be." Would there be any interest in supporting a mode like this?

A Python-style modifier like u"blah" could go along with the b"blah" binary string literal as well, though I'd rather not have to put a sigil on every string...

-- brion vibber (brion @ pobox.com)

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
