On 28 Mar, 21:29, Benjamin Kaplan <benjamin.kap...@case.edu> wrote:
> On Thu, Mar 28, 2013 at 10:48 AM, jmfauth <wxjmfa...@gmail.com> wrote:
> > On 28 Mar, 17:33, Ian Kelly <ian.g.ke...@gmail.com> wrote:
> >> On Thu, Mar 28, 2013 at 7:34 AM, jmfauth <wxjmfa...@gmail.com> wrote:
> >> > The flexible string representation takes the problem from the
> >> > other side, it attempts to work with the characters by using
> >> > their representations and it (can only) fails...
>
> >> This is false. As I've pointed out to you before, the FSR does not
> >> divide characters up by representation. It divides them up by
> >> codepoint -- more specifically, by the *bit-width* of the codepoint.
> >> We call the internal format of the string "ASCII" or "Latin-1" or
> >> "UCS-2" for conciseness and a point of reference, but fundamentally
> >> all of the FSR formats are simply byte arrays of *codepoints* -- you
> >> know, those things you keep harping on. The major optimization
> >> performed by the FSR is to consistently truncate the leading zero
> >> bytes from each codepoint when it is possible to do so safely. But
> >> regardless of to what extent this truncation is applied, the string is
> >> *always* internally just an array of codepoints, and the same
> >> algorithms apply for all representations.
>
> > -----
>
> > You know, we can discuss this ad nauseam. What is important
> > is Unicode.
>
> > You have transformed Python back into an ASCII-oriented product.
>
> > If Python had implemented Unicode correctly, there would
> > be no difference in using an "a", "é", "€" or any character,
> > which is what the narrow builds did.
>
> > If I am practically the only one who speaks/discusses about
> > this, I can assure you, this has been noticed.
>
> > Now, it's time to prepare the Asparagus, the "jambon cru"
> > and a good bottle of dry white wine.
>
> > jmf
>
> You still have yet to explain how Python's string representation is
> wrong. Just how it isn't optimal for one specific case.
> Here's how I understand it:
>
> 1) Strings are sequences of stuff. Generally, we talk about strings as
> either sequences of bytes or sequences of characters.
>
> 2) Unicode is a format used to represent characters. Therefore,
> Unicode strings are character strings, not byte strings.
>
> 3) Encodings are functions that map characters to bytes. They
> typically also define an inverse function that converts from bytes
> back to characters.
>
> 4) UTF-8 IS NOT UNICODE. It is an encoding, one of those functions I
> mentioned in the previous point. It happens to be one of the five
> standard encodings that is defined for all characters in the Unicode
> standard (the others being the little and big endian variants of
> UTF-16 and UTF-32).
>
> 5) The internal representation of a character string DOES NOT MATTER.
> All that matters is that the API represents it as a string of
> characters, regardless of the representation. We could implement
> character strings by putting the Unicode code points in binary-coded
> decimal and it would be a Unicode character string.
>
> 6) The String type that .NET and Java (and the unicode type in Python
> narrow builds) use is not a character string. It is a string of
> shorts, each of which corresponds to a UTF-16 code unit. I know this
> is the case because in all of these, the length of "\U0001F435" is 2
> even though it only consists of one character.
>
> 7) The new string representation in Python 3.3 can successfully
> represent all characters in the Unicode standard. The actual number of
> bytes that each character consumes is invisible to the user.
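
For reference, the quoted points about encodings, UTF-16 code units and
the 3.3 representation can be checked with a small sketch. It assumes
CPython 3.3 or later; bytes_per_char is a helper name made up for this
example, and sys.getsizeof reports a CPython implementation detail, so
the storage figures may well differ on other implementations.

```python
import sys

# U+1F435 MONKEY FACE, a character outside the Basic Multilingual Plane.
monkey = "\U0001F435"

# The str API counts code points, whatever the internal storage is.
assert len(monkey) == 1
assert ord(monkey) == 0x1F435

# Encodings map the same character to different byte sequences.
utf8 = monkey.encode("utf-8")
utf16 = monkey.encode("utf-16-le")
utf32 = monkey.encode("utf-32-le")

# UTF-16 needs a surrogate pair for this character: two 16-bit code units.
# A type that counts code units would therefore report a length of 2.
utf16_units = len(utf16) // 2

# Decoding is the inverse function and recovers the same one-character string.
assert utf8.decode("utf-8") == monkey
assert utf16.decode("utf-16-le") == monkey
assert utf32.decode("utf-32-le") == monkey


def bytes_per_char(ch, n=1000):
    """Estimate the per-character storage of a string made only of `ch`.

    Hypothetical helper: the fixed object header cancels out in the
    difference, leaving the width of one stored code point.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


# "a" (ASCII), U+00E9 (Latin-1), U+20AC (BMP), U+1F435 (non-BMP).
samples = ["a", "\xe9", "\u20ac", monkey]
widths = [bytes_per_char(ch) for ch in samples]

print("UTF-16 code units for U+1F435:", utf16_units)
print("bytes per character:", widths)
```

The storage width grows with the widest code point in the string, but
len() and indexing behave the same way in every case.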
----------

I have shown enough examples. As soon as you use non-Latin-1
chars, your "optimization" becomes irrelevant and, worse,
you are penalized.

I'm sorry, but saying that Python now covers the whole Unicode
range is not a valid excuse. I prefer a "correct" version with
a narrower range of chars, especially if this range represents
the "daily used chars".

I can go a step further: if I wish to write an application for
Western European users, I'm better served by a coding scheme
covering all these languages/scripts. What about cp1252 [*]?
Does this not ring a bell?

Python can do better; it only succeeds in doing worse!

[*] yes, I know, internally ....

jmf

--
http://mail.python.org/mailman/listinfo/python-list