Re: Flexible string representation, unicode, typography, ...

Terry Reedy Thu, 30 Aug 2012 13:48:39 -0700

On 8/30/2012 12:00 PM, Steven D'Aprano wrote:

On Thu, 30 Aug 2012 07:02:24 -0400, Roy Smith wrote:

In article <503f0e45$0$9416$c3e8da3$76491...@news.astraweb.com>,
  Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:

The only thing which is innovative here is that instead of the Python
compiler declaring that "all strings will be stored in UCS-2", the
compiler chooses an implementation for each string as needed. So some
strings will be stored internally as UCS-4, some as UCS-2, and some as
ASCII (which is a standard, but not the Unicode consortium's standard).


Is the implementation smart enough to know that x == y is always False
if x and y are using different internal representations?


Yes, after checking lengths; and in the same circumstances, x != y is True. From
http://hg.python.org/cpython/file/ab6ab44921b2/Objects/unicodeobject.c

PyObject *
PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)
{
    int result;

    if (PyUnicode_Check(left) && PyUnicode_Check(right)) {
        PyObject *v;
        if (PyUnicode_READY(left) == -1 ||
            PyUnicode_READY(right) == -1)
            return NULL;
        if (PyUnicode_GET_LENGTH(left) != PyUnicode_GET_LENGTH(right) ||
            PyUnicode_KIND(left) != PyUnicode_KIND(right)) {
            if (op == Py_EQ) {
                Py_INCREF(Py_False);
                return Py_False;
            }
            if (op == Py_NE) {
                Py_INCREF(Py_True);
                return Py_True;
            }
        }
...
KIND is 1, 2, or 4 bytes/char.

'a in s' is also False if the chars of a are wider than the chars of s.
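
A minimal pure-Python model of the two shortcuts above. The helper names (kind, quick_equal, quick_contains) are made up for illustration and are not part of CPython; the real checks live in the C code. The model only assumes that the width is chosen from the largest code point in the string.

```python
def kind(s):
    """Bytes per character PEP 393 would pick for s: 1, 2 or 4.

    Hypothetical helper: it models PyUnicode_KIND from the largest
    code point. ASCII and Latin-1 strings both have kind 1.
    """
    top = max(map(ord, s), default=0)
    if top < 0x100:
        return 1
    if top < 0x10000:
        return 2
    return 4


def quick_equal(x, y):
    """Model of the Py_EQ fast path: different length or kind means unequal."""
    if len(x) != len(y) or kind(x) != kind(y):
        return False
    return x == y  # same length and kind: a full comparison is still needed


def quick_contains(a, s):
    """Model of the 'a in s' shortcut: a wider needle cannot occur in s."""
    if kind(a) > kind(s):
        return False
    return a in s
```

For example, quick_equal('abc', 'ab\u20ac') returns False without looking at individual characters, because the second string needs 2 bytes/char and the first needs 1.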

If s is all ASCII, s.encode('ascii') or s.encode('utf-8') is a fast, constant time operation, as I showed earlier in this discussion. This is one thing that is much faster in 3.3.

Such things can be tested by timing with different lengths of strings, where the initial string creation is done in setup code rather than in the repeated operation code.
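
One way to set up such a timing with the standard timeit module is sketched below. The function name time_encode and the chosen lengths are only illustrative; the string is built in the setup argument so that only the encode call is measured.

```python
import timeit


def time_encode(length, encoding="utf-8", number=1000):
    """Best-of-3 seconds for `number` calls of s.encode(encoding).

    s is an all-ASCII string of the given length, created in setup
    code so that its construction is not part of the timed statement.
    """
    setup = "s = 'a' * {}".format(length)
    stmt = "s.encode({!r})".format(encoding)
    return min(timeit.repeat(stmt, setup=setup, number=number, repeat=3))


if __name__ == "__main__":
    # Compare how the time changes as the string gets longer.
    for n in (1000, 10000, 100000):
        print(n, time_encode(n))
```

Running the same loop under different Python versions shows how the cost of the operation depends on string length in each of them.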

But x == y is not necessarily always False just because they have
different representations. There may be circumstances where two strings
have different internal representations even though their content is the
same, so it's an unsafe optimization to automatically treat them as
unequal.

I am sure that str objects are always in canonical form once visible to Python code. Note that unready (non-canonical) objects are rejected by the rich comparison function.
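
A rough way to look at this from Python code, assuming CPython 3.3 or later: build the same content by two routes, one of which passes through a wider string, and compare the results. sys.getsizeof is only an indirect, implementation-specific view of the storage, and the names plain, sliced and storage_sizes are illustrative.

```python
import sys


def storage_sizes(*strings):
    """Return sys.getsizeof for each string (an implementation detail)."""
    return [sys.getsizeof(s) for s in strings]


plain = "x" * 100
# Same content, but sliced out of a string that contains a non-Latin-1
# character and therefore needs more than 1 byte per char.
sliced = ("x" * 100 + "\u20ac")[:100]

same_content = plain == sliced
sizes = storage_sizes(plain, sliced)

if __name__ == "__main__":
    print(same_content, sizes)
```

If the slice had kept the wider representation of its parent, the two reported sizes would differ even though the strings compare equal.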

My expectation is that the initial implementation of PEP 393 will be
relatively unoptimized,

The initial implementation was a year ago. At least three people have expended considerable effort improving it since, so that the slowdown mentioned in the PEP has mostly disappeared. The things that are still slower are somewhat balanced by things that are faster.


--
Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list
