On Sun, 16 Oct 2005 12:16:58 +0200, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
>Bengt Richter wrote:
>> Perhaps string equivalence in keys will be treated like numeric
>> equivalence? I.e., a key/name representation is established by the
>> initial key/name binding, but values can be retrieved by "equivalent"
>> key/names with different representations like unicode vs ascii or
>> latin-1 etc.?
>
>That would require that you know the encoding of a byte string; this
>information is not available at run-time.
>
Well, what will be assumed about name after the lines

    #-*- coding: latin1 -*-
    name = 'Martin Löwis'

? I know type(name) will be <type 'str'> and in itself contain no
encoding information now, but why shouldn't the default assumption for
literal-generated strings be what the coding cookie specified? I know
the current implementation doesn't keep track of the different
encodings that could reasonably be inferred from the source of the
strings, but we are talking about future stuff here ;-)

>You could also try all possible encodings to see whether the strings
>are equal if you chose the right encoding for each one. This would
>be both expensive and unlike numeric equivalence: in numeric
>equivalence, you don't give a sequence of bytes all possible
>interpretations to find some interpretation in which they are
>equivalent, either.
>
Agreed, that would be a mess.

>There is one special case, though: when comparing a byte string
>and a Unicode string, the system default encoding (i.e. ASCII)
>is assumed. This only really works if the default encoding
>really *is* ASCII. Otherwise, equal strings might not hash
>equal, in which case you wouldn't find them properly in a
>dictionary.
>
Perhaps the str (or future byte) type could have an encoding attribute
defaulting to None, meaning its instances are treated like current str
instances. Setting the attribute to some particular encoding, like
'latin-1' (probably internally normalized and optimized to a C pointer
slot holding NULL or a pointer to the appropriate codec or whatever),
would mark the str byte string explicitly as an encoded string, without
changing the byte string data or converting it to a unicode encoding.

With encoding information explicitly present or absent, keys could have
a normalized hash and comparison, maybe by default just normalizing
encoding-tagged string keys to the platform unicode encoding for dict
purposes.

If this were done, IWT the automatic result of

    #-*- coding: latin1 -*-
    name = 'Martin Löwis'

could be that name.encoding == 'latin-1', whereas without the encoding
cookie the default encoding assumption for the program source would be
used and set explicitly, to 'ascii' or whatever it is.

Functions that generate strings, such as chr(), could be assumed to
create a string with the same encoding as the source code of the
chr(...) invocation. Ditto for e.g.

    '%s == %c' % (65, 65)

And

    s = u'Martin Löwis'.encode('latin-1')

would get s.encoding == 'latin-1', not s.encoding == None, so that the
encoding information could make

    print s

mean

    print s.decode(s.encoding)

(which would of course re-encode to the output device encoding for
output, like the current

    print s.decode('latin-1')

and not fail like the current default assumption for s, namely
s.encoding == None, i.e., assume the system default, which likely means

    print s.decode('ascii')

).

Hm, probably s.encode(None) and s.decode(None) could mean: retrieve the
str byte data unchanged, as a str string with encoding set to None in
the result, either way.
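To make the normalized-hash idea concrete, here is a toy sketch of the
sort of thing I have in mind. Nothing like this exists; the estr name,
its constructor, and the normalization rule are all invented on the
spot (a real version would presumably be done in C on the str type
itself), but it shows how encoding-tagged keys could find each other in
a dict:

    # Toy sketch only (Python 2): 'estr' is hypothetical, not a real type.
    class estr(str):
        """A str that drags an optional .encoding tag along with its bytes."""
        def __new__(cls, data, encoding=None):
            self = str.__new__(cls, data)
            self.encoding = encoding
            return self
        def _normalized(self):
            # Untagged (encoding is None) behaves like a plain str; tagged
            # strings normalize to unicode so equivalent keys hash and
            # compare equal regardless of their byte representation.
            if self.encoding is None:
                return str(self)
            return self.decode(self.encoding)
        def __hash__(self):
            return hash(self._normalized())
        def __eq__(self, other):
            if isinstance(other, estr):
                return self._normalized() == other._normalized()
            return str.__eq__(self, other)

    k1 = estr('L\xf6wis', 'latin-1')    # latin-1 bytes for 'Löwis'
    k2 = estr('L\xc3\xb6wis', 'utf-8')  # utf-8 bytes for the same characters
    d = {k1: 'found'}
    print d[k2]                         # prints 'found': equivalent key located

The byte data never changes; only the hash/comparison normalization
looks through the encoding tag.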
Now when you read a file in binary without specifying any encoding
assumption, you would get a str string with .encoding == None, but you
could effectively reinterpret-cast it to any encoding you like by
assigning to the encoding attribute. The attribute could be a property
that causes an automatic decode/encode to create data in the new
encoding. The None encoding, coming or going, would not change the data
bytes, but differing explicit encodings would cause a decode/encode.

This could also support s1+s2: the result would carry the same encoding
attribute if s1.encoding == s2.encoding, and otherwise each operand
would be promoted to the platform standard unicode encoding and those
concatenated (with the unicode encoding chosen recorded in the result's
encoding attribute). There's a rough sketch of that rule in the P.S.
below.

This is not a fully developed idea, and there has been discussion on
the topic before (even between us ;-) but I thought another round might
bring out your current thinking on it ;-)

Regards,
Bengt Richter
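P.S. For the s1+s2 rule, something like the following, again purely
hypothetical and leaning on the estr toy from the sketch above; 'utf-8'
here just stands in for whatever the platform unicode encoding would
be:

    # Hypothetical sketch of the proposed concatenation rule, reusing the
    # made-up estr type from the earlier example.
    def concat(s1, s2, platform_encoding='utf-8'):
        # Same tag (including both None): keep it and just join the bytes.
        if s1.encoding == s2.encoding:
            return estr(str(s1) + str(s2), s1.encoding)
        # Different tags: promote each to unicode (None falls back to the
        # current default assumption, i.e. ascii), re-encode in the platform
        # unicode encoding, and record that choice in the result.
        u1 = s1.decode(s1.encoding or 'ascii')
        u2 = s2.decode(s2.encoding or 'ascii')
        return estr((u1 + u2).encode(platform_encoding), platform_encoding)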
-- http://mail.python.org/mailman/listinfo/python-list