On Sun, 16 Oct 2005 12:16:58 +0200, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
>Bengt Richter wrote:
>> Perhaps string equivalence in keys will be treated like numeric
>> equivalence? I.e., a key/name representation is established by the
>> initial key/name binding, but values can be retrieved by "equivalent"
>> key/names with different representations like unicode vs ascii or
>> latin-1 etc.?
>
>That would require that you know the encoding of a byte string; this
>information is not available at run-time.
>
Well, what will be assumed about name after the lines

    #-*- coding: latin1 -*-
    name = 'Martin Löwis'

? I know type(name) will be <type 'str'> and in itself contain no
encoding information now, but why shouldn't the default assumption for
literal-generated strings be what the coding cookie specified? I know
the current implementation doesn't keep track of the different
encodings that could reasonably be inferred from the source of the
strings, but we are talking about future stuff here ;-)

>You could also try all possible encodings to see whether the strings
>are equal if you chose the right encoding for each one. This would
>be both expensive and unlike numeric equivalence: in numeric
>equivalence, you don't give a sequence of bytes all possible
>interpretations to find some interpretation in which they are
>equivalent, either.
>
Agreed, that would be a mess.

>There is one special case, though: when comparing a byte string
>and a Unicode string, the system default encoding (i.e. ASCII)
>is assumed. This only really works if the default encoding
>really *is* ASCII. Otherwise, equal strings might not hash
>equal, in which case you wouldn't find them properly in a
>dictionary.
>
Perhaps the str (or future byte) type could have an encoding attribute
defaulting to None, meaning its instances are treated like current str
instances. Setting the attribute to some particular encoding, like
'latin-1' (probably internally normalized and optimized to a C pointer
slot holding NULL or a pointer to the appropriate codec or whatever),
would mark the str byte string explicitly as an encoded string, without
changing the byte string data or converting it to a unicode encoding.

With encoding information explicitly present or absent, keys could have
a normalized hash and comparison, maybe by default just normalizing
encoding-tagged string keys to the platform unicode encoding for dict
purposes.

If this were done, IWT the automatic result of

    #-*- coding: latin1 -*-
    name = 'Martin Löwis'

could be that name.encoding == 'latin-1', whereas without the encoding
cookie the default encoding assumption for the program source would be
used and set explicitly, to 'ascii' or whatever it is.

Functions that generate strings, such as chr(), could be assumed to
create a string with the same encoding as the source code of the
chr(...) invocation. Ditto for e.g.

    '%s == %c' % (65, 65)

And

    s = u'Martin Löwis'.encode('latin-1')

would get s.encoding == 'latin-1', not s.encoding == None, so that the
encoding information could make

    print s

mean

    print s.decode(s.encoding)

(which would of course re-encode to the output device encoding for
output, like the current

    print s.decode('latin-1')

and not fail like the current default assumption for s, namely
s.encoding == None, i.e., assume the system default, which likely means

    print s.decode('ascii')

).

Hm, probably s.encode(None) and s.decode(None) could mean: retrieve the
str byte data unchanged, as a str string with encoding set to None in
the result, either way.
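To make the normalized-hash idea concrete, here is a toy sketch of the
sort of thing I have in mind. Nothing like this exists; the estr name,
its constructor, and the normalization rule are all invented on the
spot (a real version would presumably be done in C on the str type
itself), but it shows how encoding-tagged keys could find each other in
a dict:

    # Toy sketch only (Python 2): 'estr' is hypothetical, not a real type.
    class estr(str):
        """A str that drags an optional .encoding tag along with its bytes."""
        def __new__(cls, data, encoding=None):
            self = str.__new__(cls, data)
            self.encoding = encoding
            return self
        def _normalized(self):
            # Untagged (encoding is None) behaves like a plain str; tagged
            # strings normalize to unicode so equivalent keys hash and
            # compare equal regardless of their byte representation.
            if self.encoding is None:
                return str(self)
            return self.decode(self.encoding)
        def __hash__(self):
            return hash(self._normalized())
        def __eq__(self, other):
            if isinstance(other, estr):
                return self._normalized() == other._normalized()
            return str.__eq__(self, other)

    k1 = estr('L\xf6wis', 'latin-1')    # latin-1 bytes for 'Löwis'
    k2 = estr('L\xc3\xb6wis', 'utf-8')  # utf-8 bytes for the same characters
    d = {k1: 'found'}
    print d[k2]                         # prints 'found': equivalent key located

The byte data never changes; only the hash/comparison normalization
looks through the encoding tag.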
Now when you read a file in binary without specifying any encoding
assumption, you would get a str string with .encoding == None, but you
could effectively reinterpret-cast it to any encoding you like by
assigning to the encoding attribute. The attribute could be a property
that causes an automatic decode/encode to create data in the new
encoding. The None encoding, coming or going, would not change the data
bytes, but differing explicit encodings would cause a decode/encode.

This could also support s1+s2: the result would carry the same encoding
attribute if s1.encoding == s2.encoding, and otherwise each operand
would be promoted to the platform standard unicode encoding and those
concatenated (with the unicode encoding chosen recorded in the result's
encoding attribute). There's a rough sketch of that rule in the P.S.
below.

This is not a fully developed idea, and there has been discussion on
the topic before (even between us ;-) but I thought another round might
bring out your current thinking on it ;-)

Regards,
Bengt Richter
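P.S. For the s1+s2 rule, something like the following, again purely
hypothetical and leaning on the estr toy from the sketch above; 'utf-8'
here just stands in for whatever the platform unicode encoding would
be:

    # Hypothetical sketch of the proposed concatenation rule, reusing the
    # made-up estr type from the earlier example.
    def concat(s1, s2, platform_encoding='utf-8'):
        # Same tag (including both None): keep it and just join the bytes.
        if s1.encoding == s2.encoding:
            return estr(str(s1) + str(s2), s1.encoding)
        # Different tags: promote each to unicode (None falls back to the
        # current default assumption, i.e. ascii), re-encode in the platform
        # unicode encoding, and record that choice in the result.
        u1 = s1.decode(s1.encoding or 'ascii')
        u2 = s2.decode(s2.encoding or 'ascii')
        return estr((u1 + u2).encode(platform_encoding), platform_encoding)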
-- http://mail.python.org/mailman/listinfo/python-list