On Jan 15, 2014, at 4:50 AM, Robin Becker <ro...@reportlab.com> wrote:

> On 15/01/2014 12:13, Ned Batchelder wrote:
> ........
>>> On my utf8 based system
>>> 
>>> 
>>>> robin@everest ~:
>>>> $ cat ooo.py
>>>> if __name__=='__main__':
>>>>    import sys
>>>>    s='A̅B'
>>>>    print('version_info=%s\nlen(%s)=%d' % (sys.version_info,s,len(s)))
>>>> robin@everest ~:
>>>> $ python ooo.py
>>>> version_info=sys.version_info(major=3, minor=3, micro=3,
>>>> releaselevel='final', serial=0)
>>>> len(A̅B)=3
>>>> robin@everest ~:
>>>> $
>>> 
>>> 
> ........
>> You are right that more than one codepoint makes up a grapheme, and that 
>> you'll
>> need code to deal with the correspondence between them. But let's not muddy
>> these already confusing waters by referring to that mapping as an encoding.
>> 
>> In Unicode terms, an encoding is a mapping between codepoints and bytes.  
>> Python
>> 3's str is a sequence of codepoints.
>> 
> Semantics is everything. For me graphemes are the endpoint (or should be); to 
> get a proper rendering of a sequence of graphemes I can use either a sequence 
> of bytes or a sequence of codepoints. They are both encodings of the 
> graphemes; what unicode says is an encoding doesn't define what encodings are 
> ie mappings from some source alphabet to a target alphabet.

But you’re talking about two levels of encoding. One runs on top of the other. 
So insisting that you be able to call them all encodings, makes the term 
pointless, because now it’s ambiguous as to what you’re referring to. Are you 
referring to encoding in the sense of representing code points with bytes? Or 
are you referring to what the unicode guys call “forms”?

For example, the NFC form of ‘ñ’ is ’\u00F1’. ‘nThe NFD form represents the 
exact same grapheme, but is ‘\u006e\u0303’. You can call them encodings if you 
want, but I echo Ned’s sentiment that you keep that to yourself. 
Conventionally, they’re different forms, not different encodings. You can 
encode either form with an encoding, e.g.

'\u00F1'.encode('utf8’)
'\u00F1'.encode('utf16’)

'\u006e\u0303'.encode('utf8’)
'\u006e\u0303'.encode('utf16')

-- 
https://mail.python.org/mailman/listinfo/python-list

Reply via email to