FSR:
====

The 'a' in 'a€' and 'a\U0001d11e':
>>> ['{:#010b}'.format(c) for c in 'a€'.encode('utf-16-be')]
['0b00000000', '0b01100001', '0b00100000', '0b10101100']
>>> ['{:#010b}'.format(c) for c in 'a\U0001d11e'.encode('utf-32-be')]
['0b00000000', '0b00000000', '0b00000000', '0b01100001',
 '0b00000000', '0b00000001', '0b11010001', '0b00011110']

Widening the 'a' to 2 bytes, then to 4 bytes, has to be done.

>>> import sys
>>> sys.getsizeof('a€')
42
>>> sys.getsizeof('a\U0001d11e')
48
>>> sys.getsizeof('aa')
27

Unicode/utf*
============

i) ("primary key") Create and use a unique set of encoded code points.
ii) ("secondary key") Depending on the need, memory or performance: utf-8/16/32

Two advantages in light of the above example:
iii) The "a" never has to be re-encoded.
iv) The size of an "a" never exceeds 4 bytes.

It is a hard job to satisfy i), ii), iii) and iv) at the same time.
Is it possible? ;-) The solution is in the problem.

jmf
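
For anyone who wants to reproduce the comparison, here is a minimal sketch. It assumes the FSR picks one width per string (1, 2 or 4 bytes) from the largest code point, as described in PEP 393. The helper names fsr_bytes_per_char and utf8_bytes_of are made up for this illustration and are not part of any library.

```python
def fsr_bytes_per_char(s):
    """Width in bytes of every code point in s under a PEP 393 style layout.

    Assumption: a single width is chosen from the largest code point.
    """
    highest = max(map(ord, s), default=0)
    if highest < 0x100:
        return 1
    if highest < 0x10000:
        return 2
    return 4


def utf8_bytes_of(ch):
    """Number of bytes a single code point takes in UTF-8."""
    return len(ch.encode('utf-8'))


samples = ['aa', 'a€', 'a\U0001d11e']

# Width of the 'a' in each sample under the FSR-like rule:
# it depends on the other characters in the string.
fsr_widths = [fsr_bytes_per_char(s) for s in samples]

# Width of the same 'a' in UTF-8: independent of its neighbours.
utf8_widths = [utf8_bytes_of(s[0]) for s in samples]

print(fsr_widths)   # [1, 2, 4]
print(utf8_widths)  # [1, 1, 1]
```

The sketch only counts the per-character payload. It ignores the object header that sys.getsizeof also reports, which varies with the Python version and platform.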