Hi, I'm just starting to learn a bit about Unicode. I want to read a UTF-8 encoded file and print out the code points it encodes. After many false starts, here's a script that seems to work, but it strikes me as awfully awkward and unpythonic. Is there a better way?

    def codePoints(s):
        ''' return a list of the Unicode codepoints in the string s '''
        answer = []
        skip = False
        for k, c in enumerate(s):
            if skip:
                skip = False
                answer.append(ord(s[k-1:k+1]))
                continue
            if not 0xd800 <= ord(c) <= 0xdfff:
                answer.append(ord(c))
            else:
                skip = True
        return answer

    if __name__ == '__main__':
        s = open('test.txt', encoding = 'utf8', errors = 'replace').read()
        code = codePoints(s)
        for c in code:
            print('U+'+hex(c)[2:])

Thanks for any help you can give me.

Saul
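
For comparison, here is a minimal sketch of a shorter version. It assumes Python 3.3 or later, where each element of a str is a whole code point, so no surrogate-pair bookkeeping is needed; on older narrow builds that assumption does not hold and something like the skip logic above is still required. The names code_points, format_code_point, and code_points_in_file are made up for this illustration, and the zero-padded uppercase U+XXXX formatting is a choice that differs from the hex(c)[2:] output of the script above.

```python
def code_points(text):
    """Return the Unicode code points in text as a list of ints.

    Assumes Python 3.3+, where iterating a str yields whole code points.
    """
    return [ord(ch) for ch in text]


def format_code_point(cp):
    """Format an int as U+XXXX, padded to at least four uppercase hex digits."""
    return "U+{:04X}".format(cp)


def code_points_in_file(path):
    """Decode the file at path as UTF-8 and return its code points.

    Undecodable bytes are replaced with U+FFFD, as in the original script.
    """
    with open(path, encoding="utf-8", errors="replace") as f:
        return code_points(f.read())


if __name__ == "__main__":
    # 'test.txt' is the file name used in the script above.
    for cp in code_points_in_file("test.txt"):
        print(format_code_point(cp))
```

The with block closes the file promptly, and keeping the formatting in its own function makes it easy to switch back to the unpadded lowercase form if that is preferred.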