Hi,

I'm just starting to learn a bit about Unicode. I want to be able to read a 
UTF-8 encoded file and print out the code points it encodes. After many false 
starts, here's a script that seems to work, but it strikes me as awfully 
awkward and unpythonic. Is there a better way?

def codePoints(s):
    ''' return a list of the Unicode codepoints in the string s '''
    answer = []
    skip = False
    for k, c in enumerate(s):
        if skip:
            skip = False
            answer.append(ord(s[k-1:k+1]))
            continue
        if not 0xd800 <= ord(c) <= 0xdfff:
            answer.append(ord(c))
        else:
            skip = True
    return answer
            
if __name__ == '__main__':
    # Use a context manager so the file is closed once it has been read.
    with open('test.txt', encoding='utf8', errors='replace') as f:
        s = f.read()
    code = codePoints(s)
    for c in code:
        print('U+'+hex(c)[2:])
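
For comparison, here is a minimal sketch of the same idea with the surrogate
arithmetic written out, so it does not depend on ord() accepting a
two-character string (which only narrow builds of Python 3.0 to 3.2 allow).
The names code_points and format_code_point are hypothetical, and keeping
unpaired surrogates unchanged is an assumption of this sketch, not something
the script above does.

```python
def code_points(s):
    """Return the Unicode code points in s, combining surrogate pairs."""
    answer = []
    pending_high = None  # high surrogate waiting for its low partner
    for c in s:
        n = ord(c)
        if pending_high is not None:
            if 0xDC00 <= n <= 0xDFFF:
                # Combine a high/low surrogate pair into one code point.
                answer.append(
                    0x10000 + ((pending_high - 0xD800) << 10) + (n - 0xDC00)
                )
                pending_high = None
                continue
            # Assumption: an unpaired high surrogate is kept as is.
            answer.append(pending_high)
            pending_high = None
        if 0xD800 <= n <= 0xDBFF:
            pending_high = n
        else:
            answer.append(n)
    if pending_high is not None:
        answer.append(pending_high)
    return answer


def format_code_point(n):
    """Format a code point in the usual U+XXXX style (at least 4 hex digits)."""
    return 'U+{:04X}'.format(n)


if __name__ == '__main__':
    for n in code_points('a\u00e9\U0001d11e'):
        print(format_code_point(n))
```

On Python 3.3 and later, where a string holds one character per code point,
[ord(c) for c in s] should give the same result for text decoded from UTF-8.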

Thanks for any help you can give me.

Saul

        
-- 
http://mail.python.org/mailman/listinfo/python-list
