On Wed, Jun 22, 2011 at 1:37 PM, Saul Spatz <saul.sp...@gmail.com> wrote:
> Hi,
>
> I'm just starting to learn a bit about Unicode. I want to be able to read a
> utf-8 encoded file, and print out the codepoints it encodes. After many
> false starts, here's a script that seems to work, but it strikes me as
> awfully awkward and unpythonic. Have you a better way?

Once you have your data as a Unicode string (and you seem to be using
Python 3, so 's' will be a Unicode string), wouldn't a list of its
codepoints simply be this?

    for c in s:
        print('U+'+hex(ord(c))[2:])

But if you do need the codePoints() function, I'd do it as a generator.

> def codePoints(s):
>     ''' return a list of the Unicode codepoints in the string s '''
>     skip = False
>     for k, c in enumerate(s):
>         if skip:
>             skip = False
>             yield ord(s[k-1:k+1])
>             continue
>         if not 0xd800 <= ord(c) <= 0xdfff:
>             yield ord(c)
>         else:
>             skip = True

Your main function doesn't even have to change - it's iterating over
the list, so it may as well iterate over the generator instead.

But I don't really understand what codePoints() does. Is it expecting
the parameter to be a string of bytes or of Unicode characters?

Chris Angelico
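
For comparison, here is a minimal sketch of the same generator idea that combines surrogate pairs by arithmetic rather than by calling ord() on a two-character slice. The slice form only works on narrow builds; on Python 3.3 and later ord() rejects a two-character string, and a str already holds whole code points, so the pairing only matters if surrogates have ended up in the string some other way. The names code_points and format_code_point are hypothetical, not from the original script, and passing unpaired surrogates through unchanged is an assumption about the desired behaviour.

```python
def code_points(s):
    """Yield the code points in s, combining any surrogate pairs."""
    pending_high = None  # a high surrogate waiting for its low partner
    for c in s:
        cp = ord(c)
        if pending_high is not None:
            if 0xDC00 <= cp <= 0xDFFF:
                # Valid pair: rebuild the supplementary-plane code point.
                yield 0x10000 + ((pending_high - 0xD800) << 10) + (cp - 0xDC00)
                pending_high = None
                continue
            # Assumption: an unpaired high surrogate is passed through as is.
            yield pending_high
            pending_high = None
        if 0xD800 <= cp <= 0xDBFF:
            pending_high = cp
        else:
            yield cp
    if pending_high is not None:
        # The string ended on an unpaired high surrogate.
        yield pending_high


def format_code_point(cp):
    """Format a code point in the usual U+XXXX notation."""
    return 'U+{:04X}'.format(cp)
```

A caller could then loop with "for cp in code_points(s): print(format_code_point(cp))", where s comes from a file opened with encoding='utf-8'.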