Chris Angelico wrote:

> On Mon, Aug 15, 2011 at 4:20 PM, Artie Ziff <artie.z...@gmail.com> wrote:
>> if I am using the standard csv library to read contents of a csv file
>> which contains Unicode strings (short example:
>> '\xe8\x9f\x92\xe8\x9b\x87'), how do I use a python Unicode method such as
>> decode or encode to transform this string type into a python unicode
>> type? Must I know the encoding (byte groupings) of the Unicode? Can I get
>> this from the file? Perhaps I need to open the file with particular
>> attributes?
>
> Start here:
>
> http://www.joelonsoftware.com/articles/Unicode.html
>
> The CSV file, being stored on disk, cannot contain Unicode strings; it
> can only contain bytes. If you know the encoding (eg UTF-8, UCS-2,
> etc), then you can decode it using that. If you don't, your best bet
> is to ask the origin of the file; failing that, check the first few
> bytes - if it's "\xFF\xFE" or "\xFE\xFF" or "\xEF\xBB\xBF", then it's
> probably UTF-16LE, UTF-16BE, or UTF-8, respectively (those being the
> encodings of the BOM). There may be other clues, too, but normally
> it's best to get the encoding separately from the data rather than try
> to decode it from the data itself.
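The BOM check described above can be sketched roughly as follows. This is only an illustration, written for Python 3: `sniff_bom` and `decode_with_bom` are names made up here, and falling back to UTF-8 when no BOM is present is an assumption, not something the file itself guarantees.

```python
import codecs

# BOM byte sequences and the encoding each one suggests (the three named above).
# Caveat: the UTF-32LE BOM also starts with FF FE; this sketch ignores UTF-32.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def sniff_bom(data):
    """Return (encoding, bom_length) if data starts with a known BOM, else (None, 0)."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0


def decode_with_bom(data, default="utf-8"):
    """Decode bytes, honouring a leading BOM; 'default' is a guess used otherwise."""
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or default)


# The sample bytes from the question carry no BOM, so the UTF-8 guess is used.
sample = decode_with_bom(b"\xe8\x9f\x92\xe8\x9b\x87")  # two characters: U+87D2 U+86C7
```

A missing BOM proves nothing about the encoding, so the `default` argument should come from whoever produced the file whenever that is possible.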
As this problem is not a new one, there are several more (if I may say
so) pythonic approaches:

<http://stackoverflow.com/questions/436220/python-is-there-a-way-to-determine-the-encoding-of-text-file>

When I improved Billy Mays' "matching brackets" checker, chardet worked
for me (the test file was UTF-8-encoded). Watch for word-wrap:

-----------------------------------------------------------------------
# encoding: utf-8
'''
Created on 2011-07-18

@author: Thomas 'PointedEars' Lahn <pointede...@web.de>,
based on an idea of Billy Mays
<81282ed9a88799d21e77957df2d84bd6514d9...@myhashismyemail.com>
in <news:j01ph6$knt$1...@speranza.aioe.org>
'''
import sys, os, chardet

# Maps each closing bracket to its opening counterpart.
pairs = {u'}': u'{', u')': u'(', u']': u'[',
         u'”': u'“', u'›': u'‹', u'»': u'«',
         u'】': u'【', u'〉': u'〈', u'》': u'《',
         u'」': u'「', u'』': u'『'}
valid = set(v for pair in pairs.items() for v in pair)

if __name__ == '__main__':
    for dirpath, dirnames, filenames in os.walk(sys.argv[1]):
        for name in filenames:
            file_path = os.path.join(dirpath, name)
            with open(file_path, 'rb') as f:
                # Read the bytes once; they are needed both for
                # detection and for decoding.
                data = f.read()

            encoding = chardet.detect(data)['encoding']
            if encoding is None:
                print('%s: skipped (encoding not detected)' % name)
                continue

            # Decode the whole file before splitting it into lines, so
            # that multi-byte encodings are not cut at a raw newline byte.
            lines = enumerate(data.decode(encoding).splitlines(), 1)
            chars = ((c, line_no, col)
                     for line_no, line in lines
                     for col, c in enumerate(line, 1)
                     if c in valid)

            stack = []        # unclosed opening brackets with positions
            first_bad = None  # first offending bracket, if any
            for c, line_no, col in chars:
                if c in pairs:
                    if stack and stack[-1][0] == pairs[c]:
                        stack.pop()
                    elif first_bad is None:
                        first_bad = (c, line_no, col)
                else:
                    stack.append((c, line_no, col))

            # An opening bracket that is never closed is an error, too.
            if first_bad is None and stack:
                first_bad = stack[0]

            print('%s: %s' % (name, ("good" if first_bad is None
                else "bad '%s' at %s:%s" % first_bad)))
-----------------------------------------------------------------------

HTH

-- 
PointedEars

Please do not Cc: me.