On 20/12/2010 16:08, Alec Battles wrote:
> Unicode interoperability is a pain, though, and I find it depressing to work with in Python 2.x, because it never seems to behave predictably. I still have no idea why tokenizing Hungarian text and tokenizing German text are not fundamentally the same operation.
I have no idea why they're not:

<code - untested>
import codecs

with codecs.open("german.txt", "rb", encoding="utf8") as f:
    german_text = f.read()

with codecs.open("hungarian.txt", "rb", encoding="utf8") as f:
    hungarian_text = f.read()

# do_stuff_with(german_text)
# do_stuff_with(hungarian_text)
</code>

Of course, I'm assuming that you know what encoding has been used to serialise the text, but if you don't then it's not Python's fault ;)

TJG

_______________________________________________
python-uk mailing list
python-uk@python.org
http://mail.python.org/mailman/listinfo/python-uk
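[Editor's note: to illustrate the point above, here is a minimal sketch of why tokenizing German and Hungarian becomes the same operation once the bytes are decoded to unicode. The `tokenize` helper and the sample strings are invented for illustration, not from the original message.]

```python
# -*- coding: utf-8 -*-
import re

def tokenize(text):
    # Once bytes have been decoded to a unicode string, \w matches
    # accented letters too, so the same pattern works for any
    # Latin-script language (re.UNICODE is the default in Python 3).
    return re.findall(r"\w+", text, re.UNICODE)

# One function, two languages, identical behaviour:
print(tokenize(u"Über die schöne Brücke"))  # German
print(tokenize(u"Öt szép szűcs"))           # Hungarian
```

The language-specific part is entirely in the decoding step (knowing the file's encoding); after that, tokenization sees only unicode code points.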