John Sampson wrote:

> I notice that the string method 'lower' seems to convert some strings
> (input from a text file) to Unicode but not others.
> This messes up sorting if it is used on arguments of 'sorted' since
> Unicode strings come before ordinary ones.
>
> Is there a better way of case-insensitive sorting of strings in a list?
> Is it necessary to convert strings read from a plaintext file
> to Unicode? If so, how? This is Python 2.7.8.

The standard recommendation is to convert bytes to unicode as early as
possible and only manipulate unicode. This is more likely to give correct
results when slicing or converting a string.

$ cat tmp.txt
ähnlich
üblich
nötig
möglich
Maß
Maße
Masse
ÄHNLICH
$ python
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> for line in open("tmp.txt"):
...     line = line.strip()
...     print line, line.lower()
...
ähnlich ähnlich
üblich üblich
nötig nötig
möglich möglich
Maß maß
Maße maße
Masse masse
ÄHNLICH Ähnlich

Note the last line: on the byte string, lower() leaves the non-ASCII Ä
unchanged.

Now the same with unicode. To read text with a specific encoding use either
codecs.open() or io.open() instead of the built-in open() (replace utf-8
with your actual encoding):

>>> import io
>>> for line in io.open("tmp.txt", encoding="utf-8"):
...     line = line.strip()
...     print line, line.lower()
...
ähnlich ähnlich
üblich üblich
nötig nötig
möglich möglich
Maß maß
Maße maße
Masse masse
ÄHNLICH ähnlich

Unfortunately this will not give the order that you (or a German speaker,
in the example below) will probably expect:

>>> print "".join(sorted(io.open("tmp.txt"), key=unicode.lower))
Masse
Maß
Maße
möglich
nötig
ähnlich
ÄHNLICH
üblich

For case-insensitive sorting you get better results with locale.strxfrm() --
but this doesn't accept unicode:

>>> import locale
>>> locale.setlocale(locale.LC_ALL, "")
'de_DE.UTF-8'
>>> print "".join(sorted(io.open("tmp.txt"), key=locale.strxfrm))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position
0: ordinal not in range(128)

As a workaround you can sort the byte strings first:

>>> print "".join(sorted(open("tmp.txt"), key=locale.strxfrm))
ähnlich
ÄHNLICH
Maß
Masse
Maße
möglich
nötig
üblich

You should still convert the result to unicode if you want to do further
processing in Python.
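
If you would rather keep everything as unicode, the workaround can probably
be turned around: decode early with io.open() and encode only inside the sort
key, so that locale.strxfrm() gets the byte string it needs on Python 2. What
follows is a minimal sketch, not a tested recipe for every platform. The
helper names collation_key, sort_text_lines and read_sorted are made up for
illustration, and it assumes that the encoding you pass in matches the
encoding of the active locale (UTF-8 here).

```python
import io
import locale
import sys

PY2 = sys.version_info[0] == 2


def collation_key(text, encoding="utf-8"):
    """Return a locale-aware sort key for a unicode string.

    Assumption: `encoding` matches the encoding of the active locale.
    """
    if PY2:
        # Python 2: locale.strxfrm() wants a byte string, so encode here only.
        return locale.strxfrm(text.encode(encoding))
    # Python 3: locale.strxfrm() accepts text directly.
    return locale.strxfrm(text)


def sort_text_lines(lines):
    """Sort unicode strings by the current LC_COLLATE rules."""
    return sorted(lines, key=collation_key)


def read_sorted(path, encoding="utf-8"):
    """Read a text file as unicode and return its stripped lines, sorted."""
    with io.open(path, encoding=encoding) as f:
        return sort_text_lines(line.strip() for line in f)
```

Call locale.setlocale(locale.LC_ALL, "") once beforehand, as in the session
above, and then use read_sorted("tmp.txt"). Without that call the "C" locale
stays in effect and the result is plain byte order. The strings you get back
are unicode, so no separate decoding step is needed afterwards.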