Rehceb Rotkiv wrote:
> Hello,
>
> I have this little grep-like program:
>
> ++++++++++snip++++++++++
> #!/usr/bin/python
>
> import sys
> import re
>
> pattern = sys.argv[1]
> inputfile = file(sys.argv[2], 'r')
>
> for line in inputfile:
>     matches = re.findall(pattern, line)
>     if matches:
>         print matches
> ++++++++++snip++++++++++
>
> Like this, the program prints some characters as strange escape
> sequences, which is due to the input file being encoded in utf-8:

So the UTF-8 data gets printed to your terminal which isn't configured for
UTF-8, right?

> When I convert "re.findall..." to a string and wrap an "unicode()" around it,
> the matches get printed correctly.

How do you meaningfully convert it to a string? The matches variable refers
to a list, but you surely don't want to be dealing with the list's string
representation.

> Is it possible to make "matches" unicode without saving it as a single string
> first?

Why not convert your input into Unicode and then, for the benefit of certain
kinds of character classes, use re.findall in Unicode mode (by specifying
re.U as a flag)? Then, each match will be produced as a Unicode object.

> The function "unicode()" seems only to work for strings. Or is there a
> general way of telling
> Python to abandon the ancient and evil land of iso-8859 for good and use
> utf-8 only?

The only refuge from ancient and evil lands is found by climbing the mountain
of Unicode: convert from encoded text as soon as you can, work only with
Unicode objects, produce encoded text only when necessary.

Paul

--
http://mail.python.org/mailman/listinfo/python-list
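
P.S. A minimal sketch of the "decode early, encode late" approach, in case
it helps. It assumes the input file really is UTF-8 and that the pattern has
at most one capturing group (so that findall returns strings, not tuples).
The names find_matches and format_matches are made up for illustration, and
io.open is just one way of getting decoded lines out of a file.

```python
import io
import re


def find_matches(pattern, path, encoding="utf-8"):
    """Yield the list of matches for every line that has at least one."""
    # A pattern taken from the command line may still be encoded bytes;
    # decode it so that pattern and text are both Unicode.
    if isinstance(pattern, bytes):
        pattern = pattern.decode(encoding)
    # re.U makes character classes such as \w follow Unicode rules.
    regex = re.compile(pattern, re.U)
    # io.open decodes while reading, so each line is already Unicode.
    with io.open(path, "r", encoding=encoding) as inputfile:
        for line in inputfile:
            matches = regex.findall(line)
            if matches:
                yield matches


def format_matches(matches, encoding="utf-8"):
    """Join the matches and encode them only at the output boundary."""
    # Assumes each match is a string (no multi-group tuples).
    return u" ".join(matches).encode(encoding)
```

Each list yielded by find_matches holds Unicode objects; run it through
format_matches only at the point where the text leaves the program, instead
of printing the list itself, whose string representation is where the escape
sequences come from.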