On 29.08.2013 11:12, Peter Otten wrote:
> kurt.alfred.muel...@gmail.com wrote:
>> On Wednesday, August 28, 2013 1:13:36 PM UTC+2, Dave Angel wrote:
>>> On 28/8/2013 04:32, Kurt Mueller wrote:
>>>> For some text manipulation tasks I need a template to split lines
>>>> from stdin into a list of strings the way shlex.split() does it.
>>>> The encoding of the input can vary.
> You can compromise and read ahead a limited number of lines. Here's my demo
> script (The interesting part is detect_encoding(), I got a bit distracted by
> unrelated stuff...). The script does one extra decode/encode cycle -- it
> should be easy to avoid that if you run into performance issues.

Thanks Peter! I see the idea: it bounds the buffer size and memory usage of
the detection.

I have to say that I am a bit disappointed by the chardet library. The
encoding of the single character 'ü' is detected as
{'confidence': 0.99, 'encoding': 'EUC-JP'}, whereas "file" says:

$ echo "ü" | file -i -
/dev/stdin: text/plain; charset=utf-8
$

'ü' is a character I use very often, as it is in my name: "Müller" :-)

So I am trying the "python-magic" library instead. It offers functionality
similar to chardet, it is what the "file" Unix command uses, and it can be
extended with a magic file (see "man file"). My magic_test script:

-------------------------------------------------------------------
#!/usr/bin/env python
# vim: set fileencoding=utf-8 :
from __future__ import print_function
import magic

strg_chck = 'ü'
# Open a magic cookie that reports only the MIME encoding (charset).
magc_enco = magic.open(magic.MAGIC_MIME_ENCODING)
magc_enco.load()
print(strg_chck + ' encoding=' + magc_enco.buffer(strg_chck))
magc_enco.close()
-------------------------------------------------------------------

$ magic_test
ü encoding=utf-8

python-magic seems to me a bit more reliable.

Cheers
-- 
Kurt Mueller
-- 
http://mail.python.org/mailman/listinfo/python-list
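P.S.: For short inputs where statistical detection misfires (as with the
lone 'ü' above), a simpler try-in-order fallback is often good enough for
the original shlex.split() task. The sketch below is my own illustration,
not code from the thread; the candidate list and helper names are made up,
and latin-1 is chosen last only because it accepts any byte sequence:

-------------------------------------------------------------------
#!/usr/bin/env python
# Hypothetical sketch: decode each raw input line by trying candidate
# encodings in order, then split the result the way shlex.split() does.
from __future__ import print_function
import shlex

# Assumed candidate list; latin-1 never raises, so it acts as the fallback.
CANDIDATES = ("utf-8", "latin-1")

def decode_line(raw):
    """Return the first successful decoding of the byte string `raw`."""
    for enc in CANDIDATES:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue

def split_line(raw):
    """Decode `raw`, then split it into a list of strings shlex-style."""
    return shlex.split(decode_line(raw))

print(split_line(b'M\xc3\xbcller "two words"'))
-------------------------------------------------------------------

This cannot tell utf-8 from, say, cp1252 when both happen to decode, but it
never guesses an exotic charset for a two-byte input the way chardet did.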