Kurt Mueller wrote:
> On 29.08.2013 11:12, Peter Otten wrote:
>> kurt.alfred.muel...@gmail.com wrote:
>>> On Wednesday, August 28, 2013 1:13:36 PM UTC+2, Dave Angel wrote:
>>>> On 28/8/2013 04:32, Kurt Mueller wrote:
>>>>> For some text manipulation tasks I need a template to split lines
>>>>> from stdin into a list of strings the way shlex.split() does it.
>>>>> The encoding of the input can vary.
>
>> You can compromise and read ahead a limited number of lines. Here's my
>> demo script (the interesting part is detect_encoding(); I got a bit
>> distracted by unrelated stuff...). The script does one extra
>> decode/encode cycle -- it should be easy to avoid that if you run into
>> performance issues.
>
> I took your script as a template,
> but I used the libmagic library (python-magic) instead of chardet.
> See http://linux.die.net/man/3/libmagic
> and https://github.com/ahupp/python-magic
> (I made tests with files of different sizes, up to 1.2 GB.)
>
> I had the following issues:
>
> - In a real file, the encoding was detected as 'ascii' with
>   detect_lines=1000, but line 1002 contained an umlaut character,
>   so line.decode(encoding) failed. I am thinking of adding the
>   errors parameter: line.decode(encoding, errors='replace')
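errors='replace' would indeed keep the decode step from failing; every
byte sequence that is invalid in the detected encoding becomes U+FFFD
instead. A minimal sketch (decode_lines() is an illustrative name, not
part of either script):

    def decode_lines(instream, encoding):
        # Substitute U+FFFD REPLACEMENT CHARACTER for undecodable byte
        # sequences instead of raising UnicodeDecodeError somewhere in
        # the middle of the stream. The positional form of the errors
        # argument works on both Python 2 and 3.
        for line in instream:
            yield line.decode(encoding, 'replace')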
Tough luck ;) You could try and tackle the problem by skipping leading
ascii-only lines. Untested:

    from itertools import chain, islice

    import chardet

    def detect_encoding(instream, encoding, detect_lines, skip_ascii=True):
        if encoding is None:
            encoding = instream.encoding
        if encoding is None:
            head = []
            if skip_ascii:
                # Pass leading ascii-only lines through untouched; the
                # first line that fails to decode becomes the start of
                # the detection sample.
                try:
                    for line in instream:
                        yield line.decode("ascii")
                except UnicodeDecodeError:
                    head.append(line)
                else:
                    return  # the whole stream was ascii
            head.extend(islice(instream, detect_lines - len(head)))
            encoding = chardet.detect("".join(head))["encoding"]
            # Replay the sampled lines before the rest of the stream.
            instream = chain(head, instream)
        for line in instream:
            yield line.decode(encoding)

Or keep two lists, one with all lines and one with only the non-ascii
lines, and read until there are enough lines in the non-ascii list to
make a good guess. Then take that list to determine the encoding. You
can even combine both approaches... (a sketch of the two-list variant
follows after the quoted remainder below)

> - If the buffer was bigger than about a few megabytes, the encoding
>   returned by libmagic was always None. The big files had very long
>   lines (more than 4k per line), so with detect_lines=1000 this
>   limit was exceeded.
>
> - magic.buffer() (the equivalent of chardet.detect()) takes about 2
>   seconds per megabyte of buffer.
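A rough, untested sketch of that two-list variant (detect_encoding2(),
is_ascii() and min_nonascii are illustrative names, not from the script
above):

    from itertools import chain

    import chardet

    def is_ascii(line):
        # A line counts as ascii if it decodes as such without error.
        try:
            line.decode("ascii")
        except UnicodeDecodeError:
            return False
        return True

    def detect_encoding2(instream, detect_lines, min_nonascii=20):
        head = []      # every line read so far, replayed later
        nonascii = []  # only the lines that are not pure ascii
        for line in instream:
            head.append(line)
            if not is_ascii(line):
                nonascii.append(line)
            # Stop once the non-ascii sample is big enough for a good
            # guess, or once the overall read-ahead limit is reached.
            if len(nonascii) >= min_nonascii or len(head) >= detect_lines:
                break
        # Guess from the non-ascii sample if there is one; otherwise
        # fall back to everything read (the input may be pure ascii,
        # in which case chardet returns None and ascii is assumed).
        sample = nonascii or head
        encoding = chardet.detect(b"".join(sample))["encoding"] or "ascii"
        for line in chain(head, instream):
            yield line.decode(encoding)

Sampling only the non-ascii lines also keeps the buffer handed to the
detector small, which should help with the size limit and the roughly
2 seconds per megabyte you measured for magic.buffer().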