Re: split lines from stdin into a list of unicode strings

Peter Otten Thu, 29 Aug 2013 06:17:39 -0700

Kurt Mueller wrote:

> I have to say that I am a bit disappointed by the chardet library.
> The encoding for the single character 'ü'
> is detected as {'confidence': 0.99, 'encoding': 'EUC-JP'},
> whereas "file" says:
> $ echo "ü" | file -i -
> /dev/stdin: text/plain; charset=utf-8
> $
> 
> "ü" is a character I use very often, as it is in my name: "Müller":-)


You cannot determine an encoding from a single letter.

Why should "ü" be more likely than "端"? Both are valid readings of the same
two bytes. The only thing you can blame chardet for is that its confidence
rating is a flat-out lie...
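
To make the ambiguity concrete, here is a minimal Python 3 sketch (the 
variable names are mine, purely for illustration). It decodes the two bytes 
that UTF-8 uses for "ü" with three different codecs; none of them raises an 
error, so the bytes alone cannot settle the question:

```python
# The same two bytes, decoded three ways. All three decodes succeed.
raw = "ü".encode("utf-8")          # b'\xc3\xbc'

as_utf8 = raw.decode("utf-8")      # 'ü'
as_eucjp = raw.decode("euc_jp")    # '端'
as_latin1 = raw.decode("latin-1")  # 'Ã¼'

for name, text in [("utf-8", as_utf8),
                   ("euc_jp", as_eucjp),
                   ("latin-1", as_latin1)]:
    print(name, repr(text))
```

A detector has to pick one of these readings, and with only two bytes of 
input it has nothing to base the choice on.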

For "Müller", on the other hand, you could probably come up with a (simple) 
heuristic: "ü" is more likely to be surrounded by ASCII letters than 
"端".
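
A rough sketch of such a heuristic in Python 3 follows. Everything in it is 
an assumption of mine, not something chardet does: the function names, the 
scoring rule, and the candidate list are hypothetical. A non-ASCII character 
next to an ASCII letter counts +1 if its Unicode name starts with "LATIN" 
and -1 otherwise.

```python
import string
import unicodedata


def neighbour_score(text):
    """Score non-ASCII characters that sit next to an ASCII letter.

    +1 for a Latin-script character (e.g. 'ü'), -1 for anything else
    (e.g. '端'). Non-ASCII characters without an ASCII-letter neighbour
    contribute nothing.
    """
    score = 0
    for i, ch in enumerate(text):
        if ord(ch) < 128:
            continue
        # The characters directly before and after, if they exist.
        neighbours = text[max(i - 1, 0):i] + text[i + 1:i + 2]
        if any(c in string.ascii_letters for c in neighbours):
            is_latin = unicodedata.name(ch, "").startswith("LATIN")
            score += 1 if is_latin else -1
    return score


def score_candidates(raw, candidates=("utf-8", "euc_jp")):
    """Return {encoding: score} for every candidate that decodes raw."""
    scores = {}
    for encoding in candidates:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # not a valid byte sequence in this encoding
        scores[encoding] = neighbour_score(text)
    return scores


name_scores = score_candidates("Müller".encode("utf-8"))
single_scores = score_candidates("ü".encode("utf-8"))
print(name_scores)
print(single_scores)
```

For the bytes of "Müller" the UTF-8 reading scores higher than the EUC-JP 
one, while for the lone "ü" both candidates tie at zero, which is the honest 
answer for a single letter.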


-- 
http://mail.python.org/mailman/listinfo/python-list
