Kurt Mueller wrote:
> I have to say that I am a bit disappointed by the chardet library.
> The encoding for the single character 'ü'
> is detected as {'confidence': 0.99, 'encoding': 'EUC-JP'},
> whereas "file" says:
> $ echo "ü" | file -i -
> /dev/stdin: text/plain; charset=utf-8
> $
>
> "ü" is a character I use very often, as it is in my name: "Müller" :-)

You cannot determine an encoding from a single letter. The UTF-8 bytes for "ü" (0xC3 0xBC) are also the valid EUC-JP encoding of "端", so why should "ü" be more likely than "端"? The only thing you can blame chardet for is that its confidence rating of 0.99 is a flat-out lie...

For "Müller", on the other hand, you could probably come up with a (simple) heuristic: "ü" is more likely to be surrounded by ASCII letters than "端".
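
To make that concrete, here is a rough sketch of such a heuristic. It is not how chardet works; the function names `neighbour_score` and `guess`, the +1/-1 scoring, and the choice of only two candidate encodings are all made up for illustration. It rewards accented Latin letters that sit next to ASCII letters and penalises any other non-ASCII character in that position:

```python
import unicodedata


def _is_ascii_letter(ch):
    # An empty string (no neighbour) is never a letter.
    return ch.isascii() and ch.isalpha()


def neighbour_score(text):
    """Hypothetical score: +1 for each non-ASCII Latin letter next to an
    ASCII letter, -1 for any other non-ASCII character next to one."""
    score = 0
    for i, ch in enumerate(text):
        if ord(ch) < 128:
            continue
        left = text[i - 1] if i > 0 else ""
        right = text[i + 1] if i + 1 < len(text) else ""
        if _is_ascii_letter(left) or _is_ascii_letter(right):
            is_latin = unicodedata.name(ch, "").startswith("LATIN")
            score += 1 if is_latin else -1
    return score


def guess(data, candidates=("utf-8", "euc-jp")):
    """Return the best-scoring candidate encoding, or None on a tie."""
    scores = {}
    for enc in candidates:
        try:
            scores[enc] = neighbour_score(data.decode(enc))
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
    if not scores:
        return None
    best = max(scores.values())
    winners = [enc for enc, s in scores.items() if s == best]
    return winners[0] if len(winners) == 1 else None


if __name__ == "__main__":
    raw = "ü".encode("utf-8")
    # The same two bytes decode cleanly either way.
    print(raw, raw.decode("utf-8"), raw.decode("euc-jp"))
    print(guess(raw))                      # no context, so no decision
    print(guess("Müller".encode("utf-8")))
```

With a lone "ü" there are no neighbours to look at, so both candidates tie and `guess` returns None, which is the honest answer. With "Müller" the surrounding ASCII letters tip the balance towards UTF-8.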