Re: Detect character encoding

Martin v. Löwis Sun, 04 Dec 2005 23:05:42 -0800

Diez B. Roggisch wrote:
> So cp1250 doesn't have all codepoints defined - but the others have. 
> Sure, this helps you to eliminate 1 of the three choices the OP wanted 
> to choose between - but how many texts you have that have a 129 in them?


For the iso8859 ones, you should assume that the bytes in
range(128, 160) (the C1 control codes) really aren't used. If you get
one of these, and the text is not utf-8, it is a Windows code page.
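
A minimal sketch of that rule, assuming Python 3 bytes objects. The
function names are made up for illustration, and the result is only a
heuristic, not a real detector:

```python
def is_utf8(data: bytes) -> bool:
    """True if data decodes cleanly with the strict UTF-8 codec."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def looks_like_windows_codepage(data: bytes) -> bool:
    """Apply the rule of thumb: a byte in range(128, 160) that is not
    part of valid UTF-8 points to a Windows code page rather than
    one of the iso8859 encodings."""
    has_c1_byte = any(128 <= b < 160 for b in data)
    return has_c1_byte and not is_utf8(data)
```

Note that it says nothing about which Windows code page is in use;
telling those apart needs other evidence.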

UTF-8 can be recognized pretty reliably: even though it allows nearly all
byte values to appear, it is very constrained in what sequences of bytes
it allows. E.g. you can't have a single byte >127 in UTF-8; you need at
least two of them in a row, and they need to meet further constraints.
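
To illustrate how strict those sequences are, here is a small sketch
that leans on Python's built-in strict UTF-8 decoder (Python 3 bytes
assumed; `utf8_ok` is just an illustrative name):

```python
def utf8_ok(data: bytes) -> bool:
    """True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Lead byte 0xC3 followed by continuation byte 0xB6: well-formed.
two_byte_sequence = utf8_ok(b"\xc3\xb6")

# A single byte >127 on its own is never valid.
lone_high_byte = utf8_ok(b"\xf6")

# A lead byte followed by an ASCII byte instead of a continuation byte.
broken_sequence = utf8_ok(b"\xc3x")

# A continuation byte without a preceding lead byte.
stray_continuation = utf8_ok(b"\xb6")
```

Only the first of the four samples passes, so text in an 8-bit encoding
with non-ASCII characters will rarely decode as UTF-8 by accident.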

Regards,
Martin
-- 
http://mail.python.org/mailman/listinfo/python-list
