Re: Chardet, file, ... and the Flexible String Representation

Steven D'Aprano Fri, 06 Sep 2013 04:02:16 -0700

On Fri, 06 Sep 2013 02:11:56 -0700, wxjmfauth wrote:

> Short comment about the "detection" tools from a previous discussion.
> 
> The tools supposed to detect the coding scheme are all working with a
> simple logical mathematical rule:
> 
> p  ==> q    <==>   non q  ==> non p .


Incorrect.

chardet does a statistical analysis of the bytes, and tries to guess which 
language, and hence which encoding, they are likely to come from. The 
algorithm is described here:

https://github.com/erikrose/chardet/blob/master/docs/how-it-works.html

(although that's rather inconvenient to read), and here:

http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html


chardet is a Python port of the Mozilla charset guesser, so they use the 
same algorithm.
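
To give a rough feel for what "statistical" means here, below is a toy 
sketch. It is emphatically not chardet's real algorithm: the candidate 
list, the scoring rule and the names plausibility() and guess() are all 
made up for illustration, and it only knows how to recognise lowercase 
Russian text.

```python
# Toy illustration only: NOT chardet's actual algorithm.
# Assumption: we only care about three candidate encodings and we
# score a decoding by how much of it looks like lowercase Russian.
CANDIDATES = ["cp1251", "koi8_r", "latin_1"]


def plausibility(data, encoding):
    """Return the fraction of decoded characters that look plausible."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return 0.0  # the bytes are not even valid in this encoding
    if not text:
        return 0.0
    # Count spaces and lowercase Cyrillic letters (U+0430..U+044F).
    good = sum(1 for ch in text if ch == " " or "а" <= ch <= "я")
    return good / len(text)


def guess(data):
    """Return (encoding, score) for the best-scoring candidate."""
    scores = {enc: plausibility(data, enc) for enc in CANDIDATES}
    best = max(scores, key=scores.get)
    return best, scores[best]


sample = "привет мир".encode("cp1251")
best_encoding, best_score = guess(sample)
print(best_encoding, best_score)
```

Note that all three candidates decode the sample without error. Only the 
score separates them, which is why the result is a guess with a 
confidence attached rather than a proof.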


> Shortly  -- and consequence  --  they do not detect a coding scheme they
> only detect "a" possible coding scheme.

That at least is correct.
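
A small sketch of why that is so, using only the standard codecs. The 
sample bytes, the candidate list and the helper name decodings() are my 
own choices for illustration:

```python
def decodings(data, encodings):
    """Map each encoding that accepts the bytes to the text it yields."""
    result = {}
    for enc in encodings:
        try:
            result[enc] = data.decode(enc)
        except UnicodeDecodeError:
            pass  # this encoding is ruled out for these bytes
    return result


data = "café".encode("latin_1")  # the four bytes b'caf\xe9'
candidates = ["utf_8", "latin_1", "cp1252", "koi8_r", "cp437"]
possible = decodings(data, candidates)

for enc, text in possible.items():
    print(enc, repr(text))
```

UTF-8 is ruled out because the final byte starts a multi-byte sequence 
that never finishes, but every other candidate accepts the bytes, each 
giving different text. Nothing in the bytes themselves says which one 
the author intended.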


> The Flexible String Representation has conceptually to face the same
> problem. 

No, it doesn't.
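
The difference is that the FSR never has to guess. By the time a str 
exists, its code points are already known, and the width of the 
internal storage follows from the largest one (the rule is given in 
PEP 393). A minimal sketch of that rule follows; the function name 
fsr_width() is hypothetical, and this models the decision only, not 
CPython's actual C implementation:

```python
def fsr_width(s):
    """Bytes per character that the PEP 393 rule implies for s."""
    largest = max(map(ord, s), default=0)
    if largest < 0x100:      # everything fits in Latin-1
        return 1
    if largest < 0x10000:    # everything fits in the BMP
        return 2
    return 4                 # at least one astral character


for text in ["abc", "café", "€uro", "\U0001F600"]:
    print(repr(text), fsr_width(text))
```

The answer is computed from the string, not inferred from ambiguous 
bytes, so there is exactly one correct result and no "possible" one.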


-- 
Steven
-- 
https://mail.python.org/mailman/listinfo/python-list
