On Mon, May 22, 2017 at 12:13 PM, Gabriel Sandor
<gabi.t.san...@gmail.com> wrote:
> I recently came across the Mozilla Charset Detectors tool, at
> https://www-archive.mozilla.org/projects/intl/chardet.html. I'm working on
> a C# project where I could use a port of this library (e.g.
> https://github.com/errepi/ude) for advanced charset detection.

It's somewhat unfortunate that chardet got ported over to languages
like Python and C# with its shortcomings intact. The main shortcoming
is that, despite "universal" in its name, the detector was rather
arbitrary in what it detected and what it didn't. Why Hebrew and Thai
but not Arabic or Vietnamese? Why have a Hungarian-specific frequency
model (that didn't actually work) but no models for e.g. Polish and
Czech from the same legacy encoding family?

The remaining detector bits in Firefox are for Japanese, Russian and
Ukrainian only, and I strongly suspect that the Russian and Ukrainian
detectors are doing more harm than good.

> I'm not sure however if this tool is deprecated or not, and still
> recommended by Mozilla for use in modern applications. The tool page is
> archived and most of the links are dead, while the code seems to be at
> least 7-8 years old. Could you please tell me what's the status of this
> tool and whether I should use it in my project or look for something else?

I recommend not using it. (I removed most of it from Firefox.)

I recommend avoiding heuristic detection unless your project
absolutely can't do without it. If you *really* need a detector, ICU
and https://github.com/google/compact_enc_det/ might be worth looking
at, though this shouldn't be read as an endorsement of either.
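
To illustrate what doing without a heuristic detector could look like,
here is a minimal Python sketch. It is not code from Firefox or from any
of the libraries above. The function name, the order of the checks, and
the windows-1252 fallback are all assumptions for the example; pick the
fallback that matches where your data actually comes from.

```python
import codecs

# BOM signatures. UTF-32 LE must be checked before UTF-16 LE because
# the UTF-32 LE BOM starts with the same two bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_without_guessing(data, declared=None, fallback="windows-1252"):
    """Return (text, encoding) using only explicit signals.

    Hypothetical helper: no statistical detection is performed.
    """
    # 1. A byte order mark is an explicit in-band signal.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(name, errors="replace"), name

    # 2. A label declared out of band (protocol header, file metadata).
    #    Unknown labels or undecodable bytes fall through to step 3.
    if declared is not None:
        try:
            return data.decode(declared), declared
        except (LookupError, UnicodeDecodeError):
            pass

    # 3. Strict UTF-8 validation. This is a check, not a guess: text in
    #    a legacy 8-bit encoding that contains non-ASCII bytes is very
    #    unlikely to also be well-formed UTF-8.
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass

    # 4. A fixed, configured fallback (an assumption of this sketch).
    return data.decode(fallback, errors="replace"), fallback
```

For example, decode_without_guessing(b"caf\xe9") fails UTF-8 validation
and is decoded with the configured fallback instead of a guessed
encoding.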

With both ICU and https://github.com/google/compact_enc_det/ , watch
out for the detector's guess space: it may contain very rarely used
encodings that you really don't want your content to be misdetected
as.
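
One way to limit the guess space is to filter whatever the detector
reports against an allowlist of encodings you actually expect. The
sketch below is Python and does not call ICU or compact_enc_det. It
assumes the detector output has already been turned into
(label, confidence) pairs, and the allowlist, the threshold, and the
function names are made up for the example.

```python
import codecs

# Hypothetical allowlist: only encodings this application expects.
ALLOWED = ("utf-8", "windows-1252", "windows-1251", "shift_jis")


def _canonical(label):
    """Map an encoding label to Python's canonical codec name, or None."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None


def pick_allowed(candidates, allowed=ALLOWED, min_confidence=0.5):
    """Return the most confident allowed label, or None.

    candidates: iterable of (label, confidence) pairs, assumed to come
    from some detector. Labels outside the allowlist are ignored no
    matter how confident the detector was.
    """
    allowed_names = {_canonical(a) for a in allowed} - {None}
    best = None
    for label, confidence in candidates:
        name = _canonical(label)
        if name not in allowed_names or confidence < min_confidence:
            continue
        if best is None or confidence > best[1]:
            best = (label, confidence)
    return best[0] if best else None
```

With candidates [("utf-7", 0.9), ("windows-1252", 0.6)], pick_allowed
returns "windows-1252" even though the detector preferred UTF-7. When
it returns None, the caller should use a fixed default instead of
trusting the detector's top guess.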

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
