On Mon, May 22, 2017 at 12:13 PM, Gabriel Sandor <gabi.t.san...@gmail.com> wrote:
> I recently came across the Mozilla Charset Detectors tool, at
> https://www-archive.mozilla.org/projects/intl/chardet.html. I'm working on
> a C# project where I could use a port of this library (e.g.
> https://github.com/errepi/ude) for advanced charset detection.

It's somewhat unfortunate that chardet got ported over to languages like Python and C# with its shortcomings. The main shortcoming is that despite the name saying "universal", the detector was rather arbitrary in what it detected and what it didn't. Why Hebrew and Thai but not Arabic or Vietnamese? Why have a Hungarian-specific frequency model (that didn't actually work) but no models for e.g. Polish and Czech from the same legacy encoding family?

The remaining detector bits in Firefox are for Japanese, Russian and Ukrainian only, and I strongly suspect that the Russian and Ukrainian detectors are doing more harm than good.

> I'm not sure however if this tool is deprecated or not, and still
> recommended by Mozilla for use in modern applications. The tool page is
> archived and most of the links are dead, while the code seems to be at
> least 7-8 years old. Could you please tell me what's the status of this
> tool and whether I should use it in my project or look for something else?

I recommend not using it. (I removed most of it from Firefox.)

I recommend avoiding heuristic detection unless your project absolutely can't do without. If you *really* need a detector, ICU and https://github.com/google/compact_enc_det/ might be worth looking at, though this shouldn't be read as an endorsement of either.

With both ICU and https://github.com/google/compact_enc_det/ , watch out for the detector's possible guess space containing very rarely used encodings that you really don't want content detected as by mistake.

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
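
To illustrate the two pieces of advice in the reply above, here is a minimal Python sketch (not code from Firefox, ICU, or compact_enc_det). It assumes a hypothetical application that expects only a handful of encodings. The names `ALLOWED`, `decode_without_guessing`, and `pick_allowed` are invented for illustration, and the `windows-1252` fallback is an assumption that a real project would replace with its own configured default. The first function avoids heuristic detection entirely; the second narrows the guess space of whatever detector is in use so that rarely used encodings are never accepted.

```python
import codecs

# Hypothetical allow-list: only the encodings this application expects to see.
ALLOWED = ("utf-8", "windows-1252", "shift_jis")


def decode_without_guessing(data: bytes, fallback: str = "windows-1252") -> str:
    """Decode via a UTF-8 BOM or strict UTF-8, else a fixed configured fallback."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    try:
        return data.decode("utf-8")  # strict: raises on invalid byte sequences
    except UnicodeDecodeError:
        # No frequency models involved: the fallback is a configuration choice.
        return data.decode(fallback, errors="replace")


def pick_allowed(candidates, allowed=ALLOWED):
    """Return the first detector guess that is on the allow-list, else None.

    `candidates` is assumed to be a list of encoding names ordered from most
    to least likely, as reported by whichever detector is in use.
    """
    allowed_names = {codecs.lookup(name).name for name in allowed}
    for name in candidates:
        try:
            canonical = codecs.lookup(name).name
        except LookupError:
            continue  # label unknown to Python; treat it as not allowed
        if canonical in allowed_names:
            return name
    return None
```

If a detector ranked an EBCDIC code page ahead of UTF-8, `pick_allowed(["IBM424", "UTF-8"])` would skip the first guess and return `"UTF-8"`; when nothing on the allow-list matches, the caller gets `None` and can fall back to `decode_without_guessing`.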