Re: [PHP-DEV] [RFC] IntlCharsetDetector

Stanislav Malyshev Tue, 26 Apr 2016 15:52:35 -0700

Hi!

> For me, the difference is that I expect further work to be done on
> improving ICU, while I lack that confidence for mbstring.  If the API


My experience over the years has been that established, supported
libraries like ICU usually have a better track record of improvement and
maintenance than more niche libraries, though it differs a lot from case
to case. I have no idea, though, how good or bad ICU is at detecting
Asian languages and encodings.

>> Developers should not rely on encoding detector, but they should validate
>> encoding.
>>
> I think everyone agrees on that. :)

True, but also incomplete. There's the ideal case, and there's the real
world. In the ideal case, you know the encodings of everything and
everything is nicely specified and shiny and rainbows and unicorns
abound. Real data, though, is messy and unpredictable and comes from
places and practices that make one shudder. And when it comes to that,
we can either give the developers at least something - an imperfect
encoding detector, with all its caveats - or just ignore the problem and
not give them anything because it does not match our theories, and
leave them to implement even worse hacks. I think the former is a much
better approach.

And of course, detection and validation are different things. A text may
look like a valid string in encoding A but actually be in encoding B.
"Tell me if this data looks like Russian text in KOI-8 or Japanese text
in Shift-JIS" and "tell me if this is valid or invalid UTF-8" are two
completely different tasks.
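
To make the distinction concrete, here is a minimal sketch in Python
(chosen only for brevity; it uses neither ICU nor mbstring, and the
helper names are made up for illustration). It takes Russian text
encoded as KOI8-R and shows that UTF-8 validation gives a clear yes/no
answer, while several single-byte encodings all accept the very same
bytes, so "decodes without error" cannot tell you which one is right.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Validation: a strict yes/no check against one known encoding."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def decodes_without_error(data: bytes, encoding: str) -> bool:
    """Hypothetical helper: does `data` decode cleanly as `encoding`?"""
    try:
        data.decode(encoding, errors="strict")
        return True
    except UnicodeDecodeError:
        return False


russian = "Привет"
koi8_bytes = russian.encode("koi8_r")

# Validation question: "is this valid UTF-8?"
utf8_ok = is_valid_utf8(koi8_bytes)

# Detection question: "which encoding is this?" Several candidates
# accept the bytes, so a clean decode alone cannot pick the right one.
candidates = ["koi8_r", "cp1251", "iso8859_5"]
plausible = [enc for enc in candidates if decodes_without_error(koi8_bytes, enc)]

# Only one of the "valid" decodings gives back the original text;
# the others are well-formed strings that happen to be gibberish.
decoded = {enc: koi8_bytes.decode(enc) for enc in plausible}

print("valid UTF-8:", utf8_ok)
for enc, text in decoded.items():
    print(enc, "->", text, "(correct)" if text == russian else "(mojibake)")
```

A real detector has to go further than this and score candidates by how
plausible the decoded text looks (byte frequencies, language models and
so on), which is exactly why its answer is a guess and not a guarantee.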
-- 
Stas Malyshev
smalys...@gmail.com

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php