Re: [PHP-DEV] [RFC] IntlCharsetDetector

Sara Golemon Tue, 26 Apr 2016 09:10:31 -0700

On Tue, Apr 26, 2016 at 2:06 AM, Yasuo Ohgaki <yohg...@ohgaki.net> wrote:
> Things might have been changed, but as you've mentioned encoding
> detection is unstable and ICU is poor compared to mbstring's detection
> at least for Japanese encodings.
>
For me, the difference is that I expect further work to be done on
improving ICU, while I lack that confidence for mbstring.  If the API
is in place early on, the library can improve underneath it to the
point it becomes more trustworthy later, but still be usable on older
versions of PHP (linked against newer libicu).


Maybe, I dunno.  I lack the motivation to push this feature forward
atm, merely because it's not trustworthy now.

> Developers should not rely on encoding detector, but they should validate
> encoding.
>
I think everyone agrees on that. :)

> Problem is there are cases that developers cannot determine used encoding...
> If we are going to have this API, it would be better to validate string with
> detected encoding by default and disable encoding validation optionally.
> There are cases that developers have to deal with broken string data
> on occasion.
>
What do you have in mind?  Full-on pre-request input filtering?
'cause that's never worked right (we tried really hard to make PHP6 do
that and it failed badly).

Or do you mean something like wrapping the ucsdet API in a coercer
function that only returned the original string if it detected at high
confidence and then validated against that detection?  'cause
honestly, that should also be left to the application IMO.
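
For illustration only, here is a rough sketch of what such a coercer could
look like in userland. Everything named here is hypothetical: the function
name coerce_if_confident(), the $detect callable (a stand-in for a
ucsdet-style detector that returns a charset name and a 0-100 confidence),
and the threshold of 80. Validation uses mbstring's mb_check_encoding(),
and the sketch assumes the detected charset name is one mbstring recognizes.

```php
<?php
// Hypothetical helper; not part of PHP, ext/intl, or the RFC.
// $detect takes the raw bytes and returns [string $charset, int $confidence],
// with confidence on a 0..100 scale (an assumption modelled on ucsdet).
function coerce_if_confident(string $bytes, callable $detect, int $minConfidence = 80): ?string
{
    [$charset, $confidence] = $detect($bytes);

    // Step 1: refuse low-confidence guesses outright.
    if ($confidence < $minConfidence) {
        return null;
    }

    // Step 2: validate the bytes against the detected charset.
    try {
        $valid = mb_check_encoding($bytes, $charset);
    } catch (\ValueError $e) {
        // PHP 8 throws when mbstring does not know the charset name.
        $valid = false;
    }

    // Only hand back the original string if both checks passed.
    return $valid ? $bytes : null;
}

// Usage with a stub detector that always claims UTF-8 at confidence 95.
$stub = function (string $s): array {
    return ['UTF-8', 95];
};
var_dump(coerce_if_confident("caf\xC3\xA9", $stub) !== null); // well-formed UTF-8
var_dump(coerce_if_confident("caf\xE9", $stub) === null);     // malformed UTF-8
```

Returning null instead of a best-effort conversion keeps the decision about
broken input with the caller, which is the point being argued above.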

-Sara

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
