Re: [PHP-DEV] [RFC] IntlCharsetDetector

Yasuo Ohgaki Tue, 26 Apr 2016 21:12:07 -0700

Hi Sara,

On Wed, Apr 27, 2016 at 1:10 AM, Sara Golemon <poll...@php.net> wrote:
> On Tue, Apr 26, 2016 at 2:06 AM, Yasuo Ohgaki <yohg...@ohgaki.net> wrote:
>> Things might have been changed, but as you've mentioned encoding
>> detection is unstable and ICU is poor compared to mbstring's detection
>> at least for Japanese encodings.
>>
> For me, the difference is that I expect further work to be done on
> improving ICU, while I lack that confidence for mbstring.  If the API
> is in place early on, the library can improve underneath it to the
> point it becomes more trustworthy later, but still be usable on older
> versions of PHP (linked against newer libicu).
>
> Maybe, I dunno.  I lack the motivation to push this feature forward
> atm, merely because it's not trust-worthy now.


I can understand this.

>
>> Developers should not rely on encoding detector, but they should validate
>> encoding.
>>
> I think everyone agrees on that. :)
>
>> Problem is there are cases that developers cannot determine used encoding...
>> If we are going to have this API, it would be better to validate string with
>> detected encoding by default and disable encoding validation optionally.
>> There are cases that developers have to deal with broken string data
>> on occasion.
>>
> What do you have in mind?  Full-on pre-request input filtering?
> 'cause that's never worked right (we tried really hard to make PHP6 do
> that and it failed badly)

No, that is not what I have in mind.

>
> Or do you mean something like wrapping the ucsdet API in a coercer
> function that only returned the original string if it detected at high
> confidence and then validated against that detection?  'cause
> honestly, that should also be left to the application IMO.

I don't have a problem with this approach. Developers must be responsible
for this.

For normal web apps, developers must validate that the encoding is the
expected one. In most cases, developers do not have to guess the encoding.
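
To illustrate what validating against an expected encoding means, here is a
minimal sketch. It is written in Python rather than PHP, and the function
name is_valid_encoding is hypothetical, not part of any proposed API. It
only shows the idea of a strict decode that rejects malformed input.

```python
def is_valid_encoding(data: bytes, encoding: str = "utf-8") -> bool:
    """Return True if all of `data` decodes cleanly as `encoding`.

    Hypothetical helper for illustration only. An unknown encoding name
    raises LookupError, which is deliberately left to the caller.
    """
    try:
        # Strict mode raises on the first malformed byte sequence.
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

An application would call this on each input with the one encoding it
expects and reject the input when it returns False.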

Developers may need to detect the encoding of uploaded text files, for example.
If they use encoding detection, then they should validate the text data
against the detected encoding.

Experienced developers will detect the encoding and then validate the text data
against it before saving uploaded text files. However, many developers will
detect the encoding and simply assume the file's contents are valid in that
encoding. This is the reason why I suggest

 - detect the encoding (usually done using only the beginning of the data)
 - then validate the whole text data against the detected encoding
 - if validation succeeds, return the encoding name; otherwise return an error.

This would reduce the chance of storing invalid text data in the system.
It's not strictly required, but I think it is more developer-friendly.
It's just a suggestion.
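
The three steps suggested above can be sketched roughly as follows. This is
Python, not PHP, and it is not the ICU detector: detect_encoding is a
hypothetical stand-in that simply picks the first candidate encoding able to
decode the head of the data. The candidate list, the head_size parameter, and
the choice to signal an error by returning None are all assumptions made for
the example.

```python
import codecs
from typing import Optional, Sequence


def detect_encoding(head: bytes, candidates: Sequence[str]) -> Optional[str]:
    """Hypothetical stand-in detector: first candidate that decodes `head`."""
    for enc in candidates:
        decoder = codecs.getincrementaldecoder(enc)()
        try:
            # final=False tolerates a multibyte character cut off at the
            # end of the head, which is expected when slicing a prefix.
            decoder.decode(head, final=False)
        except UnicodeDecodeError:
            continue
        return enc
    return None


def detect_and_validate(
    data: bytes,
    candidates: Sequence[str] = ("utf-8", "shift_jis", "euc_jp"),
    head_size: int = 4096,
) -> Optional[str]:
    """Detect from the head, then validate ALL of the data.

    Returns the encoding name, or None when detection or validation fails.
    """
    # Step 1: detect using only the beginning of the data.
    enc = detect_encoding(data[:head_size], candidates)
    if enc is None:
        return None
    # Step 2: validate the whole payload against the detected encoding.
    try:
        data.decode(enc, errors="strict")
    except UnicodeDecodeError:
        # Step 3 (failure): report an error instead of a wrong encoding.
        return None
    # Step 3 (success): return the encoding name.
    return enc
```

The point of the second step is the case where the head looks fine but
later bytes are broken: detection alone would report an encoding, while the
combined call reports a failure.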

Regards,

--
Yasuo Ohgaki
yohg...@ohgaki.net

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
