Hi Sara,

On Wed, Apr 27, 2016 at 1:10 AM, Sara Golemon <poll...@php.net> wrote:
> On Tue, Apr 26, 2016 at 2:06 AM, Yasuo Ohgaki <yohg...@ohgaki.net> wrote:
>> Things might have been changed, but as you've mentioned encoding
>> detection is unstable and ICU is poor compared to mbstring's detection
>> at least for Japanese encodings.
>>
> For me, the difference is that I expect further work to be done on
> improving ICU, while I lack that confidence for mbstring. If the API
> is in place early on, the library can improve underneath it to the
> point it becomes more trustworthy later, but still be usable on older
> versions of PHP (linked against newer libicu).
>
> Maybe, I dunno. I lack the motivation to push this feature forward
> atm, merely because it's not trust-worthy now.

I can understand this.

>
>> Developers should not rely on encoding detector, but they should validate
>> encoding.
>>
> I think everyone agrees on that. :)
>
>> Problem is there are cases that developers cannot determine used encoding...
>> If we are going to have this API, it would be better to validate string with
>> detected encoding by default and disable encoding validation optionally.
>> There are cases that developers have to deal with broken string data
>> on occasion.
>>
> What do you have in mind? Full-on pre-request input filtering?
> 'cause that's never worked right (we tried really hard to make PHP6 do
> that and it failed badly)

No, that is not what I have in mind.

>
> Or do you mean something like wrapping the ucsdet API in a coercer
> function that only returned the original string if it detected at high
> confidence and then validated against that detection? 'cause
> honestly, that should also be left to the application IMO.

I don't have a problem with this approach. Developers must be
responsible for this.

For normal web apps, developers must validate that the encoding is the
expected one. In most cases they do not have to guess the encoding.

Developers may need to detect the encoding of uploaded text files, for
example. If they use encoding detection, they should then validate the
text data against the detected encoding.

Experienced developers will detect the encoding and then validate the
text data against it before saving uploaded text files. However, many
developers will detect the encoding and simply assume that the file's
character encoding is valid. This is the reason why I suggest:

 - detect the encoding (this is usually done using only the beginning
   of the data)
 - then validate the text data against the detected encoding
 - if validation succeeds, return the encoding name; otherwise return
   an error

This would reduce the chance of storing invalid text data in the
system. It's not strictly required, but I think it is more developer
friendly. It's just a suggestion.
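
As a rough illustration of the three steps above, here is a minimal
sketch in Python rather than PHP or ICU. It is not the proposed API:
the function name detect_and_validate, the candidate list, and the
head_size parameter are all assumptions made for the example, and
"detection" here is only "the first candidate that decodes the head
cleanly", which is far cruder than what mbstring or ucsdet do.

```python
import codecs
from typing import Optional, Sequence


def detect_and_validate(
    data: bytes,
    candidates: Sequence[str] = ("utf-8", "shift_jis", "euc_jp"),  # assumed list
    head_size: int = 4096,  # assumed size of the sample used for detection
) -> Optional[str]:
    """Return the detected encoding name if all of data is valid in it, else None."""
    head = data[:head_size]

    # Step 1: detect the encoding using only the beginning of the data.
    detected = None
    for name in candidates:
        decoder = codecs.getincrementaldecoder(name)()
        try:
            # final=False tolerates a multibyte character cut off at the
            # end of the head sample.
            decoder.decode(head, final=False)
        except UnicodeDecodeError:
            continue
        detected = name
        break
    if detected is None:
        return None  # no candidate matched; treated as an error

    # Step 2: validate the whole text against the detected encoding.
    try:
        data.decode(detected)
    except UnicodeDecodeError:
        return None  # the head looked fine, but the data is broken later on

    # Step 3: validation passed, so return the encoding name.
    return detected
```

With this shape, a file whose head is valid UTF-8 but which contains a
stray 0xFF byte further on yields None instead of "utf-8", so the
caller cannot store it by accident without checking the result.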

Regards,

--
Yasuo Ohgaki
yohg...@ohgaki.net

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php