As a userland developer, my location means I constantly work with three languages: English, Russian (Cyrillic) and Latvian (which has its own share of non-Latin characters). I end up using UTF-8 in every project, and some of them give me a headache when it comes to text parsing. mbstring covers only part of the functionality and can be turned off.
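To make the byte/character gap concrete, here is a minimal sketch of the kind of workaround I mean. `utf8_length` is a name I made up for illustration, not an existing PHP function, and the PCRE fallback assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper (not part of PHP): count characters in a UTF-8
// string, using mbstring when it is loaded and PCRE when it is not.
function utf8_length($s)
{
    if (function_exists('mb_strlen')) {
        return mb_strlen($s, 'UTF-8');
    }
    // With the /u modifier "." matches one whole UTF-8 character;
    // /s lets it match newlines as well.
    return preg_match_all('/./us', $s, $matches);
}

// "\xC4\x81" is Latvian a-macron (U+0101), "\xD0\xB6" is Cyrillic
// zhe (U+0436), followed by a plain ASCII "a".
$word = "\xC4\x81\xD0\xB6a";

echo strlen($word), "\n";      // counts bytes
echo utf8_length($word), "\n"; // counts characters
```

On a setup without mbstring function overloading this prints 5 and then 3: three characters, five bytes. The helper has to probe for mbstring at runtime precisely because the extension is optional.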
I personally think something has to be done about Unicode handling in PHP after 5.4, so that we have an official way of dealing with it in the core. It could probably live in a namespace of its own, as new functionality to which people should migrate.

My 2 cents.

On 21.06.2011 17:56, "Tomas Kuliavas" <to...@users.sourceforge.net> wrote:
> 2011.06.21 17:38 John Crenshaw wrote:
>> Pierre Joye wrote:
>>> On Tue, Jun 21, 2011 at 1:33 PM, Lester Caine<les...@lsces.co.uk>
>>> wrote:
>>>> Pierre Joye wrote:
>>>>>> It depended on ICU there, and I would be against making a core
>>>>>> thing in PHP 5.x depend on ICU.
>>>>>
>>>>> It can and should be done as part of intl, actually.
>>>>>
>>>>> But that's somehow unrelated to the proposal here, as it is about
>>>>> byte, not characters :)
>>>>
>>>> I believe this may be where some of the new niggles may be coming
>>>> from? With browsers returning unicode, it may be that some of the
>>>> 'extra' characters are being returned as multibyte rather than as
>>>> single bytes? Such as the problem reported on the general list
>>>> currently. How do we ensure that we are dealing with single byte
>>>> character strings nowadays?
>>>
>>> As it has been stated numerous times in this thread and other, we do
>>> not do anything with multi bytes systems, unicode, etc. mbstring and
>>> intl do, but php's string as of now is all about bytes, array of
>>> bytes if I may describe them this way.
>>>
>>> And we can't change this behavior.
>>
>> This mindset is fundamentally broken. You can call it a byte array all
>> you want, but the truth is that 99.999% of the time, when a developer
>> is using a string they need it for characters, not for bytes, and
>> characters are not single byte. Even English users tend to submit
>> Unicode range characters at an alarming rate. If you're using a
>> WYSIWYG editor, Chrome will submit non-breaking-spaces as the actual
>> UTF8 encoded character, not as an HTML encoded entity. Whether
>> developers like it, or even know it, supporting an extended universal
>> character set is not really optional.
>
> They submit it in utf-8 only if your html form allows them to do that
> or they don't follow html specification and try to exploit your form.
> Set form input charset to iso-8859-1 and your nbspace will take only
> one byte.
>
> --
> Tomas