Pierre Joye wrote:
> On Tue, Jun 21, 2011 at 1:33 PM, Lester Caine<les...@lsces.co.uk> wrote:
>> Pierre Joye wrote:
>>>>
>>>> It depended on ICU there, and I would be against making a core thing in
>>>>> PHP 5.x depend on ICU.
>>>
>>> It can and should be done as part of intl, actually.
>>>
>>> But that's somehow unrelated to the proposal here, as it is about
>>> byte, not characters :)
>>
>> I believe this may be where some of the new niggles may be coming from? With
>> browsers returning unicode, it may be that some of the 'extra' characters
>> are being returned as multibyte rather than as single bytes? Such as the
>> problem reported on the general list currently. How do we ensure that we are
>> dealing with single byte character strings nowadays?
>
> As it has been stated numerous times in this thread and other, we do
> not do anything with multi bytes systems, unicode, etc. mbstring and
> intl do, but php's string as of now is all about bytes, array of bytes
> if I may describe them this way.
>
> And we can't change this behavior.
This mindset is fundamentally broken. You can call it a byte array all you want, but the truth is that 99.999% of the time, when a developer uses a string they need it for characters, not bytes, and characters are not single bytes. Even English-speaking users submit characters outside the ASCII range at an alarming rate. If you're using a WYSIWYG editor, Chrome will submit non-breaking spaces as the actual UTF-8 encoded character, not as an HTML entity. Whether developers like it, or even know it, supporting an extended universal character set is not really optional.

PHP makes this bad enough with its whole collection of bytewise string functions, including many with no appropriate multibyte-aware replacement. But at least those can be avoided, quickly audited, and fixed in the future in any number of ways with only a nominal BC impact.

Hard coding this single byte idiocy into a language construct (foreach), though, would be an incredibly awful idea. It would create a trap for new, naive PHP developers, and a character set problem that the language could NEVER recover from without a massive BC break.

This proposal is really about adding a feature which, whenever it is used, is almost guaranteed to be an error. It probably won't look like an error to the developer during simple testing, but it will almost certainly show up as one in production. Is it really worth all that for a bit of syntactic sugar that the developer will have to strip out anyway to fix their bug?

If string iteration needs to be addressed in the core (and IMO it doesn't, because it can be handled at the script level, but if it does), why not use iterator classes? That gives the same functionality and prevents the language from encouraging hidden bugs.

John Crenshaw
Priacta, Inc.

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
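
To make the byte-versus-character point above concrete, here is a minimal sketch of what byte-wise iteration does to a UTF-8 non-breaking space, next to a script-level iterator class that walks characters instead. The class name Utf8CharIterator and the sample string are hypothetical, chosen only for illustration; the sketch assumes UTF-8 input and a PHP version with return type declarations (7.0 or later; on 5.x the ": Traversable" would simply be dropped). It is one possible approach, not a proposal for how core should do it.

```php
<?php
// Hypothetical input: "a", a UTF-8 non-breaking space (U+00A0), then "b".
$text = "a\xC2\xA0b";

// Byte-wise view: this is what iterating a string as a byte array exposes.
// The two-byte NBSP is split into two meaningless halves.
$bytes = str_split($text);   // 4 elements: "a", "\xC2", "\xA0", "b"

// Character-wise view through a script-level iterator class.
// Utf8CharIterator is an illustrative name, not an existing PHP class.
class Utf8CharIterator implements IteratorAggregate
{
    private $chars;

    public function __construct($string)
    {
        // The /u modifier makes PCRE treat the subject as UTF-8, so an
        // empty pattern splits between characters rather than bytes.
        // preg_split() returns false when the input is not valid UTF-8.
        $chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
        if ($chars === false) {
            throw new InvalidArgumentException('Input is not valid UTF-8');
        }
        $this->chars = $chars;
    }

    public function getIterator(): Traversable
    {
        return new ArrayIterator($this->chars);
    }
}

// Collect each character as hex so the difference is visible.
$hex = array();
foreach (new Utf8CharIterator($text) as $char) {
    $hex[] = bin2hex($char);
}
// $hex is array("61", "c2a0", "62"): three characters, NBSP kept whole.
// array_map('bin2hex', $bytes) gives array("61", "c2", "a0", "62").
```

With plain ASCII test data both views agree, which is why the byte-wise version tends to pass simple testing and only misbehave once real user input arrives.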