On 21.06.2011 22:19, Tomas Kuliavas wrote:
> 2011.06.21 20:51 Reindl Harald wrote:
>>> UTF-8 is a strict format. If you expect UTF-8 and someone submits
>>> something else, you can tell that without any string function. You can
>>> verify UTF-8 strings in pcre. You can convert nbspace to a regular space,
>>> if you want. UTF-8 does not have any byte sequence that can collide with
>>> the nbspace byte sequence.
>>
>> Show me a practical way to detect whether some input data is UTF-8.
>> The mbstring functions are out of the question, because there are many
>> servers, even at really big companies, where they are not available.
> 
> :) I said pcre, not mbstring. If you had read the fine UTF-8 manual, like I
> did about 8 years ago, you would know how to detect 8-bit input that is
> not UTF-8. UTF-8 is a variable-length encoding which has very
> specific rules about the way bytes are arranged. You can tell the length of
> a character in bytes from its first byte. You can tell what kind of byte values
> should be used for the second, third or fourth byte. If you
> eliminate all valid multi-byte UTF-8 sequences and still have 8-bit data
> left, it is not UTF-8.

I do not understand a word of that. What I miss is a simple str_is_utf8(),
or whatever you want to call it, that does this natively and fast on a given
variable and would make it possible to stop a script on unexpected input
without degrading performance.
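
For illustration, the pcre approach mentioned in the quoted text could be
wrapped roughly like this. It is only a sketch: str_is_utf8() and
nbsp_to_space() are hypothetical names, not PHP built-ins, and it assumes
PHP's PCRE was built with UTF-8 support, so that the /u modifier is
available.

```php
<?php
// Hypothetical helper: str_is_utf8() is not a PHP built-in.
function str_is_utf8($value)
{
    if (!is_string($value)) {
        return false;
    }
    // With the /u modifier PCRE validates the subject as UTF-8 before
    // matching. The empty pattern matches any valid string (result 1);
    // on malformed input preg_match() returns false.
    return preg_match('//u', $value) === 1;
}

// Hypothetical helper: replace U+00A0 (bytes C2 A0 in UTF-8) with a
// plain space. Only meaningful on input that already passed the check.
function nbsp_to_space($value)
{
    return str_replace("\xC2\xA0", ' ', $value);
}

$samples = array("caf\xC3\xA9", "caf\xE9", "a\xC2\xA0b");
foreach ($samples as $sample) {
    if (!str_is_utf8($sample)) {
        // A real script could stop here instead of skipping the value.
        continue;
    }
    $clean = nbsp_to_space($sample);
}
```

The check runs inside the pcre extension and needs no mbstring. The second
sample is Latin-1 and is rejected, because the byte E9 announces a
three-byte sequence that never follows.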

