On Sat, Jul 21, 2018 at 10:14 AM Rasmus Lerdorf <ras...@lerdorf.com> wrote:
> Other than the autoincrement they are identical. I normally use utf8mb4,
> but I figured I would play it safe and copy it over verbatim. I guess it
> wasn't safe.
>

Right. There are risks.

For example, encodings such as SJIS contain "\" as part of valid
characters. When encodings are mixed, escaping can be defeated and
injections become possible.

Even when UTF-8 is used, inconsistent handling of invalid encoding can
break security measures, e.g. an invalid UTF-8 sequence that is missing
the last byte of a multibyte character. When sanitization is required,
programmers have two choices:

 - Remove all bytes indicated by the MSBs of the UTF-8 first byte,
   i.e. also consume the byte that sits where the missing last byte
   should be.
 - Remove only the bytes that are invalid as UTF-8, i.e. leave the
   following ASCII char, for example.

If these designs are mixed, an encoding attack is also possible.

DoS by invalid chars is trivial. Current web browsers can refuse to
render an entire page that has badly broken encoding.

The only good countermeasure against encoding attacks is encoding
validation with the Fail Fast principle, i.e. validate the encoding at
the application software's outermost trust boundary.

> I'll do some research, but ideas welcome.
>

IMO, all data should be converted to valid UTF-8, since we use UTF-8 as
the bugs.php.net encoding. Replace invalid data with "?" or something
else. Some data will be lost, but valid char encoding is mandatory for
correct data handling, as described above.

To replace invalid chars with "?" (or something else),
mb_convert_encoding() can be used. The "mbstring.substitute_character"
INI setting specifies the replacement char. Default is none, so it
removes invalid data by default.

If you would like to keep the original data, the number of detected
invalid chars is recorded and can be retrieved from the array returned
by mb_get_info(). "illegal_chars" is the "Total number of illegal chars
in the script's lifetime". By checking this value before and after a
conversion, the existence of invalid chars can be detected.
(Alternatively, simply comparing the original and the converted data
works also.)

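A minimal sketch of the conversion described above, not a tested
migration script. It uses the simpler "compare/check the original"
approach rather than the "illegal_chars" counter. The function name
sanitize_utf8() and the $rows array are hypothetical stand-ins for
whatever reads the values from the database.

```php
<?php
// Sketch only: $rows stands in for column values read from the database.

// Substitute invalid sequences with "?" (code point 0x3F) instead of
// silently dropping them.
mb_substitute_character(0x3F);

/**
 * Hypothetical helper.
 * Returns [converted string, true if the input held invalid UTF-8].
 */
function sanitize_utf8(string $raw): array
{
    if (mb_check_encoding($raw, 'UTF-8')) {
        return [$raw, false];   // already valid, keep the bytes untouched
    }
    // A UTF-8 to UTF-8 conversion makes mbstring apply the substitute
    // character to every invalid sequence.
    return [mb_convert_encoding($raw, 'UTF-8', 'UTF-8'), true];
}

$rows = ["plain ascii", "caf\xC3\xA9", "bad\xFFbyte"];

$invalid = 0;
$converted = [];
foreach ($rows as $raw) {
    [$clean, $wasInvalid] = sanitize_utf8($raw);
    if ($wasInvalid) {
        $invalid++;
        // base64_encode($raw) could be stored next to $clean here if
        // the original bytes must be kept.
    }
    $converted[] = $clean;
}

echo "rows with invalid UTF-8: $invalid\n";
```

Running only the counting part first (without writing anything back)
gives an idea of how much data would be affected before converting.
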
You might want to count the number of all illegal chars in the DB
before converting the data. "illegal_chars" is handy for this. The old
data may be attached to the converted data using base64 if necessary.

Regards,

--
Yasuo Ohgaki
yohg...@ohgaki.net