On 2014-11-28 23:31:49 +0900, Norihiro Tanaka wrote:
> Thanks for the patch. However, it seems to me that valid_utf() in PCRE
> also considers 5- and 6-byte characters.
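
For reference: RFC 3629 limits UTF-8 to sequences of at most 4 bytes
encoding U+0000..U+10FFFF, surrogates excluded; the 5- and 6-byte forms
come from the older RFC 2279 definition. Below is a minimal sketch of
such a strict check. The name strict_utf8_len is hypothetical, and this
is neither the code from the patch nor PCRE's valid_utf(); it only
illustrates which sequences count as UTF-8 encoded Unicode characters.

```c
#include <stddef.h>

/* Hypothetical helper (not from grep or PCRE): return the length (1..4)
   of the well-formed UTF-8 sequence at the start of s[0..n-1], or 0 if
   the bytes there do not encode a Unicode character.  */
static size_t
strict_utf8_len (const unsigned char *s, size_t n)
{
  size_t len, i;
  unsigned long cp, min;

  if (n == 0)
    return 0;
  if (s[0] < 0x80)
    return 1;                       /* ASCII */

  if (s[0] >= 0xC2 && s[0] <= 0xDF)
    { len = 2; cp = s[0] & 0x1F; min = 0x80; }
  else if ((s[0] & 0xF0) == 0xE0)
    { len = 3; cp = s[0] & 0x0F; min = 0x800; }
  else if (s[0] >= 0xF0 && s[0] <= 0xF4)
    { len = 4; cp = s[0] & 0x07; min = 0x10000; }
  else
    return 0;   /* stray continuation byte, 0xC0/0xC1, or 0xF5..0xFF,
                   which includes the 5- and 6-byte lead bytes */

  if (n < len)
    return 0;                       /* truncated sequence */

  for (i = 1; i < len; i++)
    {
      if ((s[i] & 0xC0) != 0x80)
        return 0;                   /* not a continuation byte */
      cp = (cp << 6) | (s[i] & 0x3F);
    }

  if (cp < min)
    return 0;                       /* overlong encoding */
  if (cp >= 0xD800 && cp <= 0xDFFF)
    return 0;                       /* UTF-16 surrogate */
  if (cp > 0x10FFFF)
    return 0;                       /* beyond the Unicode range */
  return len;
}
```

With a check of this kind, a 5- or 6-byte sequence is rejected at its
first byte (0xF8..0xFD), whatever the library does with it.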

In any case, even if PCRE considers these sequences as valid UTF-8,
they shouldn't match because they are not part of Unicode (if they
can match, this would be a bug in libpcre). My patch considers that
these sequences do not match, which is consistent with the expected
behavior.

> IMHO, we assume that grep doesn't know how valid_utf() checks an
> input text, although we know PCRE checks whether an input text is
> valid UTF-8 or not, so that grep should keep working even when PCRE
> changes the behaviour of valid_utf().
>
> If we do not check for invalid utf8 characters with valid_utf8() in
> advance, grep may dump core with PCRE_NO_UTF8_CHECK.
> See http://debbugs.gnu.org/cgi/bugreport.cgi?bug=16586
>
> So we cannot avoid checking for invalid utf8 characters with
> valid_utf8(). Furthermore, we must perform the check as PCRE expects,
> but grep does not know how PCRE checks for invalid utf8 characters,
> due to the above assumption.

What matters is whether a sequence corresponds to a valid UTF-8
encoded Unicode character. My patch ensures that pcre_exec is called
on a string with only such characters, which implies that this is
also valid UTF-8 for PCRE (whether Unicode validity is also considered
in valid_utf8() or not). So, there's no valid reason why grep would
crash under such a condition.

-- 
Vincent Lefèvre <vinc...@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)