Hi,

Please see the more detailed analysis in bug #555922, which I filed against libc because sed's regex implementation is based on, or is a copy of, libc's, and because this bug affects many more packages.
In my opinion, the proper solution for sed would be:

1. The --binary option should put sed into a true binary mode, without any knowledge of UTF-8 or any other multibyte encoding. This would allow processing binary files without any UTF-8 logic, and it would allow direct manipulation of ill-formed UTF-8 sequences in cases where that is required. The Unicode standard notes one such situation ([1], page 62):

    For example, a UTF-8 file could have had CRLF sequences introduced at every 80 bytes by a bad mailer program. This could result in some UTF-8 byte sequences being interrupted by CRLFs, producing illegal byte sequences. This mangled text is no longer UTF-8. It is permissible for a conformant program to repair such text, recognizing that the mangled text was originally well-formed UTF-8 byte sequences.

   With a "true" binary mode, one could write a sed script for exactly this purpose.

2. Otherwise, if the input encoding is UTF-8, all input text should be processed with a conformant UTF-8 decoder, and every ill-formed sequence should be replaced with the replacement character (U+FFFD). The result is then passed to the current regex implementation. We would then have a guarantee that a sed script processes any input in a predictable way. It could even match ill-formed sequences by matching the replacement character.

[1] http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf

Best regards,
Dmitri Gribenko

-- 
main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if (j){printf("%d\n",i);}}} /*Dmitri Gribenko <griboz...@gmail.com>*/
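
P.S. To make the two proposed modes concrete, here is a minimal sketch in Python rather than sed. It is only an illustration of the intended behaviour, not a proposed patch: the function names (repair_crlf, decode_lossy, find_ill_formed) and the sample bytes are made up for this example, and Python's built-in "replace" error handler stands in for the conformant decoder mentioned in point 2.

```python
import re

# "é" is C3 A9 in UTF-8; a bad mailer has split it with a CRLF.
MANGLED = b"\xc3\r\n\xa9"


def repair_crlf(raw: bytes) -> bytes:
    """Point 1 (true binary mode): operate on raw bytes with no UTF-8 logic.

    Removing the inserted CRLFs rejoins the interrupted byte sequences.
    """
    return re.sub(rb"\r\n", b"", raw)


def decode_lossy(raw: bytes) -> str:
    """Point 2 (UTF-8 mode): decode, replacing ill-formed sequences with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


def find_ill_formed(text: str) -> list[int]:
    """Match what used to be ill-formed input by matching the replacement character."""
    return [m.start() for m in re.finditer("\ufffd", text)]


if __name__ == "__main__":
    # Binary mode repairs the mangled text back into well-formed UTF-8.
    print(repair_crlf(MANGLED).decode("utf-8"))
    # UTF-8 mode never feeds ill-formed bytes to the regex engine;
    # the script sees U+FFFD at each position instead.
    print(find_ill_formed(decode_lossy(b"ab\xffcd")))
```

The point of the split is that the regex engine only ever sees one of two things: raw bytes (binary mode) or well-formed text in which every ill-formed sequence has already become a matchable U+FFFD.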