Hi,

Please see the more detailed analysis in bug #555922, which I filed
against libc (sed's regex implementation is based on, or is a copy of,
libc's, so this bug affects many more packages).

In my opinion, the proper solution for sed would be:
1. The --binary option should put sed into a true binary mode, without
any knowledge of UTF-8 or any other multibyte encoding.  This would
allow binary files to be processed without any UTF-8 logic.  It would
also allow direct manipulation of ill-formed UTF-8 sequences in cases
where that is required.  The Unicode standard notes one such situation
([1], page 62):

For example, a UTF-8 file could have had CRLF sequences introduced at
every 80 bytes by a bad mailer program. This could result in some
UTF-8 byte sequences being interrupted by CRLFs, producing illegal
byte sequences. This mangled text is no longer UTF-8. It is
permissible for a conformant program to repair such text, recognizing
that the mangled text was originally well-formed UTF-8 byte sequences.

With a "true" binary mode, one could write a sed script for exactly
this purpose.

2. Otherwise, if the input encoding is UTF-8, all input text should be
processed with a conformant UTF-8 decoder, and all ill-formed sequences
should be replaced with the replacement character (U+FFFD).  The result
is then passed to the current regex implementation.  We would then have
a guarantee that a sed script processes any input in a predictable way.
It could even match ill-formed sequences by matching the replacement
character.
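
To make the two modes concrete, here is a rough sketch in Python (not
sed code, and not how sed or libc does anything today).  The function
names are made up for illustration.  The first function works on raw
bytes only, as a true binary mode would, and removes a CRLF that splits
a multibyte sequence; it is a heuristic and only catches CRLFs that land
inside a sequence.  The second decodes the input and maps each
ill-formed sequence to U+FFFD, which a pattern can then match.

```python
import re

# Mode 1 (hypothetical "true binary" behaviour): no decoding at all.
# A CRLF directly followed by a continuation byte (0x80-0xBF) cannot
# occur in well-formed UTF-8, so when it also follows a non-ASCII byte
# we assume the mailer inserted it and drop it.
SPLIT_CRLF = re.compile(rb"(?<=[\x80-\xff])\r\n(?=[\x80-\xbf])")

def repair_split_sequences(data: bytes) -> bytes:
    """Remove CRLFs that interrupt a multibyte UTF-8 sequence."""
    return SPLIT_CRLF.sub(b"", data)

# Mode 2 (hypothetical UTF-8 behaviour): decode first, replacing every
# ill-formed sequence with U+FFFD, then hand the text to the regexes.
def decode_with_replacement(data: bytes) -> str:
    """Decode UTF-8, mapping ill-formed input to U+FFFD."""
    return data.decode("utf-8", errors="replace")

# "café" with a CRLF inserted in the middle of the two-byte "é".
mangled = b"caf\xc3\r\n\xa9 au lait\r\n"
repaired = repair_split_sequences(mangled)

# A stray 0xFF byte is never valid in UTF-8.
text = decode_with_replacement(b"ab\xffcd")
has_ill_formed = re.search("\ufffd", text) is not None
```

The CRLF at the end of the line in `mangled` is left alone, since
nothing shows it was inserted in the middle of a sequence.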

[1] http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf

Best regards,
Dmitri Gribenko

-- 
main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
(j){printf("%d\n",i);}}} /*Dmitri Gribenko <griboz...@gmail.com>*/


