Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

Costello, Roger L. via Unicode Mon, 24 Jul 2017 07:42:06 -0700

Hello Unicode Experts!

Suppose an application splits a UTF-8 multi-octet sequence. The application 
then sends the split sequence to a client. The client must restore the original 
sequence.


Question: is it possible to split a UTF-8 multi-octet sequence in such a way 
that the client cannot unambiguously restore the original sequence?

Here is the source of my question:

The iCalendar specification [RFC 5545] says that long lines should be folded:

        Long content lines SHOULD be split
        into a multiple line representations
        using a line "folding" technique.
        That is, a long line can be split between
        any two characters by inserting a CRLF
        immediately followed by a single linear
        white-space character (i.e., SPACE or HTAB).
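
For illustration, a minimal sketch of a fold that only ever splits between characters might look like the following. The function name `fold` is hypothetical. The 75-octet default is the line length RFC 5545 recommends elsewhere in the same section, not something stated in the passage quoted above. The sketch also assumes the content line is already available as a decoded string.

```python
def fold(line: str, limit: int = 75) -> bytes:
    """Fold a content line so that no physical line exceeds `limit` octets
    and no fold lands inside a UTF-8 multi-octet sequence (a sketch)."""
    data = line.encode("utf-8")
    chunks = []
    start = 0
    room = limit  # the first physical line has no leading SPACE
    while len(data) - start > room:
        cut = start + room
        # Octets of the form 10xxxxxx are continuation octets; back up
        # until the cut sits in front of the first octet of a character.
        while cut > start and (data[cut] & 0xC0) == 0x80:
            cut -= 1
        chunks.append(data[start:cut])
        start = cut
        room = limit - 1  # continuation lines spend one octet on the SPACE
    chunks.append(data[start:])
    # Each fold is a CRLF immediately followed by a single SPACE.
    return b"\r\n ".join(chunks)
```

A "very simple" implementation would skip the inner `while` loop and cut at a fixed octet count, which is how a fold can end up in the middle of a multi-octet sequence.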

The RFC says that, when parsing a content line, folded lines must first be 
unfolded using this technique:

        Unfolding is accomplished by removing
        the CRLF and the linear white-space
        character that immediately follows.
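
Read literally, the unfolding rule can be applied to the raw octets before any UTF-8 decoding takes place. A minimal sketch, with the hypothetical name `unfold`:

```python
import re

def unfold(raw: bytes) -> bytes:
    """Remove every CRLF that is immediately followed by SPACE or HTAB,
    together with that one white-space octet (a sketch of the quoted rule)."""
    return re.sub(rb"\r\n[ \t]", b"", raw)

# A CRLF that is not followed by SPACE or HTAB ends the content line
# and is left in place.
example = b"DESC\r\n RIPTION:Lunch\r\n\t at noon\r\nEND:VEVENT"
print(unfold(example))
```

Only one white-space octet is removed per fold, so a space that belongs to the content (as in " at noon" above) survives.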

The RFC acknowledges that simple implementations might generate improperly 
folded lines:

        Note: It is possible for very simple
        implementations to generate improperly
        folded lines in the middle of a UTF-8
        multi-octet sequence.  For this reason,
        implementations need to unfold lines
        in such a way to properly restore the
        original sequence.
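
To make the note concrete, here is a small sketch of a fold that lands inside the two-octet sequence for "é" (C3 A9). The helper names are hypothetical. It shows one case that is recoverable when unfolding is done on octets before decoding. It relies on the fact that CR, LF, SPACE and HTAB are all below 0x80, while every octet of a multi-octet UTF-8 sequence is 0x80 or above.

```python
import re

CRLF = b"\r\n"

def unfold_then_decode(raw: bytes) -> str:
    """Unfold on raw octets first, then decode the rejoined line."""
    return re.sub(rb"\r\n[ \t]", b"", raw).decode("utf-8")

def decode_each_line(raw: bytes) -> list:
    """Decode every physical line on its own, before unfolding."""
    return [part.decode("utf-8") for part in raw.split(CRLF)]

original = "SUMMARY:café".encode("utf-8")   # ends with C3 A9
# An improper fold between the two octets of "é".
folded = original[:-1] + CRLF + b" " + original[-1:]

print(unfold_then_decode(folded))            # the original text comes back

try:
    decode_each_line(folded)
except UnicodeDecodeError as err:
    # The first physical line ends with a lone C3, which is not valid UTF-8.
    print("decoding before unfolding fails:", err.reason)
```

The failure in the second approach comes from the order of operations, not from any loss of information in the folded octets.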

Can you provide an example of folding a UTF-8 multi-octet sequence such that 
there is no unambiguous way to restore the original sequence? 

/Roger
