[jira] [Created] (HTTPCORE-757) AbstractCharDataConsumer jams up with incomplete UTF-8 data

Simon White (Jira) Sat, 02 Sep 2023 05:06:05 -0700

Simon White created HTTPCORE-757:
------------------------------------

             Summary: AbstractCharDataConsumer jams up with incomplete UTF-8 data
                 Key: HTTPCORE-757
                 URL: https://issues.apache.org/jira/browse/HTTPCORE-757
             Project: HttpComponents HttpCore
          Issue Type: Bug
            Reporter: Simon White



While streaming UTF-8-encoded data with the async HTTP client, we observed that 
after several minutes of consuming from our stream, the client jammed up 
permanently and did not recover until it was restarted.

Upon closer inspection, we realised that `AbstractCharDataConsumer` (which we 
were extending to parse our data) was receiving an incomplete UTF-8 character 
at the end of the data received so far (i.e. the last character was multi-byte 
and we hadn't yet received all of its bytes). The decoder leaves those trailing 
bytes unconsumed, so `src.hasRemaining()` never becomes false and the consumer 
goes into an infinite loop in the following code:
{code:java}
@Override
public final void consume(final ByteBuffer src) throws IOException {
    final CharsetDecoder charsetDecoder = getCharsetDecoder();
    while (src.hasRemaining()) {
        checkResult(charsetDecoder.decode(src, charBuffer, false));
        doDecode(false);
    }
}{code}
This was fairly time-consuming to figure out and required us to go deep into 
the brain of the library.
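
The behaviour behind the loop can be reproduced with the JDK alone. The sketch 
below is only an illustration (the class and method names are made up and are 
not part of HttpCore): it decodes a buffer that ends in the middle of a 
three-byte character with `endOfInput` set to false and reports how many bytes 
the decoder left behind.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of HttpCore.
final class IncompleteUtf8Demo {

    /** Decodes once with endOfInput=false and returns the number of unconsumed bytes. */
    static int remainingAfterDecode(final byte[] bytes) {
        final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        final ByteBuffer src = ByteBuffer.wrap(bytes);
        final CharBuffer dst = CharBuffer.allocate(16);
        final CoderResult result = decoder.decode(src, dst, false);
        // An incomplete trailing sequence is reported as UNDERFLOW, not as an error,
        // so a check that only rejects error results lets it pass.
        if (!result.isUnderflow()) {
            throw new IllegalStateException("Unexpected result: " + result);
        }
        return src.remaining();
    }

    public static void main(final String[] args) {
        // 'a' followed by the first two bytes of U+20AC (full sequence: E2 82 AC)
        final byte[] partial = {0x61, (byte) 0xE2, (byte) 0x82};
        // Prints 2: the two trailing bytes stay in the buffer, so a
        // while (src.hasRemaining()) loop around decode() would never exit.
        System.out.println(remainingAfterDecode(partial));
    }
}
```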

We don't know how this could be improved exactly, but a couple of thoughts:
 * If this class expects a completely valid text string in the buffer with no 
trailing bytes:
 ** Then it should throw some exception once it detects that it's failing to 
completely process the buffer
 ** And the caller could deal with this somehow (either by catching this 
exception and waiting for more data, or otherwise ensuring that the input is 
valid before calling the consumer - though it's not clear how it could do that 
without also having knowledge of the encoding)
 ** Alternatively, the caller could simply bubble up the exception and let us 
know that we shouldn't be using this class when there is only partial data. 
That would also have helped us to diagnose the issue
 * OTOH if this class is expected to be able to handle partially complete input:
 ** Then it should store the trailing unprocessable bytes into a buffer, and 
prepend them to the beginning of the next input (hopefully resulting in a valid 
UTF-8 string, though it would also have to handle the case where it didn't)
 ** This was roughly how we solved the issue on our side - we extended 
`AbstractBinDataConsumer` instead and handled the encoding ourselves
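
The "store the trailing bytes and prepend them to the next input" idea could 
look roughly like the sketch below. It is a minimal illustration using only 
the JDK, not the code we actually run and not a proposed patch: the class 
`CarryOverUtf8Decoder` and its methods are hypothetical names, and malformed 
input is surfaced as an unchecked exception purely to keep the example short.

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of HttpCore.
final class CarryOverUtf8Decoder {

    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    // Bytes of an incomplete character left over from the previous chunk.
    private byte[] carry = new byte[0];

    /** Decodes one chunk and returns the characters that are complete so far. */
    String feed(final byte[] chunk) {
        // Prepend the bytes carried over from the previous call.
        final ByteBuffer src = ByteBuffer.allocate(carry.length + chunk.length);
        src.put(carry).put(chunk).flip();
        // UTF-8 never produces more chars than input bytes, so dst cannot overflow.
        final CharBuffer dst = CharBuffer.allocate(src.remaining());
        final CoderResult result = decoder.decode(src, dst, false);
        if (result.isError()) {
            // Genuinely malformed input (as opposed to merely incomplete input).
            try {
                result.throwException();
            } catch (final CharacterCodingException ex) {
                throw new UncheckedIOException(ex);
            }
        }
        // Whatever the decoder did not consume is an incomplete trailing character.
        carry = new byte[src.remaining()];
        src.get(carry);
        dst.flip();
        return dst.toString();
    }

    /** Call at end of stream; fails if the stream ended in the middle of a character. */
    void finish() {
        if (carry.length > 0) {
            throw new UncheckedIOException(new MalformedInputException(carry.length));
        }
    }
}
```

Feeding the bytes `61 E2 82` and then `AC 62` returns "a" from the first call 
and "€b" from the second, since the two trailing bytes of the first chunk are 
held back until the character is complete.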



