Simon White created HTTPCORE-757:
------------------------------------
Summary: AbstractCharDataConsumer jams up with incomplete UTF-8 data
Key: HTTPCORE-757
URL: https://issues.apache.org/jira/browse/HTTPCORE-757
Project: HttpComponents HttpCore
Issue Type: Bug
Reporter: Simon White
While streaming UTF-8-encoded data with the async HTTP client, we observed the
following behaviour:
* After several minutes of consuming from our stream, the client jammed up
permanently and did not recover without a restart
Upon closer inspection, we realised that `AbstractCharDataConsumer` (which we
were extending to parse our data) was receiving an incomplete UTF-8 character
at the end of the data received so far (i.e. the last character was multi-byte
and we hadn't yet received all of its bytes), and this was causing it to go
into an infinite loop in the following code:
{code:java}
@Override
public final void consume(final ByteBuffer src) throws IOException {
    final CharsetDecoder charsetDecoder = getCharsetDecoder();
    while (src.hasRemaining()) {
        checkResult(charsetDecoder.decode(src, charBuffer, false));
        doDecode(false);
    }
}{code}
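To illustrate why the loop above may never terminate, here is a minimal standalone sketch using only the JDK (the class and method names are hypothetical and are not part of HttpCore). With `endOfInput` set to `false`, `CharsetDecoder.decode` leaves the bytes of an incomplete trailing sequence unread in the source buffer and returns `UNDERFLOW`, so `src.hasRemaining()` stays true and no progress is made on the next iteration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class IncompleteUtf8Demo {

    // Decodes once with endOfInput=false and returns the number of bytes
    // the decoder left unread in the source buffer.
    static int remainingAfterDecode(final byte[] bytes) {
        final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        final ByteBuffer src = ByteBuffer.wrap(bytes);
        final CharBuffer dst = CharBuffer.allocate(16);
        final CoderResult result = decoder.decode(src, dst, false);
        // UNDERFLOW here means "more input is needed", not "all input was consumed".
        if (!result.isUnderflow()) {
            throw new IllegalStateException("unexpected result: " + result);
        }
        return src.remaining();
    }

    public static void main(final String[] args) {
        // 'a' followed by the first two bytes of the three-byte sequence for U+20AC.
        final byte[] truncated = {'a', (byte) 0xE2, (byte) 0x82};
        System.out.println("bytes left unread: " + remainingAfterDecode(truncated));
    }
}
```

A loop whose only exit condition is `!src.hasRemaining()` cannot finish on such input unless it also checks whether the last `decode` call advanced the buffer position.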
This was fairly time-consuming to figure out and required us to dig deep into
the library's internals.
We don't know how this could be improved exactly, but a couple of thoughts:
* If this class expects a completely valid text string in the buffer with no
trailing bytes:
** Then it should throw some exception once it detects that it's failing to
completely process the buffer
** And the caller could deal with this somehow (either by catching this
exception and waiting for more data, or otherwise ensuring that the input is
valid before calling the consumer - though it's not clear how it could do that
without also having knowledge of the encoding)
** Alternatively, the caller could simply bubble up the exception and let us
know that we shouldn't be using this class when there is only partial data.
That would also have helped us to diagnose the issue
* OTOH if this class is expected to be able to handle partially complete input:
** Then it should store the trailing unprocessable bytes into a buffer, and
prepend them to the beginning of the next input (hopefully resulting in a valid
UTF-8 string, though it would also have to handle the case where it didn't)
** This was roughly how we solved the issue on our side - we extended
`AbstractBinDataConsumer` instead and handled the encoding ourselves
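The second option (keeping the trailing unprocessable bytes and prepending them to the next input) could be sketched roughly as follows. This is a standalone illustration built on the JDK only, not our actual code and not an HttpCore API: the class name `CarryOverUtf8Decoder` and its methods are hypothetical, and it assumes UTF-8 with the decoder's default behaviour of reporting malformed input.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;

public class CarryOverUtf8Decoder {

    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    private final StringBuilder out = new StringBuilder();
    // Bytes of an incomplete trailing sequence, kept between calls.
    private ByteBuffer pending = ByteBuffer.allocate(0);

    // Decodes one chunk; an incomplete trailing sequence is carried over to the next call.
    public void consume(final ByteBuffer src) throws CharacterCodingException {
        // Prepend the bytes left over from the previous call to the new input.
        final ByteBuffer in = ByteBuffer.allocate(pending.remaining() + src.remaining());
        in.put(pending).put(src).flip();
        final CharBuffer chars = CharBuffer.allocate(256);
        while (true) {
            final CoderResult result = decoder.decode(in, chars, false);
            chars.flip();
            out.append(chars);
            chars.clear();
            if (result.isError()) {
                result.throwException(); // genuinely malformed input
            }
            if (result.isUnderflow()) {
                break; // everything decodable has been decoded
            }
            // OVERFLOW: the char buffer was full, so loop again to drain the rest.
        }
        // Whatever is still unread is an incomplete sequence; keep it for later.
        pending = in.slice();
    }

    // Call at end of stream: leftover bytes at this point can never be completed.
    public String finish() throws CharacterCodingException {
        if (pending.hasRemaining()) {
            throw new MalformedInputException(pending.remaining());
        }
        return out.toString();
    }
}
```

Feeding the bytes of "a€b" in two chunks split inside the euro sign, for example `{0x61, 0xE2, 0x82}` followed by `{0xAC, 0x62}`, should yield the complete string from `finish()`, whereas ending the stream after the first chunk raises a `MalformedInputException` instead of hanging.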