[jira] [Commented] (HTTPCORE-757) AbstractCharDataConsumer jams up with incomplete UTF-8 data

Michael Osipov (Jira) Sun, 03 Sep 2023 03:25:04 -0700


    [ 
https://issues.apache.org/jira/browse/HTTPCORE-757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17761576#comment-17761576
 ]


Michael Osipov commented on HTTPCORE-757:
-----------------------------------------

Without looking into it, I bet you need a four-byte buffer that collects 
bytes until they form a valid UTF-8 sequence, instead of handling single bytes.
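
To illustrate the idea, here is a minimal sketch of such a carry-over buffer. 
It is not the HttpCore fix: CarryOverDecoder, result() and pendingBytes() are 
hypothetical names made up for this example, and it assumes UTF-8 only, so 
at most three bytes of an incomplete sequence can be pending at a time.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of HttpCore: decodes UTF-8 chunks and carries
// an incomplete trailing sequence over to the next chunk.
final class CarryOverDecoder {

    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    // A UTF-8 sequence is at most 4 bytes long, so at most 3 bytes can be left over.
    private final ByteBuffer pending = ByteBuffer.allocate(4);
    private final StringBuilder out = new StringBuilder();

    void consume(final ByteBuffer src) throws CharacterCodingException {
        // Prepend the bytes left over from the previous chunk to the new input.
        final ByteBuffer in = ByteBuffer.allocate(pending.position() + src.remaining());
        pending.flip();
        in.put(pending).put(src).flip();
        pending.clear();

        // UTF-8 never yields more UTF-16 chars than input bytes, so this cannot overflow.
        final CharBuffer chars = CharBuffer.allocate(in.remaining());
        final CoderResult result = decoder.decode(in, chars, false);
        if (result.isError()) {
            result.throwException(); // malformed input, as opposed to merely incomplete
        }
        chars.flip();
        out.append(chars);

        // Whatever the decoder did not read is an incomplete sequence: keep it.
        pending.put(in);
    }

    String result() {
        return out.toString();
    }

    int pendingBytes() {
        return pending.position();
    }
}
```

Feeding the three bytes of the euro sign split across two calls should leave 
two bytes pending after the first call and produce the character after the 
second. A real fix would also have to decide what to do with bytes still 
pending when the stream ends.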

> AbstractCharDataConsumer jams up with incomplete UTF-8 data
> -----------------------------------------------------------
>
>                 Key: HTTPCORE-757
>                 URL: https://issues.apache.org/jira/browse/HTTPCORE-757
>             Project: HttpComponents HttpCore
>          Issue Type: Bug
>    Affects Versions: 5.2.2
>            Reporter: Simon White
>            Priority: Major
>             Fix For: 5.2.3, 5.3-alpha1
>
>
> While streaming UTF-8-encoded data with the async HTTP client, we observed 
> the following behaviour:
>  * After several minutes of consuming from our stream, the client jammed up 
> permanently and did not recover without a restart
> Upon closer inspection, we realised that `AbstractCharDataConsumer` (which we 
> were extending to parse our data) was receiving incomplete UTF-8 characters 
> from the end of the stream (i.e. the last character in the stream was 
> multi-byte and we hadn't yet received all bytes for it), and this was causing 
> it to go into an infinite loop on the following code:
> {code:java}
> @Override
> public final void consume(final ByteBuffer src) throws IOException {
>     final CharsetDecoder charsetDecoder = getCharsetDecoder();
>     while (src.hasRemaining()) {
>         checkResult(charsetDecoder.decode(src, charBuffer, false));
>         doDecode(false);
>     }
> }{code}
> This was fairly time-consuming to figure out and required us to go deep into 
> the brain of the library.
> We don't know how this could be improved exactly, but a couple of thoughts:
>  * If this class expects a completely valid text string in the buffer with no 
> trailing bytes:
>  ** Then it should throw some exception once it detects that it's failing to 
> completely process the buffer
>  ** And the caller could deal with this somehow (either by catching this 
> exception and waiting for more data, or otherwise ensuring that the input is 
> valid before calling the consumer - though it's not clear how it could do 
> that without also having knowledge of the encoding)
>  ** Alternatively, the caller could simply bubble up the exception and let us 
> know that we shouldn't be using this class when there is only partial data. 
> That would also have helped us to diagnose the issue
>  * OTOH if this class is expected to be able to handle partially complete 
> input:
>  ** Then it should store the trailing unprocessable bytes into a buffer, and 
> prepend them to the beginning of the next input (hopefully resulting in a 
> valid UTF-8 string, though it would also have to handle the case where it 
> didn't)
>  ** This was roughly how we solved the issue on our side - we extended 
> `AbstractBinDataConsumer` instead and handled the encoding ourselves
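
For reference, the stall described in the report can be reproduced with the 
JDK decoder alone. The sketch below is only an illustration: 
TruncatedUtf8Demo and stalls() are hypothetical names, and it assumes the 
consumer keeps calling decode(src, dst, false) for as long as 
src.hasRemaining() is true, as in the loop quoted above.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of HttpCore.
final class TruncatedUtf8Demo {

    // Returns true if a second decode call makes no progress on the input,
    // which is the condition under which a "while (src.hasRemaining())" loop spins.
    static boolean stalls(final byte[] input) {
        final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        final ByteBuffer src = ByteBuffer.wrap(input);
        final CharBuffer dst = CharBuffer.allocate(input.length);

        // First pass decodes every complete sequence and stops at the truncated one.
        decoder.decode(src, dst, false);
        final int left = src.remaining();

        // Second pass: with endOfInput=false the decoder waits for more bytes.
        final CoderResult second = decoder.decode(src, dst, false);
        return left > 0 && second.isUnderflow() && src.remaining() == left;
    }
}
```

With the input {'a', 0xE2, 0x82} (an 'a' followed by the first two bytes of 
a three-byte sequence) stalls() should return true; with a complete sequence 
it should return false, because nothing is left in the buffer.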



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@hc.apache.org
For additional commands, e-mail: dev-h...@hc.apache.org
