[ https://issues.apache.org/jira/browse/HTTPCORE-757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17761576#comment-17761576 ]
Michael Osipov commented on HTTPCORE-757:
-----------------------------------------

W/o looking into it, I bet you need a four-byte buffer to read bytes until you reach a valid UTF-8 sequence instead of single bytes.

> AbstractCharDataConsumer jams up with incomplete UTF-8 data
> -----------------------------------------------------------
>
>                 Key: HTTPCORE-757
>                 URL: https://issues.apache.org/jira/browse/HTTPCORE-757
>             Project: HttpComponents HttpCore
>          Issue Type: Bug
>    Affects Versions: 5.2.2
>            Reporter: Simon White
>            Priority: Major
>             Fix For: 5.2.3, 5.3-alpha1
>
>
> While streaming UTF-8-encoded data with the async HTTP client, we observed
> the following behaviour:
> * After several minutes of consuming from our stream, the client jammed up
> permanently and did not recover without a restart
>
> Upon closer inspection, we realised that `AbstractCharDataConsumer` (which we
> were extending to parse our data) was receiving incomplete UTF-8 characters
> from the end of the stream (i.e. the last character in the stream was
> multi-byte and we hadn't yet received all bytes for it), and this was causing
> it to go into an infinite loop on the following code:
> {code:java}
> @Override
> public final void consume(final ByteBuffer src) throws IOException {
>     final CharsetDecoder charsetDecoder = getCharsetDecoder();
>     while (src.hasRemaining()) {
>         checkResult(charsetDecoder.decode(src, charBuffer, false));
>         doDecode(false);
>     }
> }{code}
> This was fairly time-consuming to figure out and required us to go deep into
> the brain of the library.
> We don't know how this could be improved exactly, but a couple of thoughts:
> * If this class expects a completely valid text string in the buffer with no
> trailing bytes:
> ** Then it should throw some exception once it detects that it's failing to
> completely process the buffer
> ** And the caller could deal with this somehow (either by catching this
> exception and waiting for more data, or otherwise ensuring that the input is
> valid before calling the consumer - though it's not clear how it could do
> that without also having knowledge of the encoding)
> ** Alternatively, the caller could simply bubble up the exception and let us
> know that we shouldn't be using this class when there is only partial data.
> That would also have helped us to diagnose the issue
> * OTOH if this class is expected to be able to handle partially complete
> input:
> ** Then it should store the trailing unprocessable bytes into a buffer, and
> prepend them to the beginning of the next input (hopefully resulting in a
> valid UTF-8 string, though it would also have to handle the case where it
> didn't)
> ** This was roughly how we solved the issue on our side - we extended
> `AbstractBinDataConsumer` instead and handled the encoding ourselves
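
The second option in the issue (store the trailing unprocessable bytes and prepend them to the next input) could look roughly like the sketch below. It is not the fix that went into HttpCore, and `CarryOverUtf8Decoder` and its `decode` method are hypothetical names invented for illustration; only the `java.nio` and `java.nio.charset` calls are real JDK API. The sketch relies on `CharsetDecoder.decode(in, out, false)` returning UNDERFLOW and leaving an incomplete trailing sequence unconsumed, which is the same condition that keeps `src.hasRemaining()` true in the loop quoted above.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of HttpCore: decodes UTF-8 chunk by chunk and
// carries an incomplete trailing sequence over to the next chunk.
final class CarryOverUtf8Decoder {

    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    // A UTF-8 sequence is at most 4 bytes long, so at most 3 bytes are ever left over.
    private final ByteBuffer pending = ByteBuffer.allocate(4);

    String decode(final ByteBuffer src, final boolean endOfStream) throws CharacterCodingException {
        // Prepend the leftover bytes of the previous chunk; src is always fully drained.
        pending.flip();
        final ByteBuffer in = ByteBuffer.allocate(pending.remaining() + src.remaining());
        in.put(pending).put(src).flip();
        pending.clear();

        // UTF-8 never decodes to more chars than it has bytes.
        final CharBuffer out = CharBuffer.allocate(in.remaining());
        CoderResult result = decoder.decode(in, out, endOfStream);
        if (result.isError()) {
            result.throwException(); // malformed input, or a truncated sequence at end of stream
        }
        if (endOfStream) {
            result = decoder.flush(out);
            if (result.isError()) {
                result.throwException();
            }
            decoder.reset();
        } else {
            // UNDERFLOW with bytes remaining: an incomplete sequence, kept for the next call.
            pending.put(in);
        }
        out.flip();
        return out.toString();
    }

    public static void main(final String[] args) throws CharacterCodingException {
        final CarryOverUtf8Decoder d = new CarryOverUtf8Decoder();
        // "h" followed by U+20AC (bytes E2 82 AC), split in the middle of the 3-byte sequence.
        final String first = d.decode(ByteBuffer.wrap(new byte[] {0x68, (byte) 0xE2}), false);
        final String second = d.decode(ByteBuffer.wrap(new byte[] {(byte) 0x82, (byte) 0xAC}), true);
        System.out.println((first + second).equals("h\u20ac")); // prints true
    }
}
```

Because the input buffer is always drained into either the output or the small carry-over buffer, a caller that loops on `src.hasRemaining()` terminates. A stream that really does end in the middle of a sequence surfaces as a `CharacterCodingException` on the final call rather than as a stall.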