On 21/09/2020 08:18, Martin Grigorov wrote:
> On Fri, Sep 18, 2020 at 6:16 PM Mark Thomas <ma...@apache.org> wrote:
> 
>> On 18/09/2020 14:07, Martin Grigorov wrote:
>>
>> <snip/>
>>
>>> What is the difference
>>> between org.apache.coyote.http2.StreamStateMachine.State#CLOSED_RX
>>> and org.apache.coyote.http2.StreamStateMachine.State#CLOSED_TX ?
>>
>> Compare the parameters used to construct the enums.
>>
>>> I read some parts of https://tools.ietf.org/html/rfc7540 but I don't see
>>> anything related to two types of CLOSED state.
>>
>> Section 5.1. Definition of the closed state (page 20) explains the
>> difference between the two.
>>
> 
> Still I do not understand what RX and TX stand for. But this is not
> important.

TX - transmit

RX - receive
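
To sketch the distinction (a simplified model only, not Tomcat's actual
StreamStateMachine.State enum, whose constructor takes more parameters):
CLOSED_RX means the peer closed the stream, so per RFC 7540 section 5.1
only PRIORITY may legitimately arrive afterwards; CLOSED_TX means we
closed it, so frames the peer sent before seeing our close may still
arrive and have to be tolerated.

import java.util.EnumSet;
import java.util.Set;

// Simplified model only. It illustrates why the two CLOSED states permit
// different frame types; the names and constructor shape are
// approximations, not the real org.apache.coyote.http2 code.
class ClosedStateSketch {

    enum FrameType { DATA, HEADERS, PRIORITY, RST, SETTINGS, PUSH_PROMISE,
                     PING, GOAWAY, WINDOW_UPDATE, CONTINUATION }

    enum State {
        // Closed because the peer sent END_STREAM / RST_STREAM. Anything
        // other than PRIORITY received now is a STREAM_CLOSED error
        // (RFC 7540, 5.1).
        CLOSED_RX(FrameType.PRIORITY),

        // Closed because we sent END_STREAM / RST_STREAM. Frames the peer
        // sent before it saw our close may still arrive and are tolerated.
        CLOSED_TX(FrameType.PRIORITY, FrameType.RST, FrameType.WINDOW_UPDATE);

        private final Set<FrameType> permitted;

        State(FrameType first, FrameType... rest) {
            this.permitted = EnumSet.of(first, rest);
        }

        boolean isFrameTypePermitted(FrameType frameType) {
            return permitted.contains(frameType);
        }
    }
}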


> The following patch fixes the issue for me/Vegeta:
> 
> @@ -1570,12 +1571,15 @@ class Http2UpgradeHandler extends AbstractStream implements InternalHttpUpgradeH
> 
>      @Override
>      public void reset(int streamId, long errorCode) throws Http2Exception {
> -        Stream stream = getStream(streamId, true);
> -        boolean active = stream.isActive();
> -        stream.checkState(FrameType.RST);
> -        stream.receiveReset(errorCode);
> -        if (active) {
> -            activeRemoteStreamCount.decrementAndGet();
> +        boolean unknownIsError = Http2Error.CANCEL.getCode() != errorCode;
> +        Stream stream = getStream(streamId, unknownIsError);
> +        if (stream != null) {
> +            boolean active = stream.isActive();
> +            stream.checkState(FrameType.RST);
> +            stream.receiveReset(errorCode);
> +            if (active) {
> +                activeRemoteStreamCount.decrementAndGet();
> +            }
>          }
>      }
> 
> I.e. do not throw an error if the remote peer is trying to cancel an
> already closed stream.

RFC 7540 allows the connection to be closed with a protocol error if the
reset is received "a significant time after sending END_STREAM".

Tomcat retains state for up to maxConcurrentStreams + 10% streams, with
closed streams being pruned (oldest first) every 10 new streams.
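
As a rough sketch of that policy (illustrative names only, not the actual
Http2UpgradeHandler fields or methods, and assuming
maxConcurrentStreams = 100):

import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the retention/pruning policy described above. The real
// Http2UpgradeHandler differs in detail; the names here are made up.
class StreamRetentionSketch {

    private final int maxConcurrentStreams = 100;             // example value
    private final int retentionLimit =
            maxConcurrentStreams + maxConcurrentStreams / 10; // ~110 streams

    private final Deque<Integer> closedStreamIds = new ArrayDeque<>(); // oldest first
    private int newStreamCount = 0;

    void onStreamClosed(int streamId) {
        closedStreamIds.addLast(streamId);
    }

    void onNewStream() {
        // Every 10th new stream, prune the oldest closed streams until we
        // are back under the retention limit.
        if (++newStreamCount % 10 == 0) {
            while (closedStreamIds.size() > retentionLimit) {
                closedStreamIds.pollFirst();
            }
        }
    }
}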

It isn't clear at this point why the client is sending the RST_STREAM.
That normally indicates an error condition, and from the description of
the load test I don't see why the client would send one.

If the RST_STREAM is valid and it isn't received a significant time
after the END_STREAM then Tomcat needs to handle this.

If the RST_STREAM is valid but received a significant amount of time
after the END_STREAM, that would be a client error.

Of course, significant isn't defined in the specification.

Any special handling for RST_STREAM also needs to be applied to
WINDOW_UPDATE. It also needs to differentiate between a frame received
for a past closed stream and a frame received for a stream that never
existed.
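
A minimal sketch of that differentiation, assuming the handler tracks the
highest client-initiated stream id seen so far (which is not necessarily
how Http2UpgradeHandler actually does it):

// Sketch only, not the actual Http2UpgradeHandler logic. Client-initiated
// stream ids are odd and strictly increasing (RFC 7540, 5.1.1), so an
// unknown id at or below the highest id seen must belong to a stream that
// existed and has since been closed and pruned, while a higher id belongs
// to a stream that was never opened.
class UnknownStreamClassifier {

    enum Kind { PAST_CLOSED_STREAM, STREAM_NEVER_EXISTED }

    Kind classify(int frameStreamId, int highestClientStreamIdSeen) {
        return frameStreamId <= highestClientStreamIdSeen
                ? Kind.PAST_CLOSED_STREAM
                : Kind.STREAM_NEVER_EXISTED;
    }
}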

I think the key question here is why the client is sending the
RST_STREAM in the first place. Is Tomcat doing something it shouldn't /
not doing something it should to cause this? If so, we need to fix the
root cause rather than tackle the symptom.

There are a few options for debugging this to see why the client is
sending the RST_STREAM:
- Enable debug logging for HTTP/2 in Tomcat. This is very verbose and,
  if the root cause is threading related, typically changes the timing
  enough to mask the issue.
- Client side logging (I don't know what the capabilities are here).
- JVM level TLS debug logging (even more verbose than Tomcat's HTTP/2
  debug and harder to work with).
- Wireshark (or similar).

I'd recommend Wireshark. I don't think JSSE supports SSLKEYLOGFILE so
I'd use the APR/Native connector with SSLKEYLOGFILE to get the cleartext
in Wireshark.

> With this change and Vegeta's -max-workers=100 I can get 12 K reqs/sec.

That is odd. If the client is sending RST_STREAM messages then I'd
expect to see requests failing as RST_STREAM indicates something has
gone wrong or the client has aborted the request.

> With more workers it starts failing with:
> 
> "status_codes": {
>     "0": 1000,
>     "200": 1
>   },
>   "errors": [
>     "Get \"https://localhost:18080/testbed/plaintext\": net/http: request
> canceled while waiting for connection (Client.Timeout exceeded while
> awaiting headers)",
>     "Get \"https://localhost:18080/testbed/plaintext\": context deadline
> exceeded",
>     "Get \"https://localhost:18080/testbed/plaintext\": context deadline
> exceeded (Client.Timeout exceeded while awaiting headers)"
>   ]
> i.e. out of 1001 requests only one succeeds and the others fail with
> timeout. I will try to debug this one too and run a profiler too.

I'd probably start by enabling Tomcat's HTTP/2 debug logging for this
one. I'm wondering if something is triggering a connection level error
early in the processing.
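
For reference, and assuming the default JULI setup, that logging can be
enabled with something like the following in conf/logging.properties
(FINE is the debug level; expect a lot of output):

# Enable HTTP/2 debug logging (very verbose)
org.apache.coyote.http2.level = FINE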

Mark
