Re: [apache/trafficserver] HTTP2 drain (#1710)

Masakazu Kitajo Tue, 13 Jun 2017 08:17:01 -0700

On Tue, Jun 13, 2017 at 12:56 PM, James Peach <jpe...@apache.org> wrote:


>
>
> On 11 Jun 2017, at 4:45, Masakazu Kitajo wrote:
>
> On Sun, Jun 11, 2017 at 1:27 PM, James Peach <jpe...@apache.org> wrote:
>>
>>
>>>
>>> On 9 May 2017, at 0:02, Masakazu Kitajo wrote:
>>>
>>> Hi,
>>>
>>>>
>>>> Don't worry guys. I'm working on it.
>>>>
>>>> I merged the PR to have a draining (shedding) logic for HTTP/2 (Double
>>>> GOAWAY frames), and now I'm generalizing the flag so that we can invoke
>>>> draining on both HTTP/1 and HTTP/2. I'm going to add "Connection: close"
>>>> header automatically for HTTP/1. So, "proxy.config.stop.shutdown_ti
>>>> meout"
>>>> will be available for both versions. The PR was step 1, and it will be
>>>> step
>>>> 2.
>>>>
>>>>
>>> “proxy.config.stop.shutdown_timeout” is still a mis-feature. What really
>>> matters for shutting down is how much traffic you are still serving,
>>> which
>>> is expressed by “proxy.config.restart.active_client_threshold”. I think
>>> you might also want to express a maximum time you are willing to wait,
>>> but
>>> just having a fixed time is not what we should be doing.
>>>
>>
>>
>> I agree the current implementation is not perfect, but that part is what
>> we
>> haven't completed yet. Probably we need to use the both configurations
>> because we can’t wait forever until traffic goes down, and we can use
>> "proxy.config.stop.shutdown_timeout" as
>> "proxy.config.restart.active_client_timeout”.
>>
>
> What are you planning for proxy.config.restart.active_client_timeout to
> do?


It's what you suggested before, and I think it is the same as the one you
(re-)explained below (restart.max_shutdown_wait).


>
>
> I think their intention is the same and the difference is which process use
>> it.
>>
>
> I’d like for “proxy.config.stop.shutdown_timeout” to go away and be
> replaced by something like “proxy.config.restart.max_shutdown_wait” which
> is the maximum time you are willing to wait for the clients to drain. If
> the traffic drains to the acceptable threshold faster, then you get to
> restart faster.
>

The behavior should be like that. I assume you are thinking of doing that
on traffic_ctl side, right? I'm fine with that.

Currently "proxy.config.stop.shutdown_timeout" is read by traffic_server,
and traffic_server just waits for the timeout, and shutsdown. However, my
understanding is that its intension (what we really wanted to do) is not
delaying shutdown for specified time amount. It's buying time so that
active clients complete their transactions.

Although the name may sound different, but the purpose of
"stop.shutdown_timeout" has been the same as the one you suggested, at
least in my mind. We don't have to introduce new one, and I'm fine with
renaming it.

One thing you may disagree is that traffic_server should also be able to
shutdown gracefully without other processes. Since we support running
traffic_server standalone for some purposes, I think it should be possible.


> Also, while you are draining, you should also turn keep alive off and
> automatically turn down the idle session timeouts.


Ok, the keep-alive one will be done by the PR#2106. Turning down the idle
session timeouts sounds good idea. I'll send another PR later.


>
>
> "traffic_ctl server stop --drain" is also in my mind. Once we have one
>>>
>>>> generalized way for draining in traffic_server, we can invoke it from
>>>> anywhere we want (e.g. from traffic_manager, TS API, metric). Currently
>>>> it
>>>> is invoked using the flag, but eventually, we should be able to decouple
>>>> draining and shutdown, and should also be able to remove the flag and
>>>> sleep().
>>>>
>>>> Doing these all at once is heavy and the PR would be big, and it slows
>>>> review and progress, IMO. You might want full featured graceful shutdown
>>>> but this is a reason why I didn't ask to do it.
>>>>
>>>>
>>> I’m fine with (and even prefer) doing this in pieces, but we need to
>>> agree
>>> on the direction and that the end result will be what we want.
>>>
>>
>> Ok, probably the step after PR#2106 will be how we invoke graceful
>> shutdown
>> and which process should wait for / determine the completion of draining.
>> These
>> should be the points we still need to discuss and agree on.
>>
>> Actually, I basically agree on your last suggestion. I'm not keeping the
>> configuration and the flag, but just left them on the PR.
>>
>
> OK, that sounds good :)
>
>
>
>> Thanks,
>>>
>>>> Masakazu
>>>>
>>>>
>>>>
>>>> On Tue, May 9, 2017 at 2:40 PM, CrazyCow <zhangzizhong0...@gmail.com>
>>>> wrote:
>>>>
>>>> I don't disagree. The reasons I chose this way are:
>>>>
>>>>> 1. We are using other stuff in my team instead of traffic_ctl to manage
>>>>> the
>>>>> process and do the upgrade.
>>>>> 2. traffic_ctl --drain can only support HTTP and it can only be used
>>>>> when
>>>>> restarting ATS. That makes it hardly useful in our use case.
>>>>>
>>>>> 2017-05-08 22:16 GMT-07:00 James Peach <jpe...@apache.org>:
>>>>>
>>>>>
>>>>>
>>>>>> On 8 May 2017, at 21:48, Miles Libbey wrote:
>>>>>>
>>>>>> We'd also like a bit more fine grained control in the process -- we
>>>>>>
>>>>>> frequently want to perform maintenance on a server (upgrading ATS;
>>>>>>> upgrading the OS; performing hardware changes, etc) after draining
>>>>>>> but
>>>>>>> before restarting ATS. I suppose this would mean allowing the --drain
>>>>>>> option to apply to traffic_Ctl server stop.
>>>>>>>
>>>>>>>
>>>>>>> Yes I agree that draining make sense for stop as well as restart.
>>>>>>
>>>>>>
>>>>>>
>>>>>> miles
>>>>>>
>>>>>>>
>>>>>>> On Mon, May 8, 2017 at 8:19 PM, James Peach <jpe...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>> This patch adds another separate shutdown mechanism that only works
>>>>>>> for
>>>>>>>
>>>>>>>> HTTP/2. I think that we really ought to have a single, well-defined
>>>>>>>> graceful
>>>>>>>> shutdown that works for all protocols.
>>>>>>>>
>>>>>>>> In HTTP/1.1 it works like this:
>>>>>>>>         - You (possibly dynamically) set
>>>>>>>> proxy.config.restart.active_client_threshold
>>>>>>>>         - You run “traffic_ctl server restart —drain”
>>>>>>>>
>>>>>>>> Then, once client connections have been drained to the threshold,
>>>>>>>> traffic_server restarts. Note that this assumes that you have some
>>>>>>>> additional orchestration that triggers header_rewrite to inject a
>>>>>>>> “Connection: close” header, and tell the GSLB to stop sending new
>>>>>>>> connections.
>>>>>>>>
>>>>>>>> After this HTTP/2 change, there is a new graceful shutdown path
>>>>>>>>         - You (at startup only) set proxy.config.stop.shutdown_tim
>>>>>>>> eout
>>>>>>>>         - You send a signal to traffic_server
>>>>>>>>
>>>>>>>> This flips a global variable to the “drain” state, then sleeps in
>>>>>>>> the
>>>>>>>> signal
>>>>>>>> handler until the timeout is reached. In HTTP/2 only, any new
>>>>>>>>
>>>>>>>> connections
>>>>>>>
>>>>>>
>>>>> will be accepted and then immediately closed.
>>>>>>
>>>>>>>
>>>>>>>> To rationalize these disparate approaches, I suggest that we go back
>>>>>>>> to
>>>>>>>> the
>>>>>>>> traffic_ctl methodology, and enhance it so that it sends a message
>>>>>>>> to
>>>>>>>> traffic_server that puts it into a known draining state. This should
>>>>>>>> be
>>>>>>>> published in a metric and should be reversible so you can abort the
>>>>>>>> drain.
>>>>>>>> The metric can be observed by HTTP/1.1 and HTTP/2 to take
>>>>>>>> appropriate
>>>>>>>> action. We should also add a new setting
>>>>>>>> “proxy.config.restart.active_client_timeout” (or something like
>>>>>>>> that)
>>>>>>>>
>>>>>>>> to
>>>>>>>
>>>>>>
>>>>> handle the maximum time to wait for traffic to drain.
>>>>>>
>>>>>>>
>>>>>>>> I’m not sure whether it is a good idea to close connections in
>>>>>>>> HTTP/2
>>>>>>>> while
>>>>>>>> we are in draining state. If there is a desire for this, I would
>>>>>>>> like
>>>>>>>>
>>>>>>>> it
>>>>>>>
>>>>>>
>>>>> to
>>>>>>
>>>>>>> be configurable (defaulting to off).
>>>>>>>>
>>>>>>>> On 8 May 2017, at 18:17, Masakazu Kitajo wrote:
>>>>>>>>
>>>>>>>> Merged #1710.
>>>>>>>>
>>>>>>>>
>>>>>>>>> --
>>>>>>>>> You are receiving this because you are subscribed to this thread.
>>>>>>>>> Reply to this email directly or view it on GitHub:
>>>>>>>>> https://github.com/apache/trafficserver/pull/1710#event-1073784726
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>

Re: [apache/trafficserver] HTTP2 drain (#1710)

Reply via email to