Reflexions around timers in the logs

Willy Tarreau Tue, 23 Aug 2016 02:35:59 -0700

Hi all,

For quite some time now I've been annoyed by the browsers' preconnect
feature causing unexploitable %Tq in the logs. But to defend the browsers,
we're having the same issue with the keep-alive feature as well.


So I planned to change %Tq to %Ti+%Tq where Ti would be the idle time
before the first byte of request and Tq would be (as it used to be a
decade ago) the time to receive the complete request.

Recently Thierry discussed with me about the same issue and wanted also
to handle %Th (the time to complete the connection handshake, eg: SSL).

He initially proposed to have a new field in addition to the existing
ones to report the request time only, and not to change Tq. I asked him
to modify this to have Tq replaced with what it should really be but now
after having seen some such logs I'm having a doubt about the relevance
of this choice.

While it's appealing to know that every user of httplog will find again
usable logs (regarding the Tq field), I realized that this will offset
all dates by an invisible Ti value which will not be reported by default
if we don't change the default log format. Example :

  Aug 23 10:55:12 haproxy[3848]: 127.0.0.1:47034 [23/Aug/2016:10:55:09.818] 
test test/test1 246/0/0/1/2365 200 79 - - ---- 1/1/0/0/3 0/0 "GET / HTTP/1.1"

Here one wouldbe tempted to think that the complete request was received at
about 10:55:10.064 (10:55:09.818+.246) and that the server received the
connection at about the same time. But in fact this is wrong because if we
log the new %Th and %Ti, we get this :

  Aug 23 10:55:12 haproxy[3848]: 127.0.0.1:47034 [23/Aug/2016:10:55:09.818] 
test test/test1 0/2118/246/0/0/1/2365 200 79 - - ---- 1/1/0/0/3 0/0 "GET / 
HTTP/1.1"

And it clearly shows that it took me 2 seconds to type this request so all
dates are off by 2 seconds. Thus I'm tempted to roll back to Thierry's
initialy proposal but I'd like to get other opinions so that we try to do
the best thing.

Among the other possible choices, I thought that maybe we could have a new
date field which reports the date of end the start of the request and use
this one in conjunction with Tq in the HTTP logs. But then this new date
field should have the same name as the existing one in order not to confuse
people using their own modified format... which in turn causes an
inconsistency with the TCP log format, so this doesn't seem like a very
good idea.

Maybe something between all of this could consist in this :

  - we create a new date field (eg: %tr) to note the date the request
    starts to arrive (end of the idle period)
  - we use a new field for the request-only time (eg: %TR)
  - we change the default HTTP log format to use both.

That would give us %Tq = %Th + %Ti + %TR, and %tr = %t + %Th + %Ti.

This way people who use "option httplog" will suddenly have much more
exploitable logs without changing anything, as the dates will now reflect
the start of receipt of requests and not the end of last response, and the
other ones will not be harmed by any change.

Any opinion on this is welcome. Note: the main work is done, it's only a
matter of defining what we want to see in the logs, it's easy to adapt
once we know what we want. BTW, I'm also thinking about adding a field to
indicate if a request is the first one of a connection or not.

I'm attaching Thierry's revised patch for reference (the one implementing
my initial proposal that I'm not happy with anymore).

Thanks,
Willy

>From 9ff1a9f5be42f34e3e102db0581856a594d8fb60 Mon Sep 17 00:00:00 2001
From: "Thierry FOURNIER / OZON.IO" <[email protected]>
Date: Thu, 28 Jul 2016 17:19:45 +0200
Subject: MEDIUM: log: Decompose %Tq in %Th %Ti %Tq

Tq is the time between the instant the connection is accepted and a
complete valid request is received. This time includes the handshake
(SSL / Proxy-Protocol), the idle when the browser does preconnect and
the request reception.

This patch decomposes %Tq in 3 measurements names %Th, %Ti, and %Tq
which returns respectively the handshake time, the idle time and the
duration of valid request reception.

%Th is avalaible for TCP and HTTP sessions, %Ti and %Tq are only
avalaible for HTTP connexions.
---
 doc/configuration.txt  | 70 +++++++++++++++++++++++++++++++++-----------------
 include/types/log.h    |  2 ++
 include/types/stream.h |  2 ++
 src/log.c              | 26 +++++++++++++++++--
 src/proto_http.c       |  8 ++++++
 src/stream.c           |  4 +++
 6 files changed, 86 insertions(+), 26 deletions(-)

diff --git a/doc/configuration.txt b/doc/configuration.txt
index 3bb70a4..61903f0 100644
--- a/doc/configuration.txt
+++ b/doc/configuration.txt
@@ -14687,8 +14687,8 @@ Detailed fields description :
     a full HTTP request, not counting data. It can be "-1" if the connection
     was aborted before a complete request could be received. It should always
     be very small because a request generally fits in one single packet. Large
-    times here generally indicate network trouble between the client and
-    haproxy. See "Timers" below for more details.
+    times here generally indicate network issues between the client and haproxy
+    or requests being typed by hand. See "Timers" below for more details.
 
   - "Tw" is the total time in milliseconds spent waiting in the various queues.
     It can be "-1" if the connection was aborted before reaching the queue.
@@ -14926,7 +14926,9 @@ Please refer to the table below for currently defined 
variables :
   |   | %Tc  | Tc                                            | numeric     |
   |   | %Td  | Td = Tt - (Tq + Tw + Tc + Tr)                 | numeric     |
   |   | %Tl  | local_date_time                               | date        |
-  | H | %Tq  | Tq                                            | numeric     |
+  |   | %Th  | connexion handshake time (SSL and proxy proto)| numeric     |
+  | H | %Ti  | idle time before the HTTP request             | numeric     |
+  | H | %Tq  | time to receive the full request              | numeric     |
   | H | %Tr  | Tr                                            | numeric     |
   |   | %Ts  | timestamp                                     | numeric     |
   |   | %Tt  | Tt                                            | numeric     |
@@ -15082,13 +15084,33 @@ Timers provide a great help in troubleshooting 
network problems. All values are
 reported in milliseconds (ms). These timers should be used in conjunction with
 the session termination flags. In TCP mode with "option tcplog" set on the
 frontend, 3 control points are reported under the form "Tw/Tc/Tt", and in HTTP
-mode, 5 control points are reported under the form "Tq/Tw/Tc/Tr/Tt" :
+mode, 5 control points are reported under the form "Tq/Tw/Tc/Tr/Tt". In
+addition, two other measures are provided, "Th" and "Ti".
+
+  - Th: total time to accept tcp connection and execute handshakes for low 
level
+    protocols. Currently, these protocoles are proxy-protocol and SSL. This may
+    only happen once during the whole connection's lifetime. A large time here
+    may indicate that the client only pre-established the connection without
+    speaking, that it is experiencing network issues preventing it from
+    completing a handshake in a reasonable time (eg: MTU issues), or that an
+    SSL handshake was very expensive to compute.
+
+  - Ti: is the idle time before the HTTP request (HTTP mode only). This timer
+    counts between the end of the handshakes and the first byte of the HTTP
+    request. When dealing with a second request in keep-alive mode, it starts
+    to count after the end of the transmission the previous response. Some
+    browsers pre-establish connections to a server in order to reduce the
+    latency of a future request, and keep them pending until they need it. This
+    delay will be reported as the idle time. A value of -1 indicates that
+    nothing was received on the connection.
 
   - Tq: total time to get the client request (HTTP mode only). It's the time
-    elapsed between the moment the client connection was accepted and the
-    moment the proxy received the last HTTP header. The value "-1" indicates
-    that the end of headers (empty line) has never been seen. This happens when
-    the client closes prematurely or times out.
+    elapsed between the first bytes received and the moment the proxy received
+    the empty line marking the end of the HTTP headers. The value "-1"
+    indicates that the end of headers has never been seen. This happens when
+    the client closes prematurely or times out. This time is usually very short
+    since most requests fit in a single packet. A large time may indicate a
+    request typed by hand during a test.
 
   - Tw: total time spent in the queues waiting for a connection slot. It
     accounts for backend queue as well as the server queues, and depends on the
@@ -15116,15 +15138,15 @@ mode, 5 control points are reported under the form 
"Tq/Tw/Tc/Tr/Tt" :
 
   - Tt: total session duration time, between the moment the proxy accepted it
     and the moment both ends were closed. The exception is when the "logasap"
-    option is specified. In this case, it only equals (Tq+Tw+Tc+Tr), and is
-    prefixed with a '+' sign. From this field, we can deduce "Td", the data
+    option is specified. In this case, it only equals (Th+Ti+Tq+Tw+Tc+Tr), and
+    is prefixed with a '+' sign. From this field, we can deduce "Td", the data
     transmission time, by subtracting other timers when valid :
 
-        Td = Tt - (Tq + Tw + Tc + Tr)
+        Td = Tt - (Th + Ti + Tq + Tw + Tc + Tr)
 
     Timers with "-1" values have to be excluded from this equation. In TCP
-    mode, "Tq" and "Tr" have to be excluded too. Note that "Tt" can never be
-    negative.
+    mode, "Ti", "Tq" and "Tr" have to be excluded too. Note that "Tt" can never
+    be negative.
 
 These timers provide precious indications on trouble causes. Since the TCP
 protocol defines retransmit delays of 3, 6, 12... seconds, we know for sure
@@ -15135,17 +15157,17 @@ session has been aborted on timeout.
 
 Most common cases :
 
-  - If "Tq" is close to 3000, a packet has probably been lost between the
-    client and the proxy. This is very rare on local networks but might happen
-    when clients are on far remote networks and send large requests. It may
-    happen that values larger than usual appear here without any network cause.
-    Sometimes, during an attack or just after a resource starvation has ended,
-    haproxy may accept thousands of connections in a few milliseconds. The time
-    spent accepting these connections will inevitably slightly delay processing
-    of other connections, and it can happen that request times in the order of
-    a few tens of milliseconds are measured after a few thousands of new
-    connections have been accepted at once. Using one of the keep-alive modes
-    may display larger request times since "Tq" also measures the time spent
+  - If "Th" or "Ti" are close to 3000, a packet has probably been lost between
+    the client and the proxy. This is very rare on local networks but might
+    happen when clients are on far remote networks and send large requests. It
+    may happen that values larger than usual appear here without any network
+    cause. Sometimes, during an attack or just after a resource starvation has
+    ended, haproxy may accept thousands of connections in a few milliseconds.
+    The time spent accepting these connections will inevitably slightly delay
+    processing of other connections, and it can happen that request times in 
the
+    order of a few tens of milliseconds are measured after a few thousands of
+    new connections have been accepted at once. Using one of the keep-alive
+    modes may display larger idle times since "Ti" measures the time spent
     waiting for additional requests.
 
   - If "Tc" is close to 3000, a packet has probably been lost between the
diff --git a/include/types/log.h b/include/types/log.h
index 25d872e..ce285ab 100644
--- a/include/types/log.h
+++ b/include/types/log.h
@@ -77,6 +77,8 @@ enum {
        LOG_FMT_BYTES,
        LOG_FMT_BYTES_UP,
        LOG_FMT_T,
+       LOG_FMT_TH,
+       LOG_FMT_TI,
        LOG_FMT_TQ,
        LOG_FMT_TW,
        LOG_FMT_TC,
diff --git a/include/types/stream.h b/include/types/stream.h
index 8687726..17e74b8 100644
--- a/include/types/stream.h
+++ b/include/types/stream.h
@@ -97,6 +97,8 @@ struct strm_logs {
        int level;                      /* log level to force + 1 if > 0, -1 = 
no log */
        struct timeval accept_date;     /* date of the stream's accept() in 
user date */
        struct timeval tv_accept;       /* date of the stream's accept() in 
internal date (monotonic) */
+       long t_handshake;               /* hanshake duration, -1 if never 
occurs */
+       long t_idle;                    /* idle duration, -1 if never occurs */
        struct timeval tv_request;      /* date the request arrives, {0,0} if 
never occurs */
        long  t_queue;                  /* delay before the stream gets out of 
the connect queue, -1 if never occurs */
        long  t_connect;                /* delay before the connect() to the 
server succeeds, -1 if never occurs */
diff --git a/src/log.c b/src/log.c
index 4f4ede9..617f926 100644
--- a/src/log.c
+++ b/src/log.c
@@ -125,8 +125,10 @@ static const struct logformat_type logformat_keywords[] = {
        { "ST", LOG_FMT_STATUS, PR_MODE_TCP, LW_RESP, NULL },   /* status code 
*/
        { "T", LOG_FMT_DATEGMT, PR_MODE_TCP, LW_INIT, NULL },   /* date GMT */
        { "Tc", LOG_FMT_TC, PR_MODE_TCP, LW_BYTES, NULL },       /* Tc */
+       { "Th", LOG_FMT_TH, PR_MODE_TCP, LW_BYTES, NULL },       /* Time 
handshake */
+       { "Ti", LOG_FMT_TI, PR_MODE_HTTP, LW_BYTES, NULL },       /* Time idle 
*/
        { "Tl", LOG_FMT_DATELOCAL, PR_MODE_TCP, LW_INIT, NULL },   /* date 
local timezone */
-       { "Tq", LOG_FMT_TQ, PR_MODE_HTTP, LW_BYTES, NULL },       /* Tq */
+       { "Tq", LOG_FMT_TQ, PR_MODE_HTTP, LW_BYTES, NULL },       /* Time for 
receiving valid request */
        { "Tr", LOG_FMT_TR, PR_MODE_HTTP, LW_BYTES, NULL },       /* Tr */
        { "Td", LOG_FMT_TD, PR_MODE_TCP, LW_BYTES, NULL },       /* Td = Tt - 
(Tq + Tw + Tc + Tr) */
        { "Ts", LOG_FMT_TS, PR_MODE_TCP, LW_INIT, NULL },   /* timestamp GMT */
@@ -1620,8 +1622,28 @@ int build_logline(struct stream *s, char *dst, size_t 
maxsize, struct list *list
                                last_isspace = 0;
                                break;
 
+                       case LOG_FMT_TH: // %Th
+                               ret = ltoa_o(s->logs.t_handshake,
+                                               tmplog, dst + maxsize - tmplog);
+                               if (ret == NULL)
+                                       goto out;
+                               tmplog = ret;
+                               last_isspace = 0;
+                               break;
+
+                       case LOG_FMT_TI: // %Ti
+                               ret = ltoa_o(s->logs.t_idle,
+                                               tmplog, dst + maxsize - tmplog);
+                               if (ret == NULL)
+                                       goto out;
+                               tmplog = ret;
+                               last_isspace = 0;
+                               break;
+
                        case LOG_FMT_TQ: // %Tq
-                               ret = ltoa_o(t_request, tmplog, dst + maxsize - 
tmplog);
+                               /* Note: all timers are valid if t_request is 
valid */
+                               ret = ltoa_o((t_request >= 0) ? t_request - 
s->logs.t_idle - s->logs.t_handshake : -1,
+                                            tmplog, dst + maxsize - tmplog);
                                if (ret == NULL)
                                        goto out;
                                tmplog = ret;
diff --git a/src/proto_http.c b/src/proto_http.c
index 3e18a6c..c8df1a4 100644
--- a/src/proto_http.c
+++ b/src/proto_http.c
@@ -2455,6 +2455,12 @@ int http_wait_for_request(struct stream *s, struct 
channel *req, int an_bit)
         * data later, which is much more complicated.
         */
        if (buffer_not_empty(req->buf) && msg->msg_state < HTTP_MSG_ERROR) {
+
+               /* This point is executed when some data is avalaible for 
analysis,
+                * so we log the end of the idle time. */
+               if (s->logs.t_idle == -1)
+                       s->logs.t_idle = tv_ms_elapsed(&s->logs.tv_accept, 
&now) - s->logs.t_handshake;
+
                if (txn->flags & TX_NOT_FIRST) {
                        if (unlikely(!channel_is_rewritable(req))) {
                                if (req->flags & 
(CF_SHUTW|CF_SHUTW_NOW|CF_WRITE_ERROR|CF_WRITE_TIMEOUT))
@@ -5026,6 +5032,8 @@ void http_end_txn_clean_session(struct stream *s)
 
        s->logs.accept_date = date; /* user-visible date for logging */
        s->logs.tv_accept = now;  /* corrected date for internal use */
+       s->logs.t_handshake = 0; /* There are no handshake in keep alive 
connection. */
+       s->logs.t_idle = -1;
        tv_zero(&s->logs.tv_request);
        s->logs.t_queue = -1;
        s->logs.t_connect = -1;
diff --git a/src/stream.c b/src/stream.c
index 3eb5265..b1d7936 100644
--- a/src/stream.c
+++ b/src/stream.c
@@ -93,6 +93,10 @@ struct stream *stream_new(struct session *sess, struct task 
*t, enum obj_type *o
        s->logs.level = 0;
        s->logs.accept_date = sess->accept_date; /* user-visible date for 
logging */
        s->logs.tv_accept = sess->tv_accept;   /* corrected date for internal 
use */
+       /* This function is called just after the handshake, so the handshake 
duration is
+        * between the accept time and now. */
+       s->logs.t_handshake = tv_ms_elapsed(&sess->tv_accept, &now);
+       s->logs.t_idle = -1;
        tv_zero(&s->logs.tv_request);
        s->logs.t_queue = -1;
        s->logs.t_connect = -1;
-- 
1.7.12.1

Reflexions around timers in the logs

Reply via email to