[ 
https://issues.apache.org/jira/browse/KAFKA-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14352321#comment-14352321
 ] 

Jun Rao commented on KAFKA-2010:
--------------------------------

The long delay always happens during the shutdown of a multi-node Kafka cluster 
with the controlled shutdown disabled. The typical pattern is the following. 

1. Broker X, which is the controller, initiates the shutdown process.
2. The socket server of broker X is shut down.

After step 2, two things are happening in parallel.
A. Another broker is still trying to fetch data from broker X since the leaders 
on broker X haven't been moved. The replica fetcher thread keeps trying to 
connect to broker X, but keeps failing. Initially, the failing happens 
immediately. However, after a few seconds, the connecting will block for the 
socket connection timeout (configured to 1.5 secs in the unit test) and then 
fail.
B. In broker X,
B1. The shutdown of the controller is initiated. One of the steps is to shut 
down the RequestSendThread that's responsible for sending requests from the 
controller to broker X itself.
B2. The RequestSendThread is in the middle of sending a controller request to 
broker X. Since the socket server of broker X is already shut down. The sending 
should fail immediately. However, what happens is that the RequestSendThread 
blocks on connecting to broker X for the socket connect timeout, which has the 
default value of 30 secs.

Because of this, the shutdown of broker X is delayed by at least 30 secs. This 
seems to happen more likely after kafka-1971. Before kafka-1971, we de-register 
a broker's ZK registration before shutting down the socket server in step 2 
above. After kafka-1971, broker de-registers itself from ZK in the last step in 
the shutdown process (so it happens after shutting down the socket server).

The following is the output from a tcpdump. As you can see, starting from 
14:29:24.279037, the connecting from both the replica fetcher thread and the 
RequestSendThread to broker X's port (58693) starts to fail. Initially, a TCP 
RESET is sent immediately after a TCP SYN (which matches the TCP spec). 
However, starting from 14:29:25.142164, no RESET is sent after SYN. Therefore 
the connecting failure is only detected after the timeout.

It's not very clear to me why RESET is not sent after some time. Anyone has a 
good explanation of this behavior? A few more details.
(1) This happens on my mac. There is no firewall configured.
(2) The unit test open around 20 consecutive ports.
(3) The reconnect in the replica fetch thread has no backoff (it probably 
should). So, it will try to reconnect immediately after the previous connecting 
fails.
(4) The reconnect in the RequestSendThread has a 300ms backoff.
(5) There seems to be a gap between 14:29:24.477485 and 14:29:25.142164 in the 
tcpdump output. It's not clear to me why since in that time window, both the 
replica fetch thread and
RequestSendThread are still connecting and failing.

tcpdump output:
14:29:24.279037 IP localhost.58973 > localhost.58693: Flags [S], seq 
3455869244, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 
1107706836 ecr 0,sackOK,eol], length 0
14:29:24.361327 IP localhost.58974 > localhost.58693: Flags [S], seq 
3041388581, win 65535, options [mss 16344,nop,wscale 4,nop,nop,TS val 
1107706906 ecr 0,sackOK,eol], length 0
14:29:24.361376 IP localhost.58693 > localhost.58974: Flags [R.], seq 0, ack 
3041388582, win 0, length 0
14:29:24.405285 IP localhost.58977 > localhost.58693: Flags [S], seq 
1164560537, win 65535, options [mss 16344,nop,wscale 4,nop,nop,TS val 
1107706945 ecr 0,sackOK,eol], length 0
14:29:24.405384 IP localhost.58693 > localhost.58977: Flags [R.], seq 0, ack 
1164560538, win 0, length 0
14:29:24.439743 IP localhost.58973 > localhost.58693: Flags [S], seq 
3455869244, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 
1107706973 ecr 0,sackOK,eol], length 0
14:29:24.439844 IP localhost.58693 > localhost.58973: Flags [R.], seq 0, ack 
3455869245, win 0, length 0
14:29:24.442677 IP localhost.58978 > localhost.58693: Flags [S], seq 
1119137787, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 
1107706975 ecr 0,sackOK,eol], length 0
14:29:24.442693 IP localhost.58693 > localhost.58978: Flags [R.], seq 0, ack 
1119137788, win 0, length 0
14:29:24.443898 IP localhost.58979 > localhost.58693: Flags [S], seq 
1843362092, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 
1107706977 ecr 0,sackOK,eol], length 0
14:29:24.443912 IP localhost.58693 > localhost.58979: Flags [R.], seq 0, ack 
1843362093, win 0, length 0
14:29:24.445085 IP localhost.58980 > localhost.58693: Flags [S], seq 
2336267591, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 
1107706978 ecr 0,sackOK,eol], length 0
14:29:24.445100 IP localhost.58693 > localhost.58980: Flags [R.], seq 0, ack 
2336267592, win 0, length 0
14:29:24.446303 IP localhost.58981 > localhost.58693: Flags [S], seq 
3297569415, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 
1107706979 ecr 0,sackOK,eol], length 0
14:29:24.446321 IP localhost.58693 > localhost.58981: Flags [R.], seq 0, ack 
3297569416, win 0, length 0
14:29:24.447565 IP localhost.58982 > localhost.58693: Flags [S], seq 
2742157513, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 
1107706980 ecr 0,sackOK,eol], length 0
14:29:24.447575 IP localhost.58693 > localhost.58982: Flags [R.], seq 0, ack 
2742157514, win 0, length 0
14:29:24.448901 IP localhost.58983 > localhost.58693: Flags [S], seq 
3569996507, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 
1107706981 ecr 0,sackOK,eol], length 0
14:29:24.448913 IP localhost.58693 > localhost.58983: Flags [R.], seq 0, ack 
3569996508, win 0, length 0
14:29:24.450233 IP localhost.58984 > localhost.58693: Flags [S], seq 
2189096629, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 
1107706983 ecr 0,sackOK,eol], length 0
14:29:24.450254 IP localhost.58693 > localhost.58984: Flags [R.], seq 0, ack 
2189096630, win 0, length 0
14:29:24.451602 IP localhost.58985 > localhost.58693: Flags [S], seq 
3668623372, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 
1107706984 ecr 0,sackOK,eol], length 0
14:29:24.451616 IP localhost.58693 > localhost.58985: Flags [R.], seq 0, ack 
3668623373, win 0, length 0
14:29:24.453369 IP localhost.58986 > localhost.58693: Flags [S], seq 
3841327697, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 
1107706986 ecr 0,sackOK,eol], length 0
14:29:24.453415 IP localhost.58693 > localhost.58986: Flags [R.], seq 0, ack 
3841327698, win 0, length 0
14:29:24.455264 IP localhost.58987 > localhost.58693: Flags [S], seq 
1397446064, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 
1107706987 ecr 0,sackOK,eol], length 0
14:29:24.455285 IP localhost.58693 > localhost.58987: Flags [R.], seq 0, ack 
1397446065, win 0, length 0
14:29:24.456635 IP localhost.58988 > localhost.58693: Flags [S], seq 658704010, 
win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107706989 ecr 
0,sackOK,eol], length 0
14:29:24.456655 IP localhost.58693 > localhost.58988: Flags [R.], seq 0, ack 
658704011, win 0, length 0
14:29:24.457805 IP localhost.58989 > localhost.58693: Flags [S], seq 108288592, 
win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107706990 ecr 
0,sackOK,eol], length 0
14:29:24.457821 IP localhost.58693 > localhost.58989: Flags [R.], seq 0, ack 
108288593, win 0, length 0
14:29:24.458936 IP localhost.58990 > localhost.58693: Flags [S], seq 834439671, 
win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107706991 ecr 
0,sackOK,eol], length 0
14:29:24.458976 IP localhost.58693 > localhost.58990: Flags [R.], seq 0, ack 
834439672, win 0, length 0
14:29:24.460094 IP localhost.58991 > localhost.58693: Flags [S], seq 
3341628230, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 
1107706992 ecr 0,sackOK,eol], length 0
14:29:24.460108 IP localhost.58693 > localhost.58991: Flags [R.], seq 0, ack 
3341628231, win 0, length 0
14:29:24.461227 IP localhost.58992 > localhost.58693: Flags [S], seq 
1072784276, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 
1107706993 ecr 0,sackOK,eol], length 0
14:29:24.461242 IP localhost.58693 > localhost.58992: Flags [R.], seq 0, ack 
1072784277, win 0, length 0
14:29:24.463193 IP localhost.58993 > localhost.58693: Flags [S], seq 
3613406588, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 
1107706994 ecr 0,sackOK,eol], length 0
14:29:24.463213 IP localhost.58693 > localhost.58993: Flags [R.], seq 0, ack 
3613406589, win 0, length 0
14:29:24.464639 IP localhost.58994 > localhost.58693: Flags [S], seq 
2569113338, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 
1107706995 ecr 0,sackOK,eol], length 0
14:29:24.464658 IP localhost.58693 > localhost.58994: Flags [R.], seq 0, ack 
2569113339, win 0, length 0
14:29:24.465911 IP localhost.58995 > localhost.58693: Flags [S], seq 897053363, 
win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107706996 ecr 
0,sackOK,eol], length 0
14:29:24.465924 IP localhost.58693 > localhost.58995: Flags [R.], seq 0, ack 
897053364, win 0, length 0
14:29:24.467197 IP localhost.58996 > localhost.58693: Flags [S], seq 
1556778022, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 
1107706997 ecr 0,sackOK,eol], length 0
14:29:24.467210 IP localhost.58693 > localhost.58996: Flags [R.], seq 0, ack 
1556778023, win 0, length 0
14:29:24.468455 IP localhost.58997 > localhost.58693: Flags [S], seq 522890491, 
win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 1107706998 ecr 
0,sackOK,eol], length 0
14:29:24.468469 IP localhost.58693 > localhost.58997: Flags [R.], seq 0, ack 
522890492, win 0, length 0
14:29:24.469650 IP localhost.58998 > localhost.58693: Flags [S], seq 
3272035680, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 
1107706999 ecr 0,sackOK,eol], length 0
14:29:24.469661 IP localhost.58693 > localhost.58998: Flags [R.], seq 0, ack 
3272035681, win 0, length 0
14:29:24.470846 IP localhost.58999 > localhost.58693: Flags [S], seq 
1810034270, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 
1107707000 ecr 0,sackOK,eol], length 0
14:29:24.470861 IP localhost.58693 > localhost.58999: Flags [R.], seq 0, ack 
1810034271, win 0, length 0
14:29:24.472034 IP localhost.59000 > localhost.58693: Flags [S], seq 
2410840076, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 
1107707001 ecr 0,sackOK,eol], length 0
14:29:24.472046 IP localhost.58693 > localhost.59000: Flags [R.], seq 0, ack 
2410840077, win 0, length 0
14:29:24.473404 IP localhost.59001 > localhost.58693: Flags [S], seq 
1786606508, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 
1107707002 ecr 0,sackOK,eol], length 0
14:29:24.473468 IP localhost.58693 > localhost.59001: Flags [R.], seq 0, ack 
1786606509, win 0, length 0
14:29:24.476083 IP localhost.59002 > localhost.58693: Flags [S], seq 
2237271129, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 
1107707004 ecr 0,sackOK,eol], length 0
14:29:24.476098 IP localhost.58693 > localhost.59002: Flags [R.], seq 0, ack 
2237271130, win 0, length 0
14:29:24.477475 IP localhost.59003 > localhost.58693: Flags [S], seq 
1523915848, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 
1107707006 ecr 0,sackOK,eol], length 0
14:29:24.477485 IP localhost.58693 > localhost.59003: Flags [R.], seq 0, ack 
1523915849, win 0, length 0
14:29:25.142164 IP localhost.59224 > localhost.58693: Flags [S], seq 
2276013461, win 65535, options [mss 16344,nop,wscale 4,nop,nop,TS val 
1107707572 ecr 0,sackOK,eol], length 0
14:29:25.142176 IP localhost.59223 > localhost.58693: Flags [S], seq 
4039966681, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 
1107707572 ecr 0,sackOK,eol], length 0
14:29:25.342565 IP localhost.59224 > localhost.58693: Flags [S], seq 
2276013461, win 65535, options [mss 16344,nop,wscale 4,nop,nop,TS val 
1107707748 ecr 0,sackOK,eol], length 0
14:29:25.342578 IP localhost.59223 > localhost.58693: Flags [S], seq 
4039966681, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 
1107707748 ecr 0,sackOK,eol], length 0
14:29:25.443273 IP localhost.59224 > localhost.58693: Flags [S], seq 
2276013461, win 65535, options [mss 16344,nop,wscale 4,nop,nop,TS val 
1107707848 ecr 0,sackOK,eol], length 0
14:29:25.443319 IP localhost.59223 > localhost.58693: Flags [S], seq 
4039966681, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 
1107707848 ecr 0,sackOK,eol], length 0
14:29:25.644115 IP localhost.59224 > localhost.58693: Flags [S], seq 
2276013461, win 65535, options [mss 16344,nop,wscale 4,nop,nop,TS val 
1107708047 ecr 0,sackOK,eol], length 0
14:29:25.644126 IP localhost.59223 > localhost.58693: Flags [S], seq 
4039966681, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 
1107708047 ecr 0,sackOK,eol], length 0
14:29:25.745269 IP localhost.59224 > localhost.58693: Flags [S], seq 
2276013461, win 65535, options [mss 16344,nop,wscale 4,nop,nop,TS val 
1107708148 ecr 0,sackOK,eol], length 0
14:29:25.846262 IP localhost.59223 > localhost.58693: Flags [S], seq 
4039966681, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 
1107708248 ecr 0,sackOK,eol], length 0
14:29:25.946931 IP localhost.59224 > localhost.58693: Flags [S], seq 
2276013461, win 65535, options [mss 16344,nop,wscale 4,nop,nop,TS val 
1107708348 ecr 0,sackOK,eol], length 0
14:29:26.249474 IP localhost.59223 > localhost.58693: Flags [S], seq 
4039966681, win 65535, options [mss 16344,nop,wscale 3,nop,nop,TS val 
1107708650 ecr 0,sackOK,eol], length 0
14:29:26.350322 IP localhost.59224 > localhost.58693: Flags [S], seq 
2276013461, win 65535, options [mss 16344,nop,wscale 4,nop,nop,TS val 
1107708750 ecr 0,sackOK,eol], length 0
14:29:27.255626 IP localhost.59224 > localhost.58693: Flags [S], seq 
2276013461, win 65535, options [mss 16344,nop,wscale 4,nop,nop,TS val 
1107709606 ecr 0,sackOK,eol], length 0
14:29:28.967861 IP localhost.59224 > localhost.58693: Flags [S], seq 
2276013461, win 65535, options [mss 16344,nop,wscale 4,nop,nop,TS val 
1107711260 ecr 0,sackOK,eol], length 0
14:29:32.293864 IP localhost.59224 > localhost.58693: Flags [S], seq 
2276013461, win 65535, options [mss 16344,sackOK,eol], length 0
14:29:38.938614 IP localhost.59224 > localhost.58693: Flags [S], seq 
2276013461, win 65535, options [mss 16344,sackOK,eol], length 0
14:29:45.481879 IP localhost.59224 > localhost.58693: Flags [S], seq 
2276013461, win 65535, options [mss 16344,sackOK,eol], length 0


> unit tests sometimes are slow during shutdown
> ---------------------------------------------
>
>                 Key: KAFKA-2010
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2010
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Jun Rao
>            Assignee: Jun Rao
>             Fix For: 0.8.3
>
>
> Our unit tests in trunk seem to be slower than before. The slowness seems to 
> be due to a handful of tests.
> For example, if you run the following test,  sometimes it can take more than 
> 40
> secs, while normally it takes less than 10 secs.
> ./gradlew -i cleanTest core:test --tests 
> kafka.admin.AddPartitionsTest.testIncrementPartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to