Hi All,

I have a question about the intended producer behavior when the broker is lost:
am I seeing a bug, or does the code work as specified?

What I do and see using trunk:

*a) No message send timeout at all if there is no available broker*
- no broker is started
- consoleproducer is started using --broker-list localhost:9092 --topic
test --request-timeout-ms 2000 --property retries=3
- I try to send some messages. The request doesn't time out; I just see an
endless stream of log entries like this:
[2018-03-27 10:04:42,084] WARN [Producer clientId=console-producer]
Connection to node -1 could not be established. Broker may not be
available. (org.apache.kafka.clients.NetworkClient)

I would expect the producer to eventually time out, preferably after retrying
the send 3 times.
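
To make the intended settings concrete, here is a minimal Java sketch of the
producer configuration I have in mind. It only builds a java.util.Properties
object and does not talk to Kafka. The class name ProducerSettingsSketch is
hypothetical, and the sketch assumes that the command-line flags above are
meant to end up as request.timeout.ms=2000 and retries=3 in the producer
config; I have not verified that the console producer maps them this way.

```java
import java.util.Properties;

// Hypothetical helper: spells out the producer settings the flags above are
// assumed to be aiming for. It does not create a producer or open a connection.
final class ProducerSettingsSketch {

    static Properties settings() {
        Properties props = new Properties();
        // Same broker address as passed via --broker-list.
        props.setProperty("bootstrap.servers", "localhost:9092");
        // Assumed target of --request-timeout-ms 2000.
        props.setProperty("request.timeout.ms", "2000");
        // Assumed target of retries=3.
        props.setProperty("retries", "3");
        return props;
    }

    public static void main(String[] args) {
        Properties props = settings();
        // Print each key=value pair for inspection.
        for (String key : props.stringPropertyNames()) {
            System.out.println(key + "=" + props.getProperty(key));
        }
    }
}
```

With these values my expectation in a) is a failure after a bounded time,
rather than the endless reconnect warnings.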

*b) No retries if the last broker disappears after successful communication*
- broker is started
- consoleproducer is started with the same config as in a)
- 1 message is sent successfully
- broker is stopped - producer starts to add log entries like
[2018-03-27 10:09:11,597] WARN [Producer clientId=console-producer]
Connection to node 0 could not be established. Broker may not be available.
(org.apache.kafka.clients.NetworkClient)
- I try to send another message
- it times out after ~2 seconds
[2018-03-27 10:09:17,442] ERROR Error when sending message to topic test
with key: null, value: 1 bytes with error:
(org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for
test-0: 2018 ms has passed since batch creation plus linger time
- the producer *does not retry* sending

Some debugging has shown that after the connection is broken the producer
attempts to fetch metadata. As there are no available brokers left, this
is not possible.
In this case it does not retry fetching the metadata but considers
the batch expired.
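
The behavior I observed can be sketched as the simplified model below. This
is not Kafka's actual code: the class BatchExpirySketch and its methods are
hypothetical, and the expiry rule is only inferred from the error message
("... ms has passed since batch creation plus linger time"). I also assume a
linger time of 0 in the example values.

```java
// Simplified, hypothetical model of what I observed. Not Kafka's real code.
final class BatchExpirySketch {

    /** True once more than requestTimeoutMs has passed since creation plus linger. */
    static boolean isExpired(long createdMs, long lingerMs,
                             long requestTimeoutMs, long nowMs) {
        return nowMs - (createdMs + lingerMs) > requestTimeoutMs;
    }

    /**
     * Decides what happens to a batch that is still waiting to be sent.
     * retriesLeft is accepted but deliberately never consulted: in the
     * observed behavior a batch that cannot be sent expires regardless of it.
     */
    static String nextStep(boolean brokerReachable, int retriesLeft,
                           long createdMs, long lingerMs,
                           long requestTimeoutMs, long nowMs) {
        if (brokerReachable) {
            return "SEND";
        }
        if (isExpired(createdMs, lingerMs, requestTimeoutMs, nowMs)) {
            return "EXPIRE"; // surfaces as TimeoutException, no retry
        }
        return "WAIT"; // keep waiting for metadata / a connection
    }

    public static void main(String[] args) {
        // Values from scenario b): 2000 ms request timeout, 2018 ms elapsed.
        System.out.println(nextStep(false, 3, 0L, 0L, 2000L, 2018L));
    }
}
```

In this model the retry count only matters for requests that were actually
sent, which is why retries=3 appears to have no effect in scenario b).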

My expectation would be that the request times out but the producer then retries sending.

I've spent quite some time trying to work out what the correct behavior is
according to KIP-19
<https://cwiki.apache.org/confluence/display/KAFKA/KIP-19+-+Add+a+request+timeout+to+NetworkClient>
and
JIRAs/pull requests related to the topic (e.g. KAFKA-2805
<https://issues.apache.org/jira/browse/KAFKA-2805> and KAFKA-3388
<https://issues.apache.org/jira/browse/KAFKA-3388>) but it's still not 100%
clear to me if the code works as intended.

Could someone help me settle this question?

Many thanks in advance,
Sandor
