Hi All,

I have a question about the intended producer behavior when the broker is lost: am I seeing a bug, or does the code work as specified?
What I do and see using trunk:

*a) No message send timeout at all if there is no available broker*

- No broker is started.
- The console producer is started using --broker-list localhost:9092 --topic test --request-timeout-ms 2000 --property retries=3
- I try to send some messages. The request doesn't time out; I just see an endless stream of log entries like this:

[2018-03-27 10:04:42,084] WARN [Producer clientId=console-producer] Connection to node -1 could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)

I would expect the producer to time out eventually, preferably after retrying the send 3 times.

*b) No retries if the last broker disappears after successful communication*

- The broker is started.
- The console producer is started with the same config as in a).
- 1 message is sent successfully.
- The broker is stopped.
- The producer starts to add log entries like:

[2018-03-27 10:09:11,597] WARN [Producer clientId=console-producer] Connection to node 0 could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)

- I try to send another message.
- It times out after ~2 seconds:

[2018-03-27 10:09:17,442] ERROR Error when sending message to topic test with key: null, value: 1 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for test-0: 2018 ms has passed since batch creation plus linger time

- The producer *does not retry* sending.

Some debugging has shown that after the connection is broken, the producer attempts to fetch metadata. As there are no more available brokers, this is not possible. In this case the producer does not retry fetching the metadata; instead it considers the batch expired. My expectation would be that the request times out, but the producer then retries sending.
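To make my reading of case b) concrete, here is a minimal sketch of how I understand the expiry decision. It is a simplified model, not the actual producer code: the class and method names are made up for illustration, the zero linger time is an assumption, and the rule itself (a batch that was never sent expires once its age minus linger exceeds request.timeout.ms, without the retries setting being consulted) is only my inference from the error message above.

```java
// Simplified model of the behavior observed in case b).
// NOT the actual Kafka producer code; all names here are hypothetical.
final class BatchExpirySketch {

    // Time the batch has waited beyond its linger period.
    static long waitedBeyondLingerMs(long createdMs, long nowMs, long lingerMs) {
        return (nowMs - createdMs) - lingerMs;
    }

    // Assumed rule: an unsent batch is expired once that wait exceeds
    // request.timeout.ms. The configured retries value plays no part here.
    static boolean isExpired(long createdMs, long nowMs, long lingerMs, long requestTimeoutMs) {
        return waitedBeyondLingerMs(createdMs, nowMs, lingerMs) > requestTimeoutMs;
    }

    public static void main(String[] args) {
        long requestTimeoutMs = 2000L; // from --request-timeout-ms 2000
        long lingerMs = 0L;            // assumption made for this sketch only
        long createdMs = 0L;           // batch creation time
        long nowMs = 2018L;            // elapsed time taken from the log line

        boolean expired = isExpired(createdMs, nowMs, lingerMs, requestTimeoutMs);
        System.out.println("expired=" + expired + " after "
                + waitedBeyondLingerMs(createdMs, nowMs, lingerMs) + " ms");
    }
}
```

If this model is right, retries=3 can never take effect in case b), because the batch is dropped before a first send attempt is made. That is exactly the point I would like to have confirmed or corrected.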
I've spent quite some time trying to work out the correct behavior according to KIP-19 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-19+-+Add+a+request+timeout+to+NetworkClient> and the JIRA issues and pull requests related to the topic (e.g. KAFKA-2805 <https://issues.apache.org/jira/browse/KAFKA-2805> and KAFKA-3388 <https://issues.apache.org/jira/browse/KAFKA-3388>), but it's still not 100% clear to me whether the code works as intended. Could someone help me settle this question?

Many thanks in advance,
Sandor