[ https://issues.apache.org/jira/browse/KAFKA-4885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15926571#comment-15926571 ]
Guozhang Wang commented on KAFKA-4885: -------------------------------------- [~damianguy] Well, as for the general solution in terms of error handling. I think there is a difference between "streams app has an issue, e.g. divide by zero, serde exception" and "broker cluster has an issue, and no data can be produced / fetched". In the former case I agree that we should usually fail fast; whereas for the latter case I'm not sure if we want Streams app to stop and die whenever there is a (temporary?) broker unavailability. Imagine cross-DC replication / MirrorMaker cases, we usually want such services to keep alive and idle when brokers are unavailable rather than logging a fatal error and shutdown itself. As for the system test itself, I agree that setting a exception handler in streams would be better than setting num.retries to infinity. > processstreamwithcachedstatestore and other streams benchmarks fail > occasionally > --------------------------------------------------------------------------------- > > Key: KAFKA-4885 > URL: https://issues.apache.org/jira/browse/KAFKA-4885 > Project: Kafka > Issue Type: Bug > Components: streams > Affects Versions: 0.10.2.0 > Reporter: Eno Thereska > Fix For: 0.11.0.0 > > > test_id: > kafkatest.benchmarks.streams.streams_simple_benchmark_test.StreamsSimpleBenchmarkTest.test_simple_benchmark.test=processstreamwithcachedstatestore.scale=2 > status: FAIL > run time: 14 minutes 58.069 seconds > Streams Test process on ubuntu@worker5 took too long to exit > Traceback (most recent call last): > File > "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py", > line 123, in run > data = self.run_test() > File > "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py", > line 176, in run_test > return self.test_context.function(self.test) > File > "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/mark/_mark.py", > line 321, in wrapper > return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs) > File > "/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/benchmarks/streams/streams_simple_benchmark_test.py", > line 86, in test_simple_benchmark > self.driver[num].wait() > File > "/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/services/streams.py", > line 102, in wait > self.wait_node(node, timeout_sec) > File > "/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/services/streams.py", > line 106, in wait_node > wait_until(lambda: not node.account.alive(pid), timeout_sec=timeout_sec, > err_msg="Streams Test process on " + str(node.account) + " took too long to > exit") > File > "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/utils/util.py", > line 36, in wait_until > raise TimeoutError(err_msg) > TimeoutError: Streams Test process on ubuntu@worker5 took too long to exit > The log contains several lines like: > [2017-03-11 04:52:59,080] DEBUG Attempt to heartbeat failed for group > simple-benchmark-streams-with-storetrue since it is rebalancing. > (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) > [2017-03-11 04:53:01,987] DEBUG Sending Heartbeat request for group > simple-benchmark-streams-with-storetrue to coordinator worker10:9092 (id: > 2147483646 rack: null) > (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) > [2017-03-11 04:53:02,088] DEBUG Attempt to heartbeat failed for group > simple-benchmark-streams-with-storetrue since it is rebalancing. > (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) > [2017-03-11 04:53:04,995] DEBUG Sending Heartbeat request for group > simple-benchmark-streams-with-storetrue to coordinator worker10:9092 (id: > 2147483646 rack: null) > (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) > Other tests that fail the same way include: > test_id: > kafkatest.benchmarks.streams.streams_simple_benchmark_test.StreamsSimpleBenchmarkTest.test_simple_benchmark.test=count.scale=2 > test_id: > kafkatest.benchmarks.streams.streams_simple_benchmark_test.StreamsSimpleBenchmarkTest.test_simple_benchmark.test=processstreamwithsink.scale=1 > test_id: > kafkatest.tests.streams.streams_bounce_test.StreamsBounceTest.test_bounce -- This message was sent by Atlassian JIRA (v6.3.15#6346)