The exception seems to be thrown here:
https://github.com/apache/kafka/blob/0.9.0/clients/src/main/java/org/apache/kafka/common/record/MemoryRecords.java#L236

Is this not expected to hit often?
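For context, my reading of the code around that line is that the records iterator discovers the end of a fetch buffer by catching an EOFException thrown from the underlying stream, roughly like the sketch below (class and method names are illustrative, not the actual Kafka internals):

    import java.io.DataInputStream;
    import java.io.EOFException;
    import java.io.IOException;

    // Illustrative sketch of the "EOFException signals end of buffer" pattern.
    final class RecordStreamSketch {

        // Returns the size of the next record, or -1 once the buffer is
        // exhausted. The end of the buffer is only detected by an
        // EOFException being constructed and thrown internally.
        static int nextRecordSize(DataInputStream in) throws IOException {
            try {
                in.readLong();       // record offset
                return in.readInt(); // record size
            } catch (EOFException e) {
                return -1;           // end of this buffer
            }
        }
    }

If that is the path being taken, every drained buffer pays the cost of building and throwing an exception, which would line up with the exception counts reported below.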
On Mon, Jan 25, 2016 at 9:22 PM, Rajiv Kurian <ra...@signalfx.com> wrote:

> Wanted to add that we are not using auto commit since we use custom
> partition assignments. In fact we never call consumer.commitAsync() or
> consumer.commitSync(). My assumption is that since we store our own
> offsets these calls are not necessary. Hopefully this is not responsible
> for the poor performance.
>
> On Mon, Jan 25, 2016 at 9:20 PM, Rajiv Kurian <ra...@signalfx.com> wrote:
>
>> We are using the new Kafka consumer with the following config (as logged
>> by Kafka):
>>
>> metric.reporters = []
>> metadata.max.age.ms = 300000
>> value.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
>> group.id = myGroup.id
>> partition.assignment.strategy = [org.apache.kafka.clients.consumer.RangeAssignor]
>> reconnect.backoff.ms = 50
>> sasl.kerberos.ticket.renew.window.factor = 0.8
>> max.partition.fetch.bytes = 2097152
>> bootstrap.servers = [myBrokerList]
>> retry.backoff.ms = 100
>> sasl.kerberos.kinit.cmd = /usr/bin/kinit
>> sasl.kerberos.service.name = null
>> sasl.kerberos.ticket.renew.jitter = 0.05
>> ssl.keystore.type = JKS
>> ssl.trustmanager.algorithm = PKIX
>> enable.auto.commit = false
>> ssl.key.password = null
>> fetch.max.wait.ms = 1000
>> sasl.kerberos.min.time.before.relogin = 60000
>> connections.max.idle.ms = 540000
>> ssl.truststore.password = null
>> session.timeout.ms = 30000
>> metrics.num.samples = 2
>> client.id =
>> ssl.endpoint.identification.algorithm = null
>> key.deserializer = class sf.kafka.VoidDeserializer
>> ssl.protocol = TLS
>> check.crcs = true
>> request.timeout.ms = 40000
>> ssl.provider = null
>> ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
>> ssl.keystore.location = null
>> heartbeat.interval.ms = 3000
>> auto.commit.interval.ms = 5000
>> receive.buffer.bytes = 32768
>> ssl.cipher.suites = null
>> ssl.truststore.type = JKS
>> security.protocol = PLAINTEXT
>> ssl.truststore.location = null
>> ssl.keystore.password = null
>> ssl.keymanager.algorithm = SunX509
>> metrics.sample.window.ms = 30000
>> fetch.min.bytes = 512
>> send.buffer.bytes = 131072
>> auto.offset.reset = earliest
>>
>> We use the consumer.assign() feature to assign a list of partitions and
>> call poll in a loop (a rough sketch of this setup follows the list below).
>> We have the following setup:
>>
>> 1. The messages have no key and we use the byte array deserializer for
>> the values.
>>
>> 2. The messages themselves are on average about 75 bytes. We get this
>> number by dividing the Kafka broker bytes-in metric by the messages-in
>> metric.
>>
>> 3. Each consumer is assigned about 64 partitions of the same topic spread
>> across three brokers.
>>
>> 4. We get very few messages per second, maybe around 1-2 messages across
>> all partitions on a client right now.
>>
>> 5. We have no compression on the topic.
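>> For concreteness, the assignment setup looks roughly like this; the topic
>> name, partition count, and offset store are placeholders rather than our
>> real code:
>>
>>     import java.util.ArrayList;
>>     import java.util.List;
>>
>>     import org.apache.kafka.clients.consumer.KafkaConsumer;
>>     import org.apache.kafka.common.TopicPartition;
>>
>>     class AssignmentSetup {
>>         // Placeholder for our own offset storage.
>>         interface OffsetStore { long lastProcessedOffset(TopicPartition tp); }
>>
>>         static void assignAndSeek(KafkaConsumer<Void, byte[]> consumer, OffsetStore offsets) {
>>             // Assign all 64 partitions of the topic explicitly (no group management).
>>             List<TopicPartition> partitions = new ArrayList<>();
>>             for (int p = 0; p < 64; p++) {
>>                 partitions.add(new TopicPartition("my-topic", p));
>>             }
>>             consumer.assign(partitions);
>>
>>             // We store offsets ourselves and never commit, so we seek each
>>             // partition to just past the last offset we processed.
>>             for (TopicPartition tp : partitions) {
>>                 consumer.seek(tp, offsets.lastProcessedOffset(tp) + 1);
>>             }
>>         }
>>     }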
>> Our run loop looks something like this:
>>
>>     while (isRunning()) {
>>         ConsumerRecords<Void, byte[]> records = null;
>>         try {
>>             // Here timeout is about 10 seconds, so it is pretty big.
>>             records = consumer.poll(timeout);
>>         } catch (Exception e) {
>>             logger.error("Exception polling Kafka ", e);
>>             records = null;
>>         }
>>         if (records != null) {
>>             for (ConsumerRecord<Void, byte[]> record : records) {
>>                 // The handler puts the byte array on a very fast ring
>>                 // buffer, so it barely takes any time.
>>                 handler.handleMessage(ByteBuffer.wrap(record.value()));
>>             }
>>         }
>>     }
>>
>> With this setup our performance has taken a horrendous hit as soon as we
>> started this one thread that just polls Kafka in a loop.
>>
>> I profiled the application using Java Mission Control and have a few
>> insights.
>>
>> 1. There doesn't seem to be a single hotspot. The consumer just ends up
>> using a lot of CPU for handling such a low number of messages. Our process
>> was using 16% CPU before we added a single consumer and it went to 25% and
>> above after. That's an increase of over 50% from a single consumer getting
>> a single-digit number of small messages per second. Here is an attachment
>> of the CPU usage breakdown in the consumer (the namespace is different
>> because we shade the Kafka jar before using it): http://imgur.com/tHjdVnM
>> We've used bigger timeouts (around 100 seconds) and that doesn't seem to
>> make much of a difference either.
>>
>> 2. It also seems like Kafka throws a ton of EOFExceptions. I am not sure
>> whether this is expected, but it seems like it would completely kill
>> performance. Here is the exception tab of Java Mission Control:
>> http://imgur.com/X3KSn37 That is 1.8 million exceptions over a period of
>> 3 minutes, which is about 10 thousand exceptions per second! The exception
>> stack trace shows that it originates from the poll call. I don't understand
>> how it can throw so many exceptions given that I call poll with a timeout
>> of 10 seconds and get messages at about 1 per second.
>>
>> 3. The single thread seems to allocate a lot too. It is responsible for
>> 17.87% of our entire JVM allocation rate. Most of what it allocates seems
>> to be those same EOFExceptions. Here is a chart showing the single
>> thread's allocation proportion: http://imgur.com/GNUJQsz Here is a chart
>> that shows a breakdown of the allocations: http://imgur.com/YjCXljE About
>> 20% of the allocations are for the EOFExceptions. This seems kind of
>> crazy, especially given that this happens about 10 thousand times a
>> second. The rest of the allocations seem to be spread all over but again
>> seem excessive given how few messages we are getting.
>>
>> As a comparison, we also run a wrapper over the old SimpleConsumer that
>> gets a lot more data (10-15 thousand 70-byte messages/sec on a different
>> topic) and it is able to handle that load without much trouble. At this
>> moment we are completely puzzled by this performance. Most of it does seem
>> to be due to the crazy volume of exceptions. Note: our messages all seem
>> to be making it through. The exceptions are caught by Kafka's stack and
>> never bubble through to us.
>>
>> Are we doing anything wrong with how we are using the new consumer
>> (longer timeouts of around 100 seconds don't seem to help)?
>>
>> Thanks in advance,
>>
>> Rajiv