ben fleis created KAFKA-706:
-------------------------------

             Summary: broker appears to be encoding ProduceResponse, but never 
sending it
                 Key: KAFKA-706
                 URL: https://issues.apache.org/jira/browse/KAFKA-706
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 0.8
         Environment: reproduced on both Mac OS and RH linux, via private 
node.js client
            Reporter: ben fleis


By all appearances, I seem to be able to convince a broker to periodically 
encode, but never transmit, a ProduceResponse.  Unfortunately my client is 
proprietary, but I will share it with Neha via LI channels.  But I will 
describe what's going on in the hopes that there's another trivial way to 
reproduce it.  (I did search through JIRA, and haven't found anything that 
looks like this.)

I am running a single instance zookeeper and single broker.  I have a client 
that generates configurable amounts of data, tracking what is produced (both 
sent and ACK'd), and what is consumed.  I was noticing that when using high 
transfer rates via high frequency single messages, my unack'd queue appeared to 
be getting continuously larger.  So, I outfitted my client to log more 
information about correlation ids at various stages, and modified the kafka 
ProducerRequest/ProducerResponse to log (de)serialization of the same.  I then 
used tcpdump to intercept all communications between my client and the broker.  
Finally, I configured my client to generate 1 message per ~10ms, each payload 
being approximately 33 bytes; requestAckTimeout was set to 2000ms, and 
requestAcksRequired was set to 1.  I used 10ms as I found that 5ms or less 
caused my unacked queue to build up due to system speed -- it simply couldn't 
keep up.  10ms keeps the load high, but just manageable.  YMMV with that param. 
 All of this is done on a single host, over loopback.  I ran it on both my 
airbook, and a well setup RH linux box, and found the same problem.

At startup, my system logged "expired" requests - meaning reqs that were sent, 
but for which no ACK, positive or negative, was seen from the broker, within 
1.25x the requestAckTimeout (ie, 2500ms).  I would let it settle until the 
unacked queue was stable at or around 0.

What I found is this: ACKs are normally generated within milliseconds.  This 
was demonstrated by my logging added to the scala ProducerRe* classes, and they 
are normally seen quickly by my client.  But when the actual error occurs, 
namely that a request is ignored, the ProducerResponse class *does* encode the 
correct correlationId; however, a response containing that ID is never sent 
over the network, as evidenced by my tcpdump traces.  In my experience this 
would take anywhere from 3-15 seconds to occur after the system was warm, 
meaning that it's 1 out of several hundred on average that shows the condition.

While I can't attach my client code, I could attach logs; but since my 
intention is to share the code with LI people, I will wait to see if that's 
useful here.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to