Hi
I have observed a peculiar scenario in our production environment in which
the mapper task for one particular topic-partition combination always fails
with the exception 'Task attempt failed to report status for 600 seconds'.
When I dug deeper, I found that it gets stuck in either the fetch() method
or the getNext method of KafkaReader.
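
One way to confirm which call blocks is to run it under a deadline and log when the deadline passes. The following is only a minimal sketch using java.util.concurrent: FetchWatchdog and callWithTimeout are hypothetical names, not part of Camus or Kafka, and the Callable stands in for whatever wraps the reader's fetch()/getNext call.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical diagnostic helper; not part of Camus or Kafka.
final class FetchWatchdog {
    private FetchWatchdog() {}

    // Runs `call` on a daemon worker thread and waits at most timeoutMs.
    // Returns the call's result, or null if the deadline passed first
    // (so a call that legitimately returns null is indistinguishable here).
    static <T> T callWithTimeout(Callable<T> call, long timeoutMs) {
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "fetch-watchdog");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<T> future = pool.submit(call);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the worker; a blocking socket read may ignore this.
            future.cancel(true);
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } catch (ExecutionException e) {
            throw new IllegalStateException("call failed", e.getCause());
        } finally {
            pool.shutdownNow();
        }
    }
}
```

On a null result the caller could log the topic, partition and offset being fetched, which would show whether the hang always happens at the same offset.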
Things I have tried:
--------------------
1. Checked the network and the /etc/hosts entries. They are fine.
2. The machine that hosts this partition also hosts other partitions, and
those are read without any problem. So the issue is not specific to the
machine or the network.
3. Tried increasing the timeout parameters and changing the buffering
parameters.
4. The records are zlib compressed. I tried the Kafka console consumer, but
could not verify anything with it because the data volume was too large.
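
Point 4 could also be tested in isolation by taking a single raw payload from the partition and checking that it inflates cleanly. The following is only a minimal sketch with java.util.zip: ZlibCheck and inflatedSize are hypothetical names, and it assumes the payload is a plain zlib stream with no extra framing around it.

```java
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

// Hypothetical helper for checking one compressed record by hand.
final class ZlibCheck {
    private ZlibCheck() {}

    // Returns the inflated size in bytes, or -1 if the payload is not a
    // complete, valid zlib stream.
    static long inflatedSize(byte[] payload) {
        Inflater inflater = new Inflater();
        inflater.setInput(payload);
        byte[] buf = new byte[8192];
        long total = 0;
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    return -1; // truncated stream, or a preset dictionary is required
                }
                total += n;
            }
            return total;
        } catch (DataFormatException e) {
            return -1; // corrupt header or data
        } finally {
            inflater.end();
        }
    }
}
```

A result of -1 for a record near the offset where the task hangs would point at the data rather than at the broker or the network.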
Here are the relevant configs:
------------------------------
kafka.client.name=camus1
# Fetch Request Parameters
kafka.fetch.buffer.size=1048576
#kafka.fetch.request.correlationid=
kafka.fetch.request.max.wait=100000
#kafka.fetch.request.min.bytes=
socket.receive.buffer.bytes=1048576
fetch.message.max.bytes=10485760
# Connection parameters.
kafka.brokers=<list of ips>
kafka.timeout.value=30000
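
Two relationships in these values may be worth checking. kafka.fetch.request.max.wait (100000) is larger than kafka.timeout.value (30000), and fetch.message.max.bytes (10485760) is larger than kafka.fetch.buffer.size (1048576). This assumes that kafka.timeout.value is the socket timeout and kafka.fetch.buffer.size is the per-request fetch size, which is my reading of the names and not something confirmed above. If that holds, a broker may wait longer than the client socket allows, and a message bigger than the fetch buffer may never be returned whole. The following is only a minimal sketch that flags both cases; FetchConfigCheck is a hypothetical name.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical sanity check; the meaning of each key is an assumption.
final class FetchConfigCheck {
    private FetchConfigCheck() {}

    // Parses properties text; lines starting with '#' are comments.
    static Properties parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return p;
    }

    // Returns one warning per suspicious relationship; missing keys are skipped.
    static List<String> warnings(Properties p) {
        List<String> out = new ArrayList<>();
        long maxWait = longOf(p, "kafka.fetch.request.max.wait");
        long timeout = longOf(p, "kafka.timeout.value");
        long fetchBuf = longOf(p, "kafka.fetch.buffer.size");
        long maxMsg = longOf(p, "fetch.message.max.bytes");
        if (maxWait >= 0 && timeout >= 0 && maxWait >= timeout) {
            out.add("fetch max wait " + maxWait + " >= socket timeout " + timeout);
        }
        if (maxMsg >= 0 && fetchBuf >= 0 && maxMsg > fetchBuf) {
            out.add("max message size " + maxMsg + " > fetch buffer " + fetchBuf);
        }
        return out;
    }

    // Returns the numeric value of a key, or -1 if the key is absent.
    private static long longOf(Properties p, String key) {
        String v = p.getProperty(key);
        return v == null ? -1 : Long.parseLong(v.trim());
    }
}
```

Calling FetchConfigCheck.warnings(FetchConfigCheck.parse(text)) on the properties text above reports both relationships.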