[ 
https://issues.apache.org/jira/browse/KAFKA-13635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akhilesh Dubey updated KAFKA-13635:
-----------------------------------
    Description: 
While running Kafka 2.7.1, we saw offsets reset on some consumer groups 
after a disk-full incident (the actual underlying cause was an uncontrolled 
shutdown of Kafka and of the machine).

When the machine and the Kafka brokers were restarted, consumer applications 
received {{Found no committed offset for partition <xyz>}}, which triggered 
an offset reset. Our reset policy was set to earliest, so the consumers 
logged {{Resetting offset for partition <xyz>}}.
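
For context, the reset policy mentioned above is the consumer setting {{auto.offset.reset}}. As a minimal sketch (an illustration, not the configuration used in this report), a consumer can be configured to fail instead of resetting when no committed offset is found:

```
# consumer.properties (illustrative values only)
# With "none", the consumer throws an exception when no committed offset
# exists for a partition, instead of resetting to earliest or latest.
auto.offset.reset=none
```

This only makes the symptom visible on the client side; it does not address the failed load on the broker.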

On further investigation, we noticed that {{GroupMetadataManager}} silently 
handled an offset load failure:
ERROR [GroupMetadataManager brokerId=1] Error loading offsets from 
__consumer_offsets-33 (kafka.coordinator.group.GroupMetadataManager)
org.apache.kafka.common.errors.CorruptRecordException: Record size 0 is less 
than the minimum record overhead (14)
This behaviour is understandable: the uncontrolled shutdown, and possibly 
page cache issues, could have prevented data from being flushed to disk, and 
GroupCoordinator cannot do much if the offsets themselves are missing.

I would like to request a feature that stops progress or retries the load 
if a {{__consumer_offsets}} partition fails to load.
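
To make the request concrete, here is a minimal Python sketch of the retry-then-stop behaviour. It is not Kafka code: {{load_with_retry}}, {{OffsetLoadError}} and the {{load_partition}} callback are hypothetical names used only for illustration, and the retry count and backoff are arbitrary assumptions.

```python
import time


class OffsetLoadError(Exception):
    """Hypothetical error: a partition could not be loaded after all retries."""


def load_with_retry(load_partition, partition, max_attempts=3,
                    backoff_s=1.0, sleep=time.sleep):
    """Call load_partition(partition) until it succeeds or attempts run out.

    On exhaustion, raise instead of continuing with an empty offset cache,
    so callers cannot mistake a failed load for "no committed offsets".
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return load_partition(partition)
        except Exception as exc:  # e.g. a corrupt-record error during load
            last_error = exc
            if attempt < max_attempts:
                # Linear backoff between attempts (an arbitrary choice).
                sleep(backoff_s * attempt)
    raise OffsetLoadError(
        f"giving up on partition {partition} after {max_attempts} attempts"
    ) from last_error
```

The point of the sketch is the final {{raise}}: a load failure is surfaced to the caller rather than logged and ignored, which is what would let the coordinator refuse to serve offsets for the affected partition.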

  was:
While working with 6.1.1, we experienced offset reset on some consumer groups 
after a disk full issue (the actual underlying issue was an uncontrolled kafka 
and a machine shutdown).

When the machine and kafka brokers were restarted, consumer applications 
received a {{Found no committed offset for partition <xyz>}} which triggered 
offset reset which in our case was set to earliest - {{Resetting offset for 
partition <xyz>}}.

On further investigation, we noticed that {{GroupMetadataManager}} silently 
handled an offset load issue. 
ERROR [GroupMetadataManager brokerId=1] Error loading offsets from 
__consumer_offsets-33 (kafka.coordinator.group.GroupMetadataManager)
org.apache.kafka.common.errors.CorruptRecordException: Record size 0 is less 
than the minimum record overhead (14)
There's nothing wrong here as the uncontrolled shutdown and possibly pagecache 
issues could have led to disk flush issues and GroupCoordinator cannot do much 
if the offsets themselves are missing.

I would like to request a feature to stop progress/retry if 
{{__consumer_offsets}} partition fails to load.


> Make Consumer Group Protocol resilient to disk issues with __consumer_offsets 
> ------------------------------------------------------------------------------
>
>                 Key: KAFKA-13635
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13635
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Akhilesh Dubey
>            Priority: Minor
>
> While running Kafka 2.7.1, we saw offsets reset on some consumer groups 
> after a disk-full incident (the actual underlying cause was an uncontrolled 
> shutdown of Kafka and of the machine).
> When the machine and the Kafka brokers were restarted, consumer applications 
> received {{Found no committed offset for partition <xyz>}}, which triggered 
> an offset reset. Our reset policy was set to earliest, so the consumers 
> logged {{Resetting offset for partition <xyz>}}.
> On further investigation, we noticed that {{GroupMetadataManager}} silently 
> handled an offset load failure:
> ERROR [GroupMetadataManager brokerId=1] Error loading offsets from 
> __consumer_offsets-33 (kafka.coordinator.group.GroupMetadataManager)
> org.apache.kafka.common.errors.CorruptRecordException: Record size 0 is less 
> than the minimum record overhead (14)
> This behaviour is understandable: the uncontrolled shutdown, and possibly 
> page cache issues, could have prevented data from being flushed to disk, and 
> GroupCoordinator cannot do much if the offsets themselves are missing.
> I would like to request a feature that stops progress or retries the load 
> if a {{__consumer_offsets}} partition fails to load.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
