Thank you for the quick response! :) I've filed KAFKA-16779 <https://issues.apache.org/jira/browse/KAFKA-16779> to track the issue, with the information you requested. Please let me know if I can provide anything further.
On Tue, May 14, 2024 at 8:28 PM Luke Chen <show...@gmail.com> wrote:

> Hi Nicholas,
>
> I don't know of anything in v3.7.0 that would cause this issue.
> It would be good if you could open a JIRA for it.
> Some info to provide:
> 1. You said "in the past": what version of Kafka were you running then?
> 2. What is your broker configuration?
> 3. KRaft mode? Combined mode (controller + broker on the same node)?
> 4. There's not much info in the gist link. It would be great if you could
> attach the broker logs for investigation.
>
> Thanks.
> Luke
>
>
> On Wed, May 15, 2024 at 2:46 AM Nicholas Feinberg <nicho...@liftoff.io>
> wrote:
>
>> Hello!
>>
>> We recently upgraded our Kafka cluster to 3.7. This cluster's topics are
>> set to four days of retention (345600000 ms).
>>
>> In the past, when we've temporarily lowered retention for ops work, we've
>> seen disk usage return to normal four days later, as expected.
>>
>> [image: image.png]
>>
>> However, after our latest round of ops work, disk usage on most brokers
>> *continued* to grow after those four days passed, despite a *decrease* in
>> incoming data. Usage kept increasing until day six.
>>
>> [image: kafka-ooms.png]
>>
>> On day *six* after the 4d retention was restored, several brokers began
>> to crash with the following error:
>>
>>> # There is insufficient memory for the Java Runtime Environment to
>>> continue.
>>> # Native memory allocation (mmap) failed to map 16384 bytes for
>>> committing reserved memory.
>>
>> (Details:
>> https://gist.github.com/PleasingFungus/3e0cf6b58a4f3eee2171ff91b1aff42a)
>>
>> These hosts had ~170GiB of free memory available. We saw no signs of
>> pressure on either system or JVM heap memory before or after they
>> reported this error. Committed memory was around 10%, so this doesn't
>> appear to be an overcommit issue.
>>
>> The hosts which crashed in this fashion freed large amounts of disk space
>> after they came back up, returning them to the usage we'd expect.
>>
>> Manually restarting Kafka on a broker likewise caused its disk usage to
>> drop to the 4d retention level.
>>
>> Other brokers' disk usage seems to have stabilized.
>>
>> I've spent some time searching Jira and other posts for reports of this
>> behavior, but have come up empty.
>>
>> *Questions*:
>>
>> - Has anyone else seen an issue similar to this?
>> - What are some ways we could confirm whether Kafka is failing to clear
>> expired logs from disk?
>> - What could cause the mmap failures we saw?
>> - Would it be helpful for us to file a Jira issue or issues for this,
>> and if so, what details should we include?
>>
>> Cheers,
>> Nicholas Feinberg
>>
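Note: for the question above about confirming whether expired segments are actually being removed from disk, one possible check (a minimal sketch only, not the reporter's tooling) is to read per-partition on-disk sizes through Kafka's Java Admin API and compare them against what the 4-day retention window should allow. This assumes Kafka 2.7+ for the allDescriptions() call; the bootstrap address and broker IDs below are placeholders.

    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.LogDirDescription;
    import org.apache.kafka.clients.admin.ReplicaInfo;
    import org.apache.kafka.common.TopicPartition;

    import java.util.Arrays;
    import java.util.Map;
    import java.util.Properties;

    // Sketch: print per-partition on-disk size for each broker's log dirs.
    // Partitions whose size keeps growing past the retention window, despite
    // steady or falling inbound traffic, would point at deletion not running.
    public class LogDirUsageCheck {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Placeholder bootstrap address.
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
            try (Admin admin = Admin.create(props)) {
                // Placeholder broker IDs to inspect.
                Map<Integer, Map<String, LogDirDescription>> byBroker =
                        admin.describeLogDirs(Arrays.asList(1, 2, 3)).allDescriptions().get();
                for (Map.Entry<Integer, Map<String, LogDirDescription>> broker : byBroker.entrySet()) {
                    for (Map.Entry<String, LogDirDescription> dir : broker.getValue().entrySet()) {
                        for (Map.Entry<TopicPartition, ReplicaInfo> replica :
                                dir.getValue().replicaInfos().entrySet()) {
                            System.out.printf("broker=%d dir=%s %s sizeBytes=%d%n",
                                    broker.getKey(), dir.getKey(),
                                    replica.getKey(), replica.getValue().size());
                        }
                    }
                }
            }
        }
    }

Comparing these sizes over time, or against the expected bytes-in rate multiplied by retention.ms, would show whether segment deletion is lagging on the affected brokers.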