Thank you for the quick response! :) I've filed KAFKA-16779 <https://issues.apache.org/jira/browse/KAFKA-16779> to track the issue, with the information you requested. Please let me know if I can provide anything further.
On Tue, May 14, 2024 at 8:28 PM Luke Chen <show...@gmail.com> wrote:

> Hi Nicholas,
>
> I don't know of anything in v3.7.0 that would cause this issue.
> It would be good if you could open a JIRA for it.
> Some info to provide:
> 1. You said "in the past": what version of Kafka were you running then?
> 2. What is your broker configuration?
> 3. KRaft mode? Combined mode (controller + broker on the same node)?
> 4. There's not much info in the gist link. It would be great if you could
> attach the broker logs for investigation.
>
> Thanks.
> Luke
>
>
> On Wed, May 15, 2024 at 2:46 AM Nicholas Feinberg <nicho...@liftoff.io>
> wrote:
>
>> Hello!
>>
>> We recently upgraded our Kafka cluster to 3.7. This cluster's topics are
>> set to four days of retention (345600000 ms).
>>
>> In the past, when we've temporarily lowered retention for ops work, we've
>> seen disk usage return to normal four days later, as expected.
>>
>> [image: image.png]
>>
>> However, after our latest round of ops work, disk usage on most brokers
>> *continued* to grow after those four days passed, despite a *decrease* in
>> incoming data. Usage kept increasing until day six.
>>
>> [image: kafka-ooms.png]
>>
>> On day *six* after the 4d retention was restored, several brokers began
>> to crash with the following error:
>>
>>> # There is insufficient memory for the Java Runtime Environment to
>>> continue.
>>> # Native memory allocation (mmap) failed to map 16384 bytes for
>>> committing reserved memory.
>>
>> (Details:
>> https://gist.github.com/PleasingFungus/3e0cf6b58a4f3eee2171ff91b1aff42a)
>>
>> These hosts had ~170GiB of free memory available. We saw no signs of
>> pressure on either system or JVM heap memory before or after they
>> reported this error. Committed memory was around 10%, so this doesn't
>> appear to be an overcommit issue.
>>
>> The hosts which crashed in this fashion freed large amounts of disk space
>> after they came back up, returning them to the usage we'd expect.
>>
>> Manually restarting Kafka on a broker likewise caused its disk usage to
>> drop to the 4d retention level.
>>
>> Other brokers' disk usage seems to have stabilized.
>>
>> I've spent some time searching Jira and other posts for reports of this
>> behavior, but have come up empty.
>>
>> *Questions*:
>>
>> - Has anyone else seen an issue similar to this?
>> - What are some ways we could confirm whether Kafka is failing to clear
>> expired logs from disk?
>> - What could cause the mmap failures we saw?
>> - Would it be helpful for us to file a Jira issue or issues for this,
>> and if so, what details should we include?
>>
>> Cheers,
>> Nicholas Feinberg
>>
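Note: for the question above about confirming whether expired segments are actually being removed from disk, one possible check (a minimal sketch only, not the reporter's tooling) is to read per-partition on-disk sizes through Kafka's Java Admin API and compare them against what the 4-day retention window should allow. This assumes Kafka 2.7+ for the allDescriptions() call; the bootstrap address and broker IDs below are placeholders.

    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.LogDirDescription;
    import org.apache.kafka.clients.admin.ReplicaInfo;
    import org.apache.kafka.common.TopicPartition;

    import java.util.Arrays;
    import java.util.Map;
    import java.util.Properties;

    // Sketch: print per-partition on-disk size for each broker's log dirs.
    // Partitions whose size keeps growing past the retention window, despite
    // steady or falling inbound traffic, would point at deletion not running.
    public class LogDirUsageCheck {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Placeholder bootstrap address.
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
            try (Admin admin = Admin.create(props)) {
                // Placeholder broker IDs to inspect.
                Map<Integer, Map<String, LogDirDescription>> byBroker =
                        admin.describeLogDirs(Arrays.asList(1, 2, 3)).allDescriptions().get();
                for (Map.Entry<Integer, Map<String, LogDirDescription>> broker : byBroker.entrySet()) {
                    for (Map.Entry<String, LogDirDescription> dir : broker.getValue().entrySet()) {
                        for (Map.Entry<TopicPartition, ReplicaInfo> replica :
                                dir.getValue().replicaInfos().entrySet()) {
                            System.out.printf("broker=%d dir=%s %s sizeBytes=%d%n",
                                    broker.getKey(), dir.getKey(),
                                    replica.getKey(), replica.getValue().size());
                        }
                    }
                }
            }
        }
    }

Comparing these sizes over time, or against the expected bytes-in rate multiplied by retention.ms, would show whether segment deletion is lagging on the affected brokers.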