[ https://issues.apache.org/jira/browse/KAFKA-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jun Rao updated KAFKA-1063:
---------------------------

    Fix Version/s:     (was: 0.8.1)
                   0.9.0

> run log cleanup at startup
> --------------------------
>
>                 Key: KAFKA-1063
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1063
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.8.0
>            Reporter: paul mackles
>            Assignee: Neha Narkhede
>            Priority: Minor
>             Fix For: 0.9.0
>
> Jun suggested I file this ticket to have the brokers run log cleanup at startup. Here is the scenario that precipitated it:
> We ran into a situation on our dev cluster (3 nodes, v0.8) where we ran out of disk on one of the nodes. As expected, the broker shut itself down and all of the clients switched over to the other nodes. So far so good.
> To free up disk space, I reduced log.retention.hours to something more manageable (from 172 to 12). I did this on all 3 nodes. Since the other 2 nodes were running OK, I first tried to restart the node that had run out of disk. Unfortunately, it kept shutting itself down because of the full disk. From the logs, I think this was because it was trying to sync up the replicas it was responsible for and of course couldn't due to the lack of disk space. My hope was that upon restart it would see the new retention settings and free up a bunch of disk space before trying to do any syncs.
> I then went and restarted the other 2 nodes. They both picked up the new retention settings and freed up a bunch of storage as a result. I then went back and tried to restart the 3rd node, but to no avail. It still had problems with the full disks.
> I thought about reassigning partitions so that the node in question had less to manage, but that turned out to be a hassle, so I wound up manually deleting some of the old log/segment files. The broker seemed to come back fine after that, but that's not something I would want to do on a production server.
> We obviously need better monitoring/alerting to avoid this situation altogether, but I am wondering if the order of operations at startup could/should be changed to better account for scenarios like this. Or maybe a utility to remove old logs after changing the ttl? Did I miss a better way to handle this?
> Original email thread is here:
> http://mail-archives.apache.org/mod_mbox/kafka-users/201309.mbox/%3cce6365ae.82d66%25pmack...@adobe.com%3e

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
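
As a rough illustration of the "utility to remove old logs after changing the ttl" idea in the report, the sketch below shows what such a standalone tool could look like. It is hypothetical and not part of Kafka: the object name StaleSegmentSweeper and its arguments are made up, and it simply walks one configured log directory and deletes segment data files whose last-modified time falls outside the new retention window, roughly mirroring the time-based retention check the broker applies in 0.8. It would only be safe to run while the broker is stopped, and it deliberately keeps the newest segment of each partition so the broker can resume appending after restart.

    import java.io.File

    // Hypothetical standalone sweeper: delete segment files older than the given
    // retention window. Not an official Kafka tool; run only while the broker is
    // down, and keep the newest segment of each partition.
    object StaleSegmentSweeper {
      def main(args: Array[String]): Unit = {
        val logDir = new File(args(0))      // a directory listed in log.dirs, e.g. /var/kafka/data
        val retentionHours = args(1).toLong // the new retention, e.g. 12 to match log.retention.hours=12
        val cutoff = System.currentTimeMillis - retentionHours * 60L * 60L * 1000L

        val partitionDirs = Option(logDir.listFiles).getOrElse(Array.empty[File]).filter(_.isDirectory)
        for (dir <- partitionDirs) {
          // Segment data files, oldest first; drop the newest so the active segment survives.
          val candidates = Option(dir.listFiles).getOrElse(Array.empty[File])
            .filter(_.getName.endsWith(".log"))
            .sortBy(_.lastModified)
            .dropRight(1)
          for (seg <- candidates if seg.lastModified < cutoff) {
            val index = new File(dir, seg.getName.stripSuffix(".log") + ".index")
            println("deleting " + seg.getPath)
            seg.delete()
            if (index.exists) index.delete()
          }
        }
      }
    }

The report's other suggestion, changing the startup order so that a retention pass runs before the broker begins syncing replicas, would have to be made inside the broker itself rather than in an external tool like the one sketched here.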