Hello Kafka users! I'm posting because we are having an issue with topic retention that we don't understand, and we need input on what could be wrong on our end if it is indeed a configuration issue.

First, we are running a containerized version of Kafka with the following environment variables:

CONFLUENT_MINOR_VERSION=3
KAFKA_LOG_CLEANUP_POLICY=compact
KAFKA_LOG4J_LOGGERS=kafka.controller=ERROR
CONFLUENT_DEB_VERSION=1
KAFKA_VERSION=5.3.1
CONFLUENT_PATCH_VERSION=1
CONFLUENT_VERSION=5.3.1
KAFKA_LOG_RETENTION_BYTES=-1
PYTHON_VERSION=2.7.9-1
SCALA_VERSION=2.12
CONFLUENT_MAJOR_VERSION=5
KAFKA_NUM_PARTITIONS=4
ZULU_OPENJDK_VERSION=8=8.38.0.13
KAFKA_DEFAULT_REPLICATION_FACTOR=6
KAFKA_LOG_RETENTION_HOURS=2147483647

I'm listing these variables because they might help someone with a keen eye spot something we are missing. Based on this, we should have a global log cleanup/retention configuration of:

KAFKA_LOG_CLEANUP_POLICY=compact
KAFKA_LOG_RETENTION_HOURS=2147483647 (effectively unlimited; roughly 245,000 years)
KAFKA_LOG_RETENTION_BYTES=-1 (unlimited)

We currently have a topic called "ETD-TEST", shown below:

kafka-topics --describe --zookeeper $PRD:2181 --topic ETD-TEST
Topic: ETD-TEST    PartitionCount: 4    ReplicationFactor: 6    Configs: cleanup.policy=delete,retention.ms=1210000000
    Topic: ETD-TEST    Partition: 0    Leader: 1    Replicas: 1,4,5,2,3,6    Isr: 5,1,6,2,3,4
    Topic: ETD-TEST    Partition: 1    Leader: 1    Replicas: 1,2,3,6,5,4    Isr: 5,1,6,2,3,4
    Topic: ETD-TEST    Partition: 2    Leader: 4    Replicas: 4,5,1,6,2,3    Isr: 5,1,6,2,3,4
    Topic: ETD-TEST    Partition: 3    Leader: 1    Replicas: 1,3,6,4,2,5    Isr: 5,1,6,2,3,4

Notice it has a cleanup.policy of delete, which should override the global setting of compact, right?

The issue is that ETD-TEST has a topic-level retention.ms of 1210000000 (14 days), but the topic contains messages dating much further back, somewhere around 8 months. We only ran into this, and were surprised, when we started consuming the topic and saw close to 2 million messages.

From the reading and research I've done over the last few days, retention.ms isn't necessarily a limit on how old messages/events in the topic can be, but rather how long the segment files in the topic are kept around before they get rotated out. (Please correct me if this is false.)

I also learned that segment rotation is driven by both a size and a duration setting. The defaults are 1 GB for segment size (if the active segment reaches this size it gets rolled, much like log rotation), and segment.ms determines how long a segment can stay active before it is rolled regardless of size, with a default of 7 days. So if a segment never reaches the size limit, it will eventually hit the 7-day segment.ms and be rolled anyway.

The weird thing is that this seems to affect only this one particular topic. I created a separate topic with similar retention/cleanup policies, flooded it with data, and it deletes old data just fine.
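In case it helps, this is roughly what I'm planning to run next to double-check the effective topic-level overrides and, if it turns out segments just never roll, to force the active segment to roll sooner. The segment.ms value below is only an illustration; we have not applied it:

# Show the overrides set on the topic (cleanup.policy, retention.ms,
# plus any segment.* overrides we may have forgotten about)
kafka-configs --zookeeper $PRD:2181 --describe \
  --entity-type topics --entity-name ETD-TEST

# Example only, not applied: roll the active segment after one hour so
# old data becomes eligible for deletion sooner
kafka-configs --zookeeper $PRD:2181 --alter \
  --entity-type topics --entity-name ETD-TEST \
  --add-config segment.ms=3600000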
What adds to the mystery is the listing of the segment files on one of the brokers (broker 0 in this example):

$ ls -lah
total 642M
drwxr-xr-x 2 root root 4.0K May 15 17:28 .
drwxr-xr-x 281 root root 12K May 21 12:38 ..
-rw-r--r-- 1 root root 16K May 9 08:03 00000000000000000000.index
-rw-r--r-- 1 root root 277M May 9 08:03 00000000000000000000.log
-rw-r--r-- 1 root root 23K May 9 08:03 00000000000000000000.timeindex
-rw-r--r-- 1 root root 63K May 14 22:12 00000000000000507838.index
-rw-r--r-- 1 root root 131M May 14 22:00 00000000000000507838.log
-rw-r--r-- 1 root root 10 May 9 08:03 00000000000000507838.snapshot
-rw-r--r-- 1 root root 94K May 14 22:12 00000000000000507838.timeindex
-rw-r--r-- 1 root root 10M May 21 12:38 00000000000000518083.index
-rw-r--r-- 1 root root 170M May 21 12:38 00000000000000518083.log
-rw-r--r-- 1 root root 10 May 14 22:12 00000000000000518083.snapshot
-rw-r--r-- 1 root root 10M May 21 12:38 00000000000000518083.timeindex
-rw-r--r-- 1 root root 383 May 14 10:34 leader-epoch-checkpoint

We see the segment files are relatively small, and there are files in there dating from May 9, 2020; it is currently May 21, 2020 as I compose this email.
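If it would help, I can also dump the batch metadata of the oldest segment to check the timestamps that retention should be evaluated against. Something along these lines, assuming the kafka-dump-log tool that ships with this version, and using the data directory that appears in the broker logs below:

# Print per-batch metadata (offsets, create timestamps, etc.) for the
# oldest segment, without dumping the actual message payloads
kafka-dump-log --files /usr/data/kafka/kafka-logs/ETD-TEST-0/00000000000000000000.log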
Inspecting the container logs themselves for this one particular topic, we see this:

{"log":"[2020-04-25 12:04:25,685] INFO Deleted log /usr/data/kafka/kafka-logs/ETD-TEST-0/00000000000000000000.log.deleted. (kafka.log.LogSegment)\n","stream":"stdout","time":"2020-04-25T12:04:25.685214246Z"}
{"log":"[2020-04-25 12:04:25,685] INFO Deleted offset index /usr/data/kafka/kafka-logs/ETD-TEST-0/00000000000000000000.index.deleted. (kafka.log.LogSegment)\n","stream":"stdout","time":"2020-04-25T12:04:25.685370948Z"}
{"log":"[2020-04-25 12:04:25,685] INFO Deleted time index /usr/data/kafka/kafka-logs/ETD-TEST-0/00000000000000000000.timeindex.deleted. (kafka.log.LogSegment)\n","stream":"stdout","time":"2020-04-25T12:04:25.68563615Z"}
{"log":"[2020-04-25 12:04:25,686] INFO Deleted log /usr/data/kafka/kafka-logs/ETD-TEST-0/00000000000000460418.log.deleted. (kafka.log.LogSegment)\n","stream":"stdout","time":"2020-04-25T12:04:25.686287357Z"}
{"log":"[2020-04-25 12:04:25,686] INFO Deleted offset index /usr/data/kafka/kafka-logs/ETD-TEST-0/00000000000000460418.index.deleted. (kafka.log.LogSegment)\n","stream":"stdout","time":"2020-04-25T12:04:25.686376758Z"}
{"log":"[2020-04-25 12:04:25,686] INFO Deleted time index /usr/data/kafka/kafka-logs/ETD-TEST-0/00000000000000460418.timeindex.deleted. (kafka.log.LogSegment)\n","stream":"stdout","time":"2020-04-25T12:04:25.686436058Z"}
{"log":"[2020-04-25 12:04:25,688] INFO Deleted log /usr/data/kafka/kafka-logs/ETD-TEST-0/00000000000000471726.log.deleted. (kafka.log.LogSegment)\n","stream":"stdout","time":"2020-04-25T12:04:25.688237176Z"}
{"log":"[2020-04-25 12:04:25,688] INFO Deleted offset index /usr/data/kafka/kafka-logs/ETD-TEST-0/00000000000000471726.index.deleted. (kafka.log.LogSegment)\n","stream":"stdout","time":"2020-04-25T12:04:25.688251276Z"}
{"log":"[2020-04-25 12:04:25,688] INFO Deleted time index /usr/data/kafka/kafka-logs/ETD-TEST-0/00000000000000471726.timeindex.deleted. (kafka.log.LogSegment)\n","stream":"stdout","time":"2020-04-25T12:04:25.688332877Z"}
{"log":"[2020-05-09 12:03:35,810] INFO Cleaner 0: Cleaning segment 0 in log ETD-TEST-0 (largest timestamp Sat Apr 25 11:41:13 UTC 2020) into 0, retaining deletes. (kafka.log.LogCleaner)\n","stream":"stdout","time":"2020-05-09T12:03:35.810676351Z"}
{"log":"[2020-05-09 12:03:37,924] INFO Cleaner 0: Cleaning segment 483496 in log ETD-TEST-0 (largest timestamp Sat May 02 12:00:29 UTC 2020) into 0, retaining deletes. (kafka.log.LogCleaner)\n","stream":"stdout","time":"2020-05-09T12:03:37.924653038Z"}
{"log":"[2020-05-09 12:03:38,001] INFO Cleaner 0: Cleaning segment 495708 in log ETD-TEST-0 (largest timestamp Sat May 09 12:03:28 UTC 2020) into 0, retaining deletes. (kafka.log.LogCleaner)\n","stream":"stdout","time":"2020-05-09T12:03:38.001662613Z"}
{"log":"[2020-05-09 12:04:40,354] INFO Deleted log /usr/data/kafka/kafka-logs/ETD-TEST-0/00000000000000000000.log.deleted. (kafka.log.LogSegment)\n","stream":"stdout","time":"2020-05-09T12:04:40.355179461Z"}
{"log":"[2020-05-09 12:04:40,355] INFO Deleted offset index /usr/data/kafka/kafka-logs/ETD-TEST-0/00000000000000000000.index.deleted. (kafka.log.LogSegment)\n","stream":"stdout","time":"2020-05-09T12:04:40.355208261Z"}

What's weird is that the log cleaner says it's deleting the segments, but they don't appear to actually be deleted. Also, what's the deal with the "retaining deletes" line?

So, based on all of this, can anyone tell us whether there's anything funky with our configuration, or whether this is potentially a bug in Kafka?

Thanks for any input!
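P.S. If it's useful, I can also re-check how old the earliest messages actually are by printing their timestamps from the beginning of the topic. I'd expect something along these lines to work (broker address is a placeholder):

# Print the create timestamps of the first few messages in the topic
kafka-console-consumer --bootstrap-server <broker>:9092 --topic ETD-TEST \
  --from-beginning --max-messages 5 --property print.timestamp=true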