Hello Kafka users! I'm posting because we are having an issue with topic retention that we don't understand, and we need input on what could be wrong on our end if it is indeed a configuration issue.

First, we are running a containerized version of Kafka with the following environment variables:

CONFLUENT_MINOR_VERSION=3
KAFKA_LOG_CLEANUP_POLICY=compact
KAFKA_LOG4J_LOGGERS=kafka.controller=ERROR
CONFLUENT_DEB_VERSION=1
KAFKA_VERSION=5.3.1
CONFLUENT_PATCH_VERSION=1
CONFLUENT_VERSION=5.3.1
KAFKA_LOG_RETENTION_BYTES=-1
PYTHON_VERSION=2.7.9-1
SCALA_VERSION=2.12
CONFLUENT_MAJOR_VERSION=5
KAFKA_NUM_PARTITIONS=4
ZULU_OPENJDK_VERSION=8=8.38.0.13
KAFKA_DEFAULT_REPLICATION_FACTOR=6
KAFKA_LOG_RETENTION_HOURS=2147483647

I'm listing these variables because they might help someone with a keen eye spot something we are missing. Based on this, we should have a global log cleanup/retention configuration of:

KAFKA_LOG_CLEANUP_POLICY=compact
KAFKA_LOG_RETENTION_HOURS=2147483647 (effectively unlimited; roughly 245,000 years)
KAFKA_LOG_RETENTION_BYTES=-1 (unlimited)

We currently have a topic called "ETD-TEST", shown below:

kafka-topics --describe --zookeeper $PRD:2181 --topic ETD-TEST
Topic: ETD-TEST    PartitionCount: 4    ReplicationFactor: 6    Configs: cleanup.policy=delete,retention.ms=1210000000
    Topic: ETD-TEST    Partition: 0    Leader: 1    Replicas: 1,4,5,2,3,6    Isr: 5,1,6,2,3,4
    Topic: ETD-TEST    Partition: 1    Leader: 1    Replicas: 1,2,3,6,5,4    Isr: 5,1,6,2,3,4
    Topic: ETD-TEST    Partition: 2    Leader: 4    Replicas: 4,5,1,6,2,3    Isr: 5,1,6,2,3,4
    Topic: ETD-TEST    Partition: 3    Leader: 1    Replicas: 1,3,6,4,2,5    Isr: 5,1,6,2,3,4

Notice it has a cleanup.policy of delete, which should override the global setting of compact, right?

The issue is that ETD-TEST has a topic-level retention.ms of 1210000000 (14 days), but the topic contains messages dating much further back, somewhere around 8 months. We only ran into this, and were surprised, when we started consuming the topic and saw close to 2 million messages.

From the reading and research I've done over the last few days, retention.ms isn't necessarily a limit on how old messages/events in the topic can be, but rather how long the segment files in the topic are kept around before they get rotated out. (Please correct me if this is false.)

I also learned that segment rotation is driven by both a size and a duration setting. The defaults are 1 GB for segment size (if the active segment reaches this size it gets rolled, much like log rotation), and segment.ms determines how long a segment can stay active before it is rolled regardless of size, with a default of 7 days. So if a segment never reaches the size limit, it will eventually hit the 7-day segment.ms and be rolled anyway.

The weird thing is that this seems to affect only this one particular topic. I created a separate topic with similar retention/cleanup policies, flooded it with data, and it deletes old data just fine.
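In case it helps, this is roughly what I'm planning to run next to double-check the effective topic-level overrides and, if it turns out segments just never roll, to force the active segment to roll sooner. The segment.ms value below is only an illustration; we have not applied it:

# Show the overrides set on the topic (cleanup.policy, retention.ms,
# plus any segment.* overrides we may have forgotten about)
kafka-configs --zookeeper $PRD:2181 --describe \
  --entity-type topics --entity-name ETD-TEST

# Example only, not applied: roll the active segment after one hour so
# old data becomes eligible for deletion sooner
kafka-configs --zookeeper $PRD:2181 --alter \
  --entity-type topics --entity-name ETD-TEST \
  --add-config segment.ms=3600000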
What adds to the mystery is the listing of the segment files on one of the brokers (broker 0 in this example):

$ ls -lah
total 642M
drwxr-xr-x 2 root root 4.0K May 15 17:28 .
drwxr-xr-x 281 root root 12K May 21 12:38 ..
-rw-r--r-- 1 root root 16K May 9 08:03 00000000000000000000.index
-rw-r--r-- 1 root root 277M May 9 08:03 00000000000000000000.log
-rw-r--r-- 1 root root 23K May 9 08:03 00000000000000000000.timeindex
-rw-r--r-- 1 root root 63K May 14 22:12 00000000000000507838.index
-rw-r--r-- 1 root root 131M May 14 22:00 00000000000000507838.log
-rw-r--r-- 1 root root 10 May 9 08:03 00000000000000507838.snapshot
-rw-r--r-- 1 root root 94K May 14 22:12 00000000000000507838.timeindex
-rw-r--r-- 1 root root 10M May 21 12:38 00000000000000518083.index
-rw-r--r-- 1 root root 170M May 21 12:38 00000000000000518083.log
-rw-r--r-- 1 root root 10 May 14 22:12 00000000000000518083.snapshot
-rw-r--r-- 1 root root 10M May 21 12:38 00000000000000518083.timeindex
-rw-r--r-- 1 root root 383 May 14 10:34 leader-epoch-checkpoint

We see the segment files are relatively small, and there are files in there dating from May 9, 2020; it is currently May 21, 2020 as I compose this email.
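If it would help, I can also dump the batch metadata of the oldest segment to check the timestamps that retention should be evaluated against. Something along these lines, assuming the kafka-dump-log tool that ships with this version, and using the data directory that appears in the broker logs below:

# Print per-batch metadata (offsets, create timestamps, etc.) for the
# oldest segment, without dumping the actual message payloads
kafka-dump-log --files /usr/data/kafka/kafka-logs/ETD-TEST-0/00000000000000000000.log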
Inspecting the container logs themselves for this one particular topic, we see this:

{"log":"[2020-04-25 12:04:25,685] INFO Deleted log /usr/data/kafka/kafka-logs/ETD-TEST-0/00000000000000000000.log.deleted. (kafka.log.LogSegment)\n","stream":"stdout","time":"2020-04-25T12:04:25.685214246Z"}
{"log":"[2020-04-25 12:04:25,685] INFO Deleted offset index /usr/data/kafka/kafka-logs/ETD-TEST-0/00000000000000000000.index.deleted. (kafka.log.LogSegment)\n","stream":"stdout","time":"2020-04-25T12:04:25.685370948Z"}
{"log":"[2020-04-25 12:04:25,685] INFO Deleted time index /usr/data/kafka/kafka-logs/ETD-TEST-0/00000000000000000000.timeindex.deleted. (kafka.log.LogSegment)\n","stream":"stdout","time":"2020-04-25T12:04:25.68563615Z"}
{"log":"[2020-04-25 12:04:25,686] INFO Deleted log /usr/data/kafka/kafka-logs/ETD-TEST-0/00000000000000460418.log.deleted. (kafka.log.LogSegment)\n","stream":"stdout","time":"2020-04-25T12:04:25.686287357Z"}
{"log":"[2020-04-25 12:04:25,686] INFO Deleted offset index /usr/data/kafka/kafka-logs/ETD-TEST-0/00000000000000460418.index.deleted. (kafka.log.LogSegment)\n","stream":"stdout","time":"2020-04-25T12:04:25.686376758Z"}
{"log":"[2020-04-25 12:04:25,686] INFO Deleted time index /usr/data/kafka/kafka-logs/ETD-TEST-0/00000000000000460418.timeindex.deleted. (kafka.log.LogSegment)\n","stream":"stdout","time":"2020-04-25T12:04:25.686436058Z"}
{"log":"[2020-04-25 12:04:25,688] INFO Deleted log /usr/data/kafka/kafka-logs/ETD-TEST-0/00000000000000471726.log.deleted. (kafka.log.LogSegment)\n","stream":"stdout","time":"2020-04-25T12:04:25.688237176Z"}
{"log":"[2020-04-25 12:04:25,688] INFO Deleted offset index /usr/data/kafka/kafka-logs/ETD-TEST-0/00000000000000471726.index.deleted. (kafka.log.LogSegment)\n","stream":"stdout","time":"2020-04-25T12:04:25.688251276Z"}
{"log":"[2020-04-25 12:04:25,688] INFO Deleted time index /usr/data/kafka/kafka-logs/ETD-TEST-0/00000000000000471726.timeindex.deleted. (kafka.log.LogSegment)\n","stream":"stdout","time":"2020-04-25T12:04:25.688332877Z"}
{"log":"[2020-05-09 12:03:35,810] INFO Cleaner 0: Cleaning segment 0 in log ETD-TEST-0 (largest timestamp Sat Apr 25 11:41:13 UTC 2020) into 0, retaining deletes. (kafka.log.LogCleaner)\n","stream":"stdout","time":"2020-05-09T12:03:35.810676351Z"}
{"log":"[2020-05-09 12:03:37,924] INFO Cleaner 0: Cleaning segment 483496 in log ETD-TEST-0 (largest timestamp Sat May 02 12:00:29 UTC 2020) into 0, retaining deletes. (kafka.log.LogCleaner)\n","stream":"stdout","time":"2020-05-09T12:03:37.924653038Z"}
{"log":"[2020-05-09 12:03:38,001] INFO Cleaner 0: Cleaning segment 495708 in log ETD-TEST-0 (largest timestamp Sat May 09 12:03:28 UTC 2020) into 0, retaining deletes. (kafka.log.LogCleaner)\n","stream":"stdout","time":"2020-05-09T12:03:38.001662613Z"}
{"log":"[2020-05-09 12:04:40,354] INFO Deleted log /usr/data/kafka/kafka-logs/ETD-TEST-0/00000000000000000000.log.deleted. (kafka.log.LogSegment)\n","stream":"stdout","time":"2020-05-09T12:04:40.355179461Z"}
{"log":"[2020-05-09 12:04:40,355] INFO Deleted offset index /usr/data/kafka/kafka-logs/ETD-TEST-0/00000000000000000000.index.deleted. (kafka.log.LogSegment)\n","stream":"stdout","time":"2020-05-09T12:04:40.355208261Z"}

What's weird is that the log cleaner says it's deleting the segments, but they don't appear to actually be deleted. Also, what's the deal with the "retaining deletes" line?

So, based on all of this, can anyone tell us whether there's anything funky with our configuration, or whether this is potentially a bug in Kafka?

Thanks for any input!
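P.S. If it's useful, I can also re-check how old the earliest messages actually are by printing their timestamps from the beginning of the topic. I'd expect something along these lines to work (broker address is a placeholder):

# Print the create timestamps of the first few messages in the topic
kafka-console-consumer --bootstrap-server <broker>:9092 --topic ETD-TEST \
  --from-beginning --max-messages 5 --property print.timestamp=true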