For reference, the cause of this turned out to be a corrupt timeindex file
on the earliest segment in the partition. Although Kafka didn't flag the
index files as corrupt, they clearly weren't correct, as they had a file
size of 12 bytes instead of several MB. It was fixed by stopping Kafka, removing
the offending .index and .timeindex files and starting Kafka to trigger an
index rebuild. The rebuild immediately triggered deletion of the old
segments, as expected under the configured retention.ms policy. This had to be repeated
on the partition leader for each affected partition. I'm not sure if there
is a less laborious way of clearing these indexes. Presumably you can just
remove the old segments manually, but I wanted to confirm that the retention
policy was working correctly. The suspected cause of the corruption was an
unclean shutdown.
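
In case it saves someone the manual hunt, here is a rough sketch of how the
affected files could be located before stopping the broker. It only lists
candidates and deletes nothing. The function name, the 12-byte threshold
(taken from the size seen above) and the 1 MB minimum segment size are my own
assumptions, not anything Kafka provides, so adjust them to your setup.

```python
from pathlib import Path

# Index file types that sit next to each segment's .log file.
INDEX_SUFFIXES = (".index", ".timeindex")


def find_suspect_indexes(log_dir, max_index_bytes=12, min_log_bytes=1024 * 1024):
    """Return index files that look too small for their segment.

    Hypothetical helper: both thresholds are assumptions, not Kafka defaults.
    """
    suspects = []
    for log_file in sorted(Path(log_dir).rglob("*.log")):
        # Skip small segments, which can legitimately have tiny indexes.
        if log_file.stat().st_size < min_log_bytes:
            continue
        for suffix in INDEX_SUFFIXES:
            index_file = log_file.with_suffix(suffix)
            if index_file.exists() and index_file.stat().st_size <= max_index_bytes:
                suspects.append(index_file)
    return suspects


# Example (the path is a placeholder for your broker's log.dirs entry):
# for path in find_suspect_indexes("/path/to/kafka-logs"):
#     print(path)
```

Anything it reports still needs checking by hand, and the files should only
be removed while the broker is stopped, as described above.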