Hi Jan,

One immediate concern (which you have probably thought through) is that you
will have to be careful configuring Kafka's cleanup policy. From my
understanding, with the "delete" cleanup policy, Kafka removes whole log
segments once they exceed the retention.ms age limit or push the partition
past the retention.bytes size limit (corrections welcome!). Because deletion
happens at segment granularity, the effective retention is only approximate,
so it may take some tuning to get a reliable two-week window.
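
For example, pinning an explicit, age-only retention on the topic (a
minimal sketch with kafka-python; the topic name is a placeholder):

    # Sketch: pin a two-week, age-based retention on the hot topic.
    # Assumes kafka-python; "xml-hot" is a placeholder topic name.
    from kafka.admin import KafkaAdminClient, ConfigResource, ConfigResourceType

    admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
    admin.alter_configs([
        ConfigResource(
            ConfigResourceType.TOPIC,
            "xml-hot",
            configs={
                "cleanup.policy": "delete",
                "retention.ms": str(14 * 24 * 60 * 60 * 1000),  # 14 days
                "retention.bytes": "-1",  # age-based only, no size cap
            },
        )
    ])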

Since it sounds like you're basically using Kafka as a key-value caching
layer, maybe something like Redis (ElastiCache on AWS) with a TTL would
better suit your use case? You would get a firm guarantee on how long each
message persists, and you could choose your own key scheme instead of (what
I'm guessing you're doing now) keeping an external index of
fingerprint-to-partition:offset mappings.
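
A minimal sketch of that pattern with redis-py (host, key, and payload are
placeholders):

    # Sketch: cache the raw XML under its fingerprint with a hard 14-day TTL.
    # Assumes redis-py; host, key, and payload are placeholders.
    import redis

    r = redis.Redis(host="localhost", port=6379)
    r.set("fingerprint:abc123", b"<message>...</message>", ex=14 * 24 * 3600)
    xml = r.get("fingerprint:abc123")  # returns None once the TTL expires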

If, on the other hand, you actually want a cache whose retention flexes
with available disk space (say, via retention.bytes), this is indeed a
creative use for Kafka, and I would also be interested to hear of any
technical objections.

Cheers,
Malcolm


On Fri, Jan 24, 2025 at 3:05 AM Ömer Şiar Baysal <osiarbay...@gmail.com>
wrote:

> Hi,
>
> The data you gathered shows promising results. One thing to consider is
> testing how the page cache that Kafka relies on affects response times:
> fetch requests for data already in the page cache are served from memory
> and are much faster, which can give the impression that all fetch
> requests perform the same, when in fact non-cached reads will be slower.
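>
> A quick way to see the effect (a rough sketch with kafka-python; topic,
> partition, and offset are placeholders): time the same fetch twice and
> compare the first (cold) read with the repeated (warm) one.
>
>     # Rough sketch: compare a cold fetch with an immediately repeated
>     # (page-cache-warm) fetch of the same offset. Placeholders throughout.
>     import time
>     from kafka import KafkaConsumer, TopicPartition
>
>     tp = TopicPartition("xml-hot", 0)
>     consumer = KafkaConsumer(bootstrap_servers="localhost:9092",
>                              enable_auto_commit=False, max_poll_records=1)
>     consumer.assign([tp])
>
>     for label in ("cold", "warm"):
>         consumer.seek(tp, 12345)  # placeholder offset
>         t0 = time.perf_counter()
>         consumer.poll(timeout_ms=1000, max_records=1)
>         print(label, time.perf_counter() - t0)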
>
> Good luck, and let me know if you need more information about the page
> cache.
> Omer Siar Baysal
>
>
> On Fri, Jan 24, 2025, 11:48 Jan Wypych <jan.wyp...@billennium.com.invalid>
> wrote:
>
> > Hello,
> >
> > We are currently designing a system that will ingest XML messages and
> > then store them in some kind of long-term storage (years). The access
> > pattern shows that new messages (1-2 weeks old) will be accessed
> > frequently, while older data will be accessed rarely.
> > We have chosen Kafka for the ingest part and some kind of S3 for cold
> > long-term storage, but we are still deciding how to approach the hot
> > tier (1-2 weeks). We have established that our S3 is too slow for hot
> > data.
> > We have a few options for this hot part of the storage, and one of them
> > is Kafka (it would greatly simplify the whole system, and Kafka's
> > reliability is extremely high).
> > Each Kafka message can be accessed by its offset/partition pair (we
> > need some metadata from the messages anyway, so getting this pair is
> > free for us). Kafka stores its data in segments, each with its own
> > index, so we never do a full scan of a topic. Consumer configs can be
> > tweaked so that we do not prefetch more than one message, do not commit
> > offsets for a consumer group, etc. Our initial tests show very
> > promising results, with high throughput and low latency (3 brokers,
> > 300 GB in 50 partitions, 10k messages/s, average latency under 3 ms).
> > Everything we have seen so far tells us that it should work. A minimal
> > sketch of the fetch-by-offset pattern follows (kafka-python purely for
> > illustration; brokers and offsets are placeholders):
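> >
> >     # Sketch: random-access a single message by (partition, offset),
> >     # with no consumer group, no committed offsets, one record per poll.
> >     from kafka import KafkaConsumer, TopicPartition
> >
> >     consumer = KafkaConsumer(bootstrap_servers="localhost:9092",
> >                              group_id=None,             # no consumer group
> >                              enable_auto_commit=False,  # never commit
> >                              max_poll_records=1)        # one message per poll
> >
> >     def fetch(topic, partition, offset):
> >         tp = TopicPartition(topic, partition)
> >         consumer.assign([tp])
> >         consumer.seek(tp, offset)
> >         records = consumer.poll(timeout_ms=1000, max_records=1)
> >         batch = records.get(tp)
> >         return batch[0].value if batch else None
> >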
> > However, this goes against the common understanding of Kafka as a
> > streaming solution. We searched the internet and could not find such a
> > use case deployed. On the other hand, every time we found someone
> > discouraging this use case, there was no technical explanation behind
> > it, just a vague "Kafka was not created for this, better to use X".
> > So, my question to you is: does anybody see any technical reason why
> > our approach (fetching messages by offset/partition in random order)
> > should not work? Is there some limitation we do not see that could bite
> > us in production (10-20 TB of data in topics, more than 3 brokers,
> > obviously)?
> >
> > Best regards,
> > Jan Wypych
> >
>
