Hello,

One possible drawback you may want to consider is that the proposed
scenario will not work on a DR cluster maintained by MirrorMaker, since
mirroring does not preserve the original offset/partition pairs. Other
than that it sounds interesting.

Best regards,
Radu

On Fri, Jan 24, 2025 at 12:49 PM Jan Wypych
<jan.wyp...@billennium.com.invalid> wrote:

> Hello,
>
> We are currently designing a system that will ingest XML messages and
> store them in some kind of long-term storage (years). The access pattern
> shows that new messages (1-2 weeks old) will be accessed frequently,
> while older data will be accessed rarely.
> We have currently chosen Kafka for the ingest part and some kind of S3
> for cold long-term storage, but we are still deciding how to approach
> hot storage (1-2 weeks). We have established that our S3 is too slow for
> hot data.
> We have a few options for this hot part of the storage, and one of them
> is Kafka (it would greatly simplify the whole system, and Kafka's
> reliability is extremely high).
> Each Kafka message can be accessed using its offset/partition pair (we
> need some metadata from the messages anyway, so getting this pair is
> free for us). Kafka stores its data in segments, each of which has its
> own index, so we do not do a full scan of a topic. Consumer configs can
> be tweaked so that we do not prefetch more than one message, do not
> commit offsets for a consumer group, etc. Our initial tests show very
> promising results with high throughput and low latency (3 brokers,
> 300 GB in 50 partitions, 10k messages/s, average latency under 3 ms).
> Everything we have seen so far tells us that it should work.
> However, this goes against the common understanding of Kafka as a
> streaming solution. We searched the internet and could not find such a
> use case deployed anywhere.
> On the other hand, every time we found someone discouraging this use
> case, there was no technical explanation behind it, just a vague "Kafka
> was not created for this, better to use X".
> So, my question to you is:
> Does anybody see any technical reason why our approach (fetching
> messages by offset/partition in random order) should not work? Is there
> some limitation we do not see that could bite us in production (10-20 TB
> of data in topics, more than 3 brokers, obviously)?
>
> Best regards,
> Jan Wypych
>
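
[Editor's note: for readers following along, the single-record, no-commit
consumer setup Jan describes can be approximated with standard Kafka
consumer properties like the ones below. The property names are real
consumer configs; the byte value is illustrative, not taken from Jan's
tests.]

```properties
# Return at most one record per poll(), so a point lookup does not prefetch neighbors
max.poll.records=1
# Cap how much data the broker returns per partition fetch (illustrative value)
max.partition.fetch.bytes=1048576
# Random readers do not act as a consumer group, so never auto-commit offsets
enable.auto.commit=false
# The application always seek()s to an explicit offset, so no fallback position is needed
auto.offset.reset=none
```

The read itself would then be a plain consumer `assign()` of the target
partition, a `seek()` to the known offset, and a single `poll()`.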
