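Hello,

One possible drawback you may want to consider is that the proposed scenario will not work on a DR cluster maintained by MirrorMaker: MirrorMaker does not guarantee that offsets on the mirrored topics match those on the source cluster, so stored offset/partition pairs would not resolve to the same messages there. Other than that, it sounds interesting.
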
Best regards,
Radu

On Fri, Jan 24, 2025 at 12:49 PM Jan Wypych <jan.wyp...@billennium.com.invalid> wrote:

> Hello,
>
> We are currently designing a system that will ingest XML messages and
> then store them in some kind of long-term storage (years). The access
> pattern shows that new messages (1-2 weeks old) will be accessed
> frequently, while older data will be accessed rarely.
>
> We have currently chosen Kafka for the ingest part and some kind of S3
> for cold long-term storage, but we are still thinking about how to
> approach hot storage (1-2 weeks). We established that our S3 is too slow
> for hot data.
>
> We have a few options for the hot part of the storage, and one of them
> is Kafka (it would greatly simplify the whole system, and Kafka
> reliability is extremely high).
>
> Each Kafka message can be accessed using the offset/partition pair (we
> need some metadata from the messages anyway, so getting this pair is
> free for us). Kafka stores its data in segments, each of which has its
> own index, so we do not do a full scan of a topic. Consumer configs can
> be tweaked so that we do not prefetch more than one message, do not
> commit offsets for a consumer group, etc. Our initial tests show very
> promising results with high throughput and low latency (3 brokers, 300GB
> in 50 partitions, 10k messages/s, average latency under 3ms). Everything
> we have seen so far tells us that it should work.
>
> However, this goes against the common understanding of Kafka as a
> streaming solution. We searched the internet and could not find such a
> use case deployed.
>
> On the other hand, every time we found someone discouraging such a use
> case, there was no technical explanation behind it. Just a vague "Kafka
> was not created for this, better to use X".
>
> So, my question to you is:
> Does anybody see any technical reason why our approach (fetching
> messages by offset/partition in random order) should not work? Is there
> some limitation we do not see that could bite us in production (10-20 TB
> of data in topics, more than 3 brokers obviously)?
>
> Best regards,
> Jan Wypych
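
For readers who want to see what the point lookup described in the quoted message might look like, here is a minimal sketch. The thread does not say which client library is used, so the kafka-python client is an assumption, as are the helper names `point_read_config` and `fetch_one` and the 64 KiB fetch limit. It only illustrates the idea of assigning a partition manually, seeking to a stored offset, and polling for a single record without a consumer group or offset commits.

```python
from typing import Any, Dict


def point_read_config(bootstrap_servers: str) -> Dict[str, Any]:
    """Consumer settings for single-message lookups (kafka-python keyword names)."""
    return {
        "bootstrap_servers": bootstrap_servers,
        "group_id": None,             # no consumer group, so nothing to commit
        "enable_auto_commit": False,  # never write offsets back to Kafka
        "max_poll_records": 1,        # hand back at most one record per poll()
        # Assumed value: caps the bytes fetched per partition to limit read-ahead.
        # The broker returns whole record batches, so more than one message
        # may still travel over the wire.
        "max_partition_fetch_bytes": 64 * 1024,
    }


def fetch_one(consumer, topic: str, partition: int, offset: int,
              timeout_ms: int = 1000):
    """Return the record at (partition, offset), or None if it is not returned."""
    # Lazy import: kafka-python is an assumption, not something the thread names.
    from kafka import TopicPartition

    tp = TopicPartition(topic, partition)
    consumer.assign([tp])        # manual assignment, no group rebalancing
    consumer.seek(tp, offset)    # position the next fetch at the wanted offset
    batches = consumer.poll(timeout_ms=timeout_ms, max_records=1)
    for record in batches.get(tp, []):
        # If the exact offset is gone (retention, compaction), a later record
        # may come back instead, so verify before returning it.
        if record.offset == offset:
            return record
    return None
```

With kafka-python installed, a consumer could be created with `KafkaConsumer(**point_read_config("localhost:9092"))` (the address is a placeholder) and reused across many `fetch_one` calls, since creating a consumer per lookup would likely dominate the latency.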