Hi Stanislav, I already gave a similar suggestion to Henry earlier, and you can see his response here: https://lists.apache.org/thread/v8t7co0517hw2tlm0ypn8tnjfmhnhv83 . Good to see you have the same thought. :)
Thanks.
Luke

On Wed, May 14, 2025 at 6:31 PM Stanislav Kozlovski <stanislavkozlov...@apache.org> wrote:

> Have we considered using the traditional replication path to store the actual metadata for the topic/partition?
>
> I know the KIP says "The main purpose of FollowerFetchRequest/FollowerFetchResponse is now just to update the offsets and high watermark between leader and follower.", but what if we add each partition's metadata in the actual partition - i.e. an event with a pointer to the S3 object and the {byte-range, offset-range} of said partition's data in said S3 object?
>
> On 2025/05/08 07:57:15 Luke Chen wrote:
> > Hi Xinyu and Henry,
> >
> > I think the WAL metadata in KIP-1176 is not for log recovery; log recovery still loads log segments locally. The WAL metadata is for leader <-> follower information sharing only. Is my understanding correct?
> >
> > About the WAL metadata, as I mentioned earlier, I still worry about its size even if we move it to a separate topic. Since we don't know when exactly the WAL log segments will be moved to slow cloud storage, we have no way to set a "safe" retention.ms for this topic. Like in current tiered storage, by default we set retention.ms to -1 for the remote log metadata topic to avoid data loss. But we know the metadata sizes of KIP-405 vs. KIP-1176 will have huge differences. Suppose the segment size is 1GB and each request to fast cloud storage is 10KB; the metadata volume will be 100,000 times larger in KIP-1176.
> >
> > I'm thinking: if the WAL metadata is just for notifying followers about the records' location in fast cloud storage, could we simplify the WAL metadata management by including it in the fetch response with a special flag (e.g. walMetadata=true) in the fetchResponse record instead? Because:
> > 1. When the followers successfully download the logs from the fast cloud storage, the metadata is useless anymore.
> > 2. To help some lagging replicas catch up, this metadata can be stored on local disk under the partition folder on the leader and follower nodes. So when a lagged follower fetches some old data in the active log segment, the leader can still respond with the metadata, letting the follower download the logs from fast cloud storage to avoid cross-AZ cost.
> > 3. If the local metadata file is not found on the leader node, we can fall back to passing the pure logs directly (with cross-AZ cost for sure, but it will be rare).
> > 4. The local metadata file won't be uploaded to slow cloud storage and will be deleted after the local retention expires.
> > 5. Compared with the existing design using the __remote_log_metadata topic, the metadata still needs to be replicated to all replicas, so the cross-AZ cost is the same.
> >
> > What do you think about this alternative for WAL metadata?
> >
> > One more question from me:
> > 1. It looks like we only move "logs" to the fast cloud storage, not the index files, producer snapshots, etc. Is that right? Because this is different from KIP-405, and this KIP is kind of inherited from KIP-405, we should make it clear in the KIP.
> >
> > Thanks.
> > Luke
> >
> > On Thu, May 8, 2025 at 9:54 AM Xinyu Zhou <yu...@apache.org> wrote:
> >
> > > Hi Henry,
> > >
> > > Thank you for your detailed reply. The answer makes sense to me, and you're right, KIP-1176 has a clear and specific scope and is expected to have a quick path to implementation.
> > >
> > > I also want to discuss the metadata management of WAL log segments. Is an internal topic necessary for managing metadata? In AutoMQ, WAL is solely for recovery and is expected to be uploaded to standard S3 as soon as possible, without metadata management.
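To make the alternative above concrete, here is a rough sketch of what a fetch-response-carried WAL metadata entry and the leader-side fallback (points 2-3) could look like, together with the metadata-volume arithmetic from the mail. All names here (WalMetadataEntry, the walMetadata flag shape) are hypothetical illustrations, not APIs from the KIP:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class WalMetadataEntry:
    # Hypothetical shape of the pointer described above: where in fast
    # cloud storage the records for this partition live.
    object_key: str   # e.g. a key in the fast (S3EOZ) bucket
    byte_start: int
    byte_end: int
    base_offset: int
    last_offset: int

def build_fetch_response(local_metadata: Optional[WalMetadataEntry],
                         read_local_log):
    """Leader-side fallback from points 2-3: serve the metadata when the
    local metadata file still has it, otherwise ship the raw records
    (incurring cross-AZ cost, expected to be rare)."""
    if local_metadata is not None:
        return {"walMetadata": True, "entry": local_metadata}
    return {"walMetadata": False, "records": read_local_log()}

# Size arithmetic from the mail: one metadata entry per ~10KB upload in
# KIP-1176 vs. one per 1GB segment in KIP-405.
SEGMENT_BYTES = 1 * 1024**3
UPLOAD_BYTES = 10 * 1024
entries_per_segment = SEGMENT_BYTES // UPLOAD_BYTES
print(entries_per_segment)  # on the order of 100,000x more entries
```

This is only a toy model of the decision logic; the real fetch path would of course work on the serialized FetchResponse schema.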
> > > I think KIP-1176 might not need it either; during recovery, we can simply scan the WAL to restore the metadata.
> > >
> > > Regards,
> > > Xinyu
> > >
> > > On Thu, May 8, 2025 at 2:00 AM Henry Haiying Cai <haiying_...@yahoo.com.invalid> wrote:
> > >
> > > > Xinyu,
> > > > Thanks for your time reading the KIP and detailed comments. We are honored to have technical leaders from AutoMQ look at our work. Please see my answers below inline.
> > > >
> > > > On Tuesday, May 6, 2025 at 08:37:22 PM PDT, Xinyu Zhou <yu...@apache.org> wrote:
> > > >
> > > > Hi Henry and Tom,
> > > >
> > > > I've read the entire KIP-1176, and I think it's a smart move to advance tiered storage.
> > > >
> > > > If I understand correctly, KIP-1176 aims to eliminate cross-AZ traffic in tier 1 storage by replicating data to followers through the S3EOZ bucket. After that, followers only need to replicate data from the S3EOZ bucket, which is free of cross-AZ traffic charges.
> > > >
> > > > Based on my understanding, I have some questions:
> > > >
> > > > 1. Does KIP-1176 focus solely on eliminating cross-AZ traffic from ISR replication? Have you considered using S3/S3EOZ to reduce cross-AZ traffic from the producer side as well? Actually, AutoMQ has validated and implemented this solution; you can refer to this pull request: https://github.com/AutoMQ/automq/pull/2505
> > > > HC> The focus of KIP-1176 is mainly on reducing cross-AZ traffic cost between brokers, which is a big percentage (around 60%) of the broker-side cost. At the moment, we are focusing only on broker-side cost and will optimize producer/consumer-side traffic later. I know there are efforts from the community to optimize AZ traffic between producer and broker as well (e.g.
> > > > KIP-1123); we will benefit from producer-side cross-AZ cost savings when those efforts materialize.
> > > > 2. KIP-1176, like AutoMQ, is a leader-based architecture that benefits from using object storage for elastic features, such as quickly reassigning partitions. However, KIP-1176 still uses local block storage for managing active log segments, so its elasticity is similar to current tiered storage, right? Will KIP-1176 consider enhancing elasticity by utilizing object storage? Or is this not in the scope of KIP-1176?
> > > > HC> KIP-1176 is a small KIP built on existing constructs from tiered storage and on an existing core tenet of Kafka: the page cache. I know there are other efforts (e.g. KIP-1150 and AutoMQ's solution) which propose revamping Kafka's memory management and storage system by moving everything to the cloud and building memory/disk caching layers on top of that; those are big and audacious efforts which can take years to merge back into Apache Kafka. Instead we are focusing on a small and iterative approach which can be absorbed into Apache Kafka much more easily and quickly while cutting a big portion of the cost. Although this KIP targets a smaller goal, it can also achieve the bigger goal of cloud-native elasticity once everything is moved to cloud storage.
> > > > KIP-405 moved all closed log segments to object storage, and this KIP moves the active log segment to object storage. With everything on cloud storage, consumers can read directly from cloud storage (without connecting to the broker); in this direction the majority of the traffic (consumer traffic probably comprises 2/3 of the overall traffic) will happen outside the broker, so there are far fewer resources we need to allocate to the broker.
> > > > 3. The KIP indicates that the S3EOZ cost isn't significantly low, with cross-AZ data transfer fees at $1612 and S3EOZ costs at $648. Many AWS customers get substantial discounts on cross-AZ transfer fees, so the final benefit of KIP-1176 might not be significant (I am not sure). Could you please share any updates on KIP-1176 in Slack?
> > > >
> > > > HC> Yes, you are right that big companies (e.g. Slack/Salesforce) get deeper discounts from AWS. Since I cannot share the discount rate from my company, I can only quote public pricing numbers. But even with those discounts, cross-AZ traffic is still the major cost factor.
> > > > Also, I'm concerned about the community. Vendors are keen to move Kafka to object storage because cloud, especially AWS, is their main market, making cross-AZ traffic important. However, Apache Kafka users are spread across various environments, including different cloud providers (note that only AWS and GCP charge for cross-AZ traffic) and many on-premise data centers. Where are most self-hosted Kafka users located? Are they deeply impacted by cross-AZ traffic costs? How does the community balance these users' differing needs and weigh expected benefits against architectural complexity?
> > > > HC> This KIP (KIP-1176) mainly targets the same set of users who are already using KIP-405 (Tiered Storage), by extending tiered storage support to the active log segment. Those users will get extra savings on cross-AZ traffic and the extra benefit of having everything on cloud storage. I think in the US (and probably Europe as well), AWS/GCP holds the majority of the cloud market.
> > > > Overall, KIP-1176 is a great idea for using S3EOZ to eliminate cross-AZ replication traffic. Well done!
> > > >
> > > > Disclaimer: I work for AutoMQ, but I am wearing the community hat to join this discussion thread.
> > > >
> > > > Regards,
> > > > Xinyu
> > > >
> > > > On Wed, May 7, 2025 at 9:13 AM Henry Haiying Cai <haiying_...@yahoo.com.invalid> wrote:
> > > >
> > > > > Christo,
> > > > > In terms of supporting transactional messages, I looked at the current FetchRequest/Response code. It looks like follower fetch always fetches to the LOG_END offset (while for consumer fetch there is a choice of fetching up to HIGH_WATERMARK vs. up to TXN_COMMITTED). Since our current implementation copies all the way to LOG_END between the leader and follower brokers (through object storage), it seems it would naturally support replicating transactional messages as well.
> > > > > On Tuesday, May 6, 2025 at 12:20:43 PM PDT, Henry Haiying Cai <haiying_...@yahoo.com> wrote:
> > > > >
> > > > > Christo,
> > > > > Thanks for your detailed comments; see my answers below inline.
> > > > > On Tuesday, May 6, 2025 at 02:40:29 AM PDT, Christo Lolov <christolo...@gmail.com> wrote:
> > > > >
> > > > > Hello!
> > > > >
> > > > > It is great to see another proposal on the same topic, but optimising for different scenarios, so thanks a lot for the effort put into this!
> > > > > I have a few questions and statements in no particular order.
> > > > >
> > > > > If you use acks=-1 (acks=all), then an acknowledgement can only be sent to the producer if and only if the records have been persisted in replicated object storage (S3) or non-replicated object storage (S3E1AZ) and downloaded on followers. If you do not do this, then you do not cover the following two failure scenarios which Kafka does cover today:
> > > > >
> > > > > 1. Your leader persists records on disk. Your followers fetch the metadata for these records. The high watermark on the leader advances. The leader sends an acknowledgement to the producer. The records are not yet put in object storage. The leader crashes irrecoverably before the records are uploaded.
> > > > >
> > > > > 2. Your leader persists records on disk. Your followers fetch the metadata for these records. The high watermark on the leader advances. The leader sends an acknowledgement to the producer. The records are put in non-replicated object storage, but not downloaded by followers. The non-replicated object storage experiences prolonged unavailability. The leader crashes irrecoverably.
> > > > >
> > > > > In both of these scenarios you risk either data loss or data unavailability if a single replica goes out of commission. As such, this breaks the current definition of acks=-1 (acks=all) to the best of my knowledge. I am happy to discuss this further if you think this is not the case.
> > > > > HC> Our current implementation is to wait until the follower gets the producer data and the FollowerState in the leader's memory is updated through the existing FetchRequest/Response exchange (to be exact, it is the subsequent FetchRequest/Response after the follower has appended the producer data) before the leader acknowledges back to the producer. This way we don't have to modify the current implementation of the high watermark and follower state sync. So in this implementation, there is no risk of data loss, since the follower gets the producer data as in the existing code. The drawback is the extra hop from object storage to the follower broker; it can be mitigated by tuning the download frequency. We do have a plan to optimize the latency in acks=-1 by acking back to the producer as soon as the data is uploaded onto object storage; there is code we need to add to handle the case where the old leader crashes and the new leader needs to do a fast catch-up sync with object storage. We plan to propose this as a performance optimization feature on top of the current proposal.
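The acknowledgement ordering Henry describes (the leader acks the producer only after a subsequent fetch shows the follower has appended the downloaded data) can be sketched as a toy model; the class and field names below are invented for illustration, not the KIP's actual code:

```python
# Toy model of the acks=-1 ordering: the leader tracks each follower's
# reported log end offset (LEO) and only advances the high watermark,
# and hence acks the producer, once followers have caught up.

class LeaderState:
    def __init__(self, replicas):
        self.log_end = 0
        self.follower_leo = {r: 0 for r in replicas}
        self.high_watermark = 0

    def append_from_producer(self, n_records):
        base = self.log_end
        self.log_end += n_records
        return base + n_records - 1      # last offset awaiting ack

    def on_fetch_request(self, replica, follower_leo):
        # The follower reports its LEO after downloading from object
        # storage and appending to its local log.
        self.follower_leo[replica] = follower_leo
        self.high_watermark = min(
            [self.log_end] + list(self.follower_leo.values()))

    def can_ack(self, last_offset):
        return self.high_watermark > last_offset

leader = LeaderState(["b2", "b3"])
last = leader.append_from_producer(100)   # producer write, no ack yet
assert not leader.can_ack(last)           # followers haven't caught up
leader.on_fetch_request("b2", 100)        # b2 downloaded and appended
leader.on_fetch_request("b3", 100)
assert leader.can_ack(last)               # now safe to ack the producer
```

The point of the sketch is only the ordering: the object-storage hop changes *how* the follower obtains the data, not *when* the leader may acknowledge.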
> > > > > On your concern of the follower having the new metadata but not having the new data: the follower gets the data from the object storage download, appends it to the local log, and then updates its log end offset; its offset state is then transmitted back to the leader broker on the subsequent FetchRequest (similar to how it is done today, except the process is triggered from processFetchResponse). The log segment metadata the follower gets from the __remote_log_metadata topic is used to trigger the background task that downloads the new data segment, but not to build its local log offsets (e.g. logEndOffset); the local log's offset state is built when the data is appended to the local log (as in the existing Kafka code).
> > > > >
> > > > > S3E1AZ only resides in 1 availability zone. This poses the following questions:
> > > > > a) Will you have 1 bucket per availability zone, assuming a 3-broker cluster where each broker is in a separate availability zone?
> > > > > HC> Yes, you are right that S3E1AZ is only in one AZ. So in our setup, the S3E1AZ bucket's AZ is the same as the leader broker's AZ, and the follower broker is in a different AZ. So the data upload from the leader broker to S3E1AZ is fast (within the same AZ); the download from object storage to the follower is slower (across AZs), but AWS doesn't charge extra for that download.
> > > > > b) If not, then have you run a test on the network penalty in terms of latency for the 2 brokers not in the same availability zone but being leaders for their respective partitions? Here I am interested to see what 2/3 of any cluster will experience.
> > > > > HC>
> > > > > As I mentioned above, the download from S3E1AZ to the follower is slower because the traffic goes across AZs; it adds about 10ms for bigger packets. And in the situation you mentioned, where a broker has some partitions as followers and some as leaders (which is typical in a Kafka cluster), we have 3 S3E1AZ buckets (one in each AZ); when a broker needs to upload data onto S3E1AZ for its leader partitions, it will upload to the bucket in the same AZ as itself. The path of the file, including the bucket name, is part of the log segment metadata published to the __remote_log_metadata topic; when a follower broker needs to download, it will use the path of the file (including the bucket name). The same applies when a leader broker needs to download for the partitions it acts as a follower for.
> > > > > c) On a quick search it isn't clear whether S3E1AZ incurs cross-AZ networking data charges (again, in the case where there is only 1 bucket for the whole cluster). This might be my fault, but from the table at the end of the KIP it isn't super obvious to me whether the transfer cost includes these network charges. Have you run a test to see whether the pricing still makes sense? If you have, could you share these numbers in the KIP?
> > > > > HC> S3 (including S3E1AZ) doesn't charge for cross-AZ traffic (they do charge extra if it's across regions), but the latency is longer if the data travels across AZs. S3E1AZ charges for S3 PUT (upload) and S3 GET (download); PUT is usually 10x more expensive than GET.
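A back-of-envelope sketch of why batch size and upload frequency dominate the request bill in this model. The prices below are illustrative placeholders, not quoted AWS rates; the only relationship taken from the thread is that PUT is roughly 10x the price of GET:

```python
# Request-cost model: each upload batch is one PUT by the leader and one
# GET per follower. Prices are placeholder assumptions for illustration.

PUT_PRICE = 0.0000025       # assumed $/request (placeholder)
GET_PRICE = PUT_PRICE / 10  # thread: PUT is ~10x the price of GET

def monthly_request_cost(mb_per_sec, batch_kb, followers=2):
    puts_per_sec = (mb_per_sec * 1024) / batch_kb
    gets_per_sec = puts_per_sec * followers  # each follower downloads once
    secs = 30 * 24 * 3600
    return secs * (puts_per_sec * PUT_PRICE + gets_per_sec * GET_PRICE)

# Same throughput, 10x bigger batches -> 10x fewer PUTs -> ~10x cheaper.
small = monthly_request_cost(50, batch_kb=10)
large = monthly_request_cost(50, batch_kb=100)
assert abs(small / large - 10) < 1e-6
```

The linear relationship is the whole point: since cross-AZ transfer is free here, the PUT rate (i.e. how aggressively you batch) is the knob that decides whether the numbers "still make sense."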
> > > > > So we don't pay for cross-AZ traffic, but we do pay for S3 PUT and GET, so the batch size and upload frequency are still important so as not to overrun the S3 PUT cost. The numbers still make sense if the batch size and upload frequency are set right.
> > > > >
> > > > > As far as I understand, this will work in conjunction with Tiered Storage as it works today. Am I correct in my reading of the KIP? If I am correct, then how you store data in active segments seems to differ from how TS stores data in closed segments. In your proposal you put multiple partitions in the same blob. What and how will move this data back to the old format used by TS?
> > > > > HC> Yes, we do design this active log segment support to run along with the current tiered storage. And yes, the data stored in the active segment uploaded onto S3E1AZ is a bit different from the closed segment uploaded onto S3, mostly for cost reasons (as mentioned above), to combine the content from multiple topic partitions. The upload of active log segments onto S3E1AZ and the upload of closed segments onto S3 (the current tiered storage) run in parallel on their own. For example, assume we set local.retention.ms = 1 hour for a tiered-storage-enabled topic: the proposed KIP will upload the sections of batch records from the active log segment onto S3E1AZ when the batch records are appended to the active log segment on local disk. At some point this active log segment will be closed (when it reaches the size or age threshold), and later the current tiered storage code will upload this closed log segment onto S3 when the segment file is more than 1 hour old. These 2 activities (uploading to S3E1AZ
These 2 activities (uploading to > S3E1Z > > > and > > > > > uploading to S3) are independently run, there is no need to > transfer > > > the > > > > > log segment file from S3E1Z to S3. There is no change to the > current > > > > code > > > > > and management of tiered storage for closed segment. > > > > > > > > > > How will you handle compaction? > > > > > HC> We currently only support the normal append-only kafka logs, > > > > compacted > > > > > kafka logs are usually not very big to benefit from this KIP > proposal. > > > > But > > > > > we can look into compacted logs later. > > > > > How will you handle indexes? > > > > > HC>. We only need to upload/download the data segment log onto > S3E1Z, > > > > > various index files are built on the follower's disk when the > follower > > > > > downloads the data and appended onto the local log on follower's > disk > > > > (just > > > > > like the existing code the indexes file are built when the data is > > > > appended > > > > > to log), there is no need to transfer the index files from leader > > > broker > > > > > onto follower broker. This is a bit different than the existing > tiered > > > > > storage implementation for closed log segment where you need all > the > > > > states > > > > > to be stored on object storage, in our proposal the S3E1Z is just > an > > > > > intermediate data hop and we are replacing the follower direct read > > > from > > > > > leader by indirect download from object storage, but we are not > > > changing > > > > > how the index file was built. > > > > > How will you handle transactions? 
> > > > > HC> The current implementation handles the append-only, log-end-offset-based sync between leader and follower (those logs tend to be big and benefit from this proposal, and they are also the majority of the pipelines in our company). We plan to add support for transactions in the log file later; there might be some extra metadata that needs to be included in object storage, but again we are basically replacing the information exchange in the current FetchRequest/Response.
> > > > >
> > > > > Once again, this is quite exciting, so thanks for the contribution!
> > > > >
> > > > > Best,
> > > > > Christo
> > > > >
> > > > > On Thu, 1 May 2025 at 19:01, Henry Haiying Cai <haiying_...@yahoo.com.invalid> wrote:
> > > > >
> > > > > > Luke,
> > > > > > Thanks for your comments, see my answers below inline.
> > > > > > On Thursday, May 1, 2025 at 03:20:54 AM PDT, Luke Chen <show...@gmail.com> wrote:
> > > > > >
> > > > > > Hi Henry,
> > > > > >
> > > > > > This is a very interesting proposal!
> > > > > > I love the idea of minimizing the code change to be able to deliver quickly.
> > > > > > Thanks for proposing this!
> > > > > >
> > > > > > Some questions:
> > > > > > 1. In this KIP, we add one more tier of storage. That is: local disk -> fast object store -> slow object store.
> > > > > > Why can't we allow users to replace the local disk with the fast object store directly? Any consideration on this?
> > > > > > If we don't have the local disk, the follower fetch will be much simplified without downloading from the fast object store; is my understanding correct?
> > > > > > HC> The fast object storage is not as fast as local disk; the data latency on fast object storage is going to be around 10ms for big data packets, while the local disk append is fast since we only need to append the records into the page cache of the local file (the flush from page cache to disk is done asynchronously without affecting the main request/reply cycle between producer and leader broker). This is actually the major difference between this KIP and KIP-1150: although KIP-1150 can completely remove the local disk, it is going to have a long latency (its main use cases are customers who can tolerate 200ms latency), and it needs to build its own memory management and caching strategy since it no longer uses the page cache. Our KIP has no latency change (compared to the current Kafka status) on the acks=1 path, which I believe is still the operating mode for many companies' logging pipelines.
> > > > > >
> > > > > > 2. Will the WAL metadata be deleted after the data in fast object storage is deleted?
> > > > > > I'm a little worried about the size of the WAL metadata. I guess the __remote_log_metadata topic is stored on local disk only, right?
> > > > > > HC> Currently we are reusing the classes and constructs from KIP-405, e.g. the __remote_log_metadata topic and ConsumerManager and ProducerManager.
> > > > > > As you pointed out, the metadata size for active log segments is going to be big; our vision is to create a separate metadata topic for active log segments so we can set a shorter retention for that topic and remove the segment metadata faster, but we would need to refactor code in ConsumerManager and ProducerManager to work with a 2nd metadata topic.
> > > > > >
> > > > > > 3. In this KIP, we assume the fast object store is different from the slow object store.
> > > > > > Is it possible to allow users to use the same one?
> > > > > > Let's say we set both fast/slow object store = S3 (some use cases don't care too much about latency); if we offload the active log segment onto the fast object store (S3), can we avoid offloading the segment to the slow object store again after the log segment is rolled?
> > > > > > I'm thinking it may be possible to learn (borrow) some ideas from KIP-1150.
> > > > > > This way, we can achieve a similar goal, since we accumulate (combine) data from multiple partitions and upload to S3 to save cost.
> > > > > >
> > > > > > HC> Of course people can choose to use S3 for both fast and slow object storage. They can have the same class implementing both RemoteStorageManager and RemoteWalStorageManager; we proposed RemoteWalStorageManager as a separate interface to give people different implementation choices.
> > > > > > I think KIP-1176 (this one) and KIP-1150 can combine some ideas or implementations. We mainly focus on cutting AZ transfer cost while maintaining the same performance characteristics (such as latency) and doing a smaller evolution of the current Kafka code base.
> > > > > > KIP-1150 is a much more ambitious effort with a complete revamp of Kafka's storage and memory management system.
> > > > > > Thank you.
> > > > > > Luke
> > > > > >
> > > > > > On Thu, May 1, 2025 at 1:45 PM Henry Haiying Cai <haiying_...@yahoo.com.invalid> wrote:
> > > > > >
> > > > > > > Link to the KIP:
> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1176%3A+Tiered+Storage+for+Active+Log+Segment
> > > > > > > Motivation
> > > > > > > In KIP-405, the community proposed and implemented tiered storage for old Kafka log segment files: when a log segment is older than local.retention.ms, it becomes eligible to be uploaded to the cloud's object storage and removed from local storage, thus reducing local storage cost. KIP-405 only uploads older log segments, not the most recent active log segments (write-ahead logs). Thus in a typical 3-way replicated Kafka cluster, the 2 follower brokers would still need to replicate the active log segments from the leader broker. It is common practice to set up the 3 brokers in three different AZs to improve the high availability of the cluster. This causes the replication between leader/follower brokers to go across AZs, which is a significant cost (various studies show the cross-AZ transfer cost typically comprises 50%-60% of the total cluster cost). Since all the active log segments are physically present on three Kafka brokers, they still comprise significant resource usage on the brokers.
> > > > > > > The state of the broker is still quite big during node replacement, leading to longer node replacement times. KIP-1150 recently proposed diskless Kafka topics, but it leads to increased latency and a significant redesign. In comparison, this proposed KIP maintains identical performance for the acks=1 producer path, minimizes design changes to Kafka, and still slashes cost by an estimated 43%.
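For reference, the arithmetic behind the cost figures quoted earlier in this thread ($1612/month cross-AZ replication vs. $648/month S3EOZ). Note this only checks the replication line item; the KIP's overall 43% estimate additionally depends on its full cost model, which is not reproduced here:

```python
# Line-item comparison from the thread's quoted numbers: replacing the
# cross-AZ replication bill with S3EOZ (fast object storage) charges.

cross_az = 1612.0   # $/month, from the KIP's cost table
s3eoz = 648.0       # $/month, from the KIP's cost table
saving = (cross_az - s3eoz) / cross_az
print(f"{saving:.0%}")   # saving on the replication line item alone
```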