Radim
On 11. 05. 23 13:47, Divij Vaidya wrote:
Hey Radim,

One of the reasons for the slowdown is preparation of the upcoming releases (the community is currently in code freeze / resolve-release-blockers mode) and preparation for Kafka Summit next week. I would suggest giving another 2-3 weeks for folks to chime in. I would personally visit this KIP in the last week of May.

--
Divij Vaidya

On Thu, May 11, 2023 at 1:34 PM Radim Vansa <rva...@azul.com.invalid> wrote:

Hello all,

it seems that this KIP did not spark much interest; I am not sure whether people just don't care or whether there are objections to the proposal. What should be the next step? I don't think it has been discussed enough to proceed with voting.

Cheers,

Radim

On 27. 04. 23 8:39, Radim Vansa wrote:

Thank you for those questions. As I've mentioned, my knowledge of Kafka is quite limited, so these are exactly the things that need careful thinking! Comments inline.

On 26. 04. 23 16:28, Mickael Maison wrote:

Hi Radim,

Thanks for the KIP! CRaC is an interesting project and it could be a useful feature in Kafka clients. The KIP is pretty vague in terms of the expected behavior of clients when checkpointing and restoring. For example:

1. A consumer may have pre-fetched records in memory. When it is checkpointed, its group will rebalance and another consumer will consume the same records. When the initial consumer is restored, will it process its pre-fetched records and basically reprocess records already handled by other consumers?

How would the broker (?) know which records were really consumed? I think there must be some form of the Two Generals Problem here. The checkpoint should leave as much of the application untouched as possible, so I would expect the pre-fetched records to be consumed after the restore operation as if nothing had happened. I can imagine this could cause some trouble if the data depends on the 'external' world, e.g. other members of the cluster, but I wouldn't want to break the general guarantees Kafka provides if we can avoid it. We certainly have the option to do the checkpoint more gracefully, deregistering from the group (the checkpoint is effectively blocked by the notification handler). If we're talking about using CRaC for boot speedup this is not that important - when the app is about to be checkpointed it will likely stop processing data anyway. For other use cases (e.g. live migration) it might matter.

2. Producers may have records in-flight or in the producer buffer when they are checkpointed. How do you propose to handle these cases?

If there's something in flight we can wait for the acks (a rough sketch of such a handler is included below). Alternatively, if the receiver guards against duplicate delivery using unique IDs/sequence numbers, we could resend after restore. As for the data in the buffer, IMO that can wait until restore.

3. Clients may have loaded plugins such as serializers. These plugins may establish network connections too. How are these expected to automatically reconnect when the application is restored?

If there's an independent pool of connections, it's up to the plugin author to support CRaC; I doubt there's anything the generic code could do. Also, it's likely that the plugins won't need any extension to the SPI; they would register their handlers independently (if ordering matters there are ways to prioritize one resource over another) - see the sketch at the end of this thread.

Cheers!

Radim Vansa
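As an illustration of the "wait for the acks" idea above, a minimal sketch using the org.crac facade [4] could look like the following. The CheckpointAwareProducer wrapper is hypothetical and is not the code from the PR [1]; it assumes that closing and reopening the client's own sockets is taken care of by the lower-level changes proposed there.

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.crac.Context;
    import org.crac.Core;
    import org.crac.Resource;

    // Hypothetical wrapper (not the PR's code): make sure nothing is
    // in flight when the snapshot is taken.
    public class CheckpointAwareProducer implements Resource, AutoCloseable {
        private final KafkaProducer<String, String> producer;

        public CheckpointAwareProducer(Properties props) {
            this.producer = new KafkaProducer<>(props);
            // The org.crac global context delivers beforeCheckpoint/afterRestore
            // callbacks; on a JDK without CRaC this registration is a no-op.
            Core.getGlobalContext().register(this);
        }

        public void send(String topic, String key, String value) {
            producer.send(new ProducerRecord<>(topic, key, value));
        }

        @Override
        public void beforeCheckpoint(Context<? extends Resource> context) {
            // Wait for acks of all records sent so far ("wait for the acks");
            // closing the underlying sockets is assumed to happen in the
            // client's networking layer, as proposed in the PR.
            producer.flush();
        }

        @Override
        public void afterRestore(Context<? extends Resource> context) {
            // Nothing needed here in this sketch; the client re-establishes
            // broker connections on the next send().
        }

        @Override
        public void close() {
            producer.close();
        }
    }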
Thanks,
Mickael

On Wed, Apr 26, 2023 at 8:27 AM Radim Vansa <rva...@azul.com.invalid> wrote:

Hi all,

I haven't seen many reactions to this proposal. Is there any general policy regarding dependencies, or a prior decision that would hint at this?

Thanks!

Radim

On 21. 04. 23 10:10, Radim Vansa wrote:

Thank you, now to be tracked as KIP-921: https://cwiki.apache.org/confluence/display/KAFKA/KIP-921%3A+OpenJDK+CRaC+support

Radim

On 20. 04. 23 15:26, Josep Prat wrote:

Hi Radim,

You should now have permissions to create a KIP.

Best,
Josep Prat

On Thu, Apr 20, 2023 at 2:22 PM Radim Vansa <rva...@azul.com.invalid> wrote:

Hello,

upon filing a PR [1] with some initial support for OpenJDK CRaC [2][3] I was directed here to raise a KIP (I don't have the permissions in the wiki/JIRA to create the KIP page yet, though).

In a nutshell, CRaC intends to provide a way to checkpoint (snapshot) and persist a running Java application and later restore it, possibly on a different computer. This can be used to significantly speed up the boot process (from seconds or minutes to tens of milliseconds), or for live replication or migration of a warmed-up application. This is not entirely transparent to the application: the application can register to be notified when this is happening, and sometimes has to assist to prevent unexpected state after restore - e.g. by closing network connections and files.

CRaC is not yet integrated into the mainline JDK; a JEP is being prepared, and users are welcome to try out our builds. However, even when this gets into the JDK we can't expect users to jump onto the latest release immediately; therefore we provide a facade package, org.crac [4], that delegates to the implementation if it is present in the running JDK, or provides a no-op implementation otherwise. With or without the implementation, the support for CRaC in the application should be designed to have minimal impact on performance (a few extra objects, some volatile reads...). On the other hand, the checkpoint operation itself can be non-trivial. Therefore the main consideration should be the maintenance costs - keeping a small JAR in the dependencies and some extra code in networking and persistence.

The support for CRaC does not have to be all-in for all components - maybe it does not make sense to snapshot a broker. My PR was for Kafka Clients because the open network connections need to be handled in a web application (in my case I am enabling CRaC in the Quarkus Super Heroes [5] demo). The PR does not handle all possible client-side uses; as I am not familiar with Kafka, I follow a whack-a-mole strategy.

It is possible that the C/R could be handled in a different layer, e.g. in the Quarkus integration code. However, our intent is to push the changes as low in the technology stack as possible, to provide the best fanout to users without duplicating maintenance efforts. Also, having the support higher up can be fragile and break encapsulation.

Thank you for your consideration; I hope that you'll appreciate our attempt to innovate in the Java ecosystem.

Radim Vansa

PS: I'd appreciate it if someone could give me the permissions on the wiki to create a proper KIP! Username: rvansa (both Confluence and JIRA).
[1] https://github.com/apache/kafka/pull/13619
[2] https://wiki.openjdk.org/display/crac
[3] https://github.com/openjdk/crac
[4] https://github.com/CRaC/org.crac
[5] https://quarkus.io/quarkus-workshops/super-heroes/
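To make the plugin point from the discussion above concrete, here is a rough sketch of what a plugin author could do with the org.crac facade [4]. The RemoteLookupSerializer class and its lookup.* settings are made up for illustration and are not part of Kafka or the PR; the example only shows a plugin registering its own handler independently of the client.

    import java.io.IOException;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;
    import java.util.Map;

    import org.apache.kafka.common.serialization.Serializer;
    import org.crac.Context;
    import org.crac.Core;
    import org.crac.Resource;

    // Made-up example of a serializer plugin that keeps its own connection
    // to an external service and therefore registers its own CRaC handler,
    // independently of anything the Kafka client itself does.
    public class RemoteLookupSerializer implements Serializer<String>, Resource {
        private volatile Socket connection;
        private String host;
        private int port;

        @Override
        public void configure(Map<String, ?> configs, boolean isKey) {
            Object hostCfg = configs.get("lookup.host");
            Object portCfg = configs.get("lookup.port");
            host = hostCfg == null ? "localhost" : hostCfg.toString();
            port = portCfg == null ? 9099 : Integer.parseInt(portCfg.toString());
            connect();
            // org.crac delegates to the JDK's CRaC implementation when present
            // and is a no-op otherwise, so this registration is safe on any JDK.
            Core.getGlobalContext().register(this);
        }

        private void connect() {
            try {
                connection = new Socket(host, port);
            } catch (IOException e) {
                throw new RuntimeException("Cannot reach the lookup service", e);
            }
        }

        @Override
        public byte[] serialize(String topic, String data) {
            // A real plugin would consult the external service here; kept trivial.
            return data == null ? null : data.getBytes(StandardCharsets.UTF_8);
        }

        @Override
        public void beforeCheckpoint(Context<? extends Resource> context) throws IOException {
            connection.close();   // no open sockets may survive the checkpoint
        }

        @Override
        public void afterRestore(Context<? extends Resource> context) {
            connect();            // re-establish the connection after restore
        }

        @Override
        public void close() {
            try {
                if (connection != null) {
                    connection.close();
                }
            } catch (IOException ignored) {
            }
        }
    }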