Thanks all for your feedback and sorry for the late answer (I was on holiday).
1. Indeed it would add 4 threads. For the non pekko component we can indeed have one watcher service used to reload SSLContext for those components For pekko this is a little more challenging as the creation of the pekko ssl engine is managed by pekko himself. Flink only generates appropriate config with class to execute to initiate the pekko ssl engine [1]. This means that I will not be able to provide the watcher service to this ssl engine. One solution would be to rely on a singleton instead of a service injected in each component but I'm not sure this is fine to use such in flink. WDYT? We can also add a specific flink configuration (security.ssl.internal.keystore.reload.enable) to only add this watcher mechanism if the config is enabled to avoid adding those threads if this is not needed. 2. I'm fine with the LocalFSWatchService naming. 3. Also agree that some e2e tests must be added. @doguscan namal, Thanks for challenging the proposal. FYI, we are planning to rely on certificates with really short validity. D1. It seems that the proposal to rely on a reload period will still face potential issues: - receiving events while file content updates are still in progress: There is no guarantee that we will not load the certificate while file content updates are still in progress - while it will not be affected by multiple notifications, we can reach a point where only the truststore is updated when the reload happens. Which means that if the keystore is updated just after, it will not be taken in account before next run of the reload mechanism I think that with WatchService and appropriate reload grace period mechanism we should be able to mitigate those 2 issues (ensuring minimum reload even with multiple notify) >From KIP-1119 [2] it looks like the same kind of requirements is under discussion for Kafka to also rely on the WatchService Java API (SpringBoot seems to also rely on this API to manage ssl reload [3]). D3. Do we have a real use case for this? [1] https://github.com/apache/flink/blob/b523264ab45d37cd9584a0e8c06f1ef6bd1aaed7/flink-rpc/flink-rpc-akka/src/main/java/org/apache/flink/runtime/rpc/pekko/PekkoUtils.java#L372 [2] https://cwiki.apache.org/confluence/display/KAFKA/KIP-1119%3A+Add+support+for+SSL+auto+reload [3] https://docs.spring.io/spring-boot/reference/features/ssl.html#features.ssl.reloading Regards, Nicolas On Wed, Apr 23, 2025 at 12:14 PM Doğuşcan Namal <namal.dogus...@gmail.com> wrote: > Hi Nicolas, thanks for the FLIP. > > I am fully supportive of the motivation and we should be supporting this > feature. Here are couple of comments from my side: > > D1) > Since you shared the implementation details on the FLIP as well, I would > like to discuss whether using Java's WatchService is the best choice here. > > I believe that the certificate renewal is not a frequent operation. Even > once a day certificate renewals are not realistic(except for test cases > maybe) but let's assume that this covers up the p99.99 of the use cases. I > am confident on this estimation since there hasn't been a request from the > community for this feature so far, which confirms that people were okay > with infrequent cluster restarts. Following that it is infrequent, I > believe that spawning up a thread that watches the file modification > operations is not the best use of the limited resources on a cluster. > > There are some known limitations of the WatchService as well such as > receiving multiple modification events for the same occurence [1], > inotify's (WatchService's underlying mechanism in Linux environments) > problems on containerized environments due to remote file systems [2] or > receiving events while file content updates are still in progress. [3]. I > do not know if these limitations are addressed in the newer versions but > regardless of that it is clear that we may face with some ugly edge cases > due to that. > > Given these complications, I would recommend just creating a new SSLContext > after a configured duration is expired. We could record the timestamp when > the previous SSLContext is created and update it after a configured > duration is passed. This will be much easier to test and reason about when > it is running on production. This will eliminate the necessity to reason > about the file modification operations as well. > > I briefly skimmed through the classes that need to be modified and it > looked feasible for me. Let me know what are your comments on these. > > Note that this is already used in the Kafka world where a new SSLContext is > created after 12 hours. > > D2) We could provide a configuration to the user, such as > "security.ssl.internal.keystore.reload.duration" so they could decide how > often the new certificates should be loaded. > > D3) > On the other hand, I wonder whether we should also handle supporting > updating the file paths of the truststores and keystores under this FLIP as > well. Since the name of the FLIP is "Handle TLS Certificate Renewal" I > think we could bring that into scope too :) > > Thanks > > [1] > > https://stackoverflow.com/questions/16777869/java-7-watchservice-ignoring-multiple-occurrences-of-the-same-event > > [2] https://blog.arkey.fr/2019/09/13/watchservice-and-bind-mount/ > > [3] https://surajatreyac.github.io/2014-07-29/reactive_file_handling.html > > [4] See also Platform Dependencies - > > https://docs.oracle.com/javase/8/docs/api/index.html?java/nio/file/WatchService.html > > > On Fri, 18 Apr 2025 at 18:25, Gabor Somogyi <gabor.g.somo...@gmail.com> > wrote: > > > Hi Robert, > > > > Since I've added the same feature to the operator I'll take a look at it. > > Though it won't be lightning fast since I'm having several weeks off. > > > > Your questions are valid especially considering the fact that this > feature > > touches the hearth of the authentication so this must be rock solid > > in order to avoid grey hair :) > > > > (1) I would vote on a single service which is heavily unit tested with > > all the possible combinations including threading. > > > > Some standalone app could be added to really play with it (that would > help > > review). > > I mean, create X files, start Y threads, and make assertions. > > The reason why I'm suggesting it is the fact that AFAIR the watch service > > is quite sensitive even in single thread. If we could do this in a finite > > time > > consuming unit test then it's even better. > > > > (2) +1 on that name to avoid confusion > > > > (3) I agree that some e2e is must, however this can be easily and deeply > > unit > > tested so that part is also essential. One key test here is when > > certificates > > are not changing then no action must be performed (not to break the whole > > system apart). > > > > Purely personal opinion but such feature developments are slow by nature > > because > > of edge case / stress testing. > > > > BR, > > G > > > > > > On Fri, Apr 18, 2025 at 4:53 PM Robert Metzger <rmetz...@apache.org> > > wrote: > > > > > Hi Nicolas, > > > > > > This looks like a nice improvement, thanks for the write up. > > > Are you in touch with any committer who's willing to review / merge > this? > > > > > > Some random questions on the FLIP: > > > (1) "Each service that depends on TLS certificates will initialize a > > > FileSytemWatchService" > > > > > > It seems that there are 4 components using SSL, does this mean there > will > > > be 4 additional threads running, watching the same set of files? > > > Wouldn't it be better to introduce a central file watching service, and > > SSL > > > users can subscribe to updates, to reduce the number of threads? > > > If this makes the whole effort 4x more complicated, I wouldn't consider > > it, > > > but if its roughly the same effort, we should :) > > > > > > (2) "FileSytemWatchService" > > > When I read this name, I was wondering, whether this is somehow related > > to > > > the Flink "FileSystem" classes. Which I think its' not. > > > Maybe a different name, that makes this separation more explicit, would > > > make sense. Maybe "LocalFSWatchService"? > > > (I'm sorry to bring up naming stuff -- its very subjective, and > > difficult) > > > > > > (3) For the test plan: There seem to be some SSL related e2e tests: > > > > > > > > > https://github.com/apache/flink/blob/master/flink-end-to-end-tests/test-scripts/common_ssl.sh > > > It would be nice to extend them to cover this feature as well. I would > > hate > > > for this feature to slowly break by future changes, so good e2e test > > > coverage is key, in particular bc so many components are involved. > > > > > > Best, > > > Robert > > > > > > On Wed, Apr 16, 2025 at 11:55 AM Nicolas Fraison > > > <nicolas.frai...@datadoghq.com.invalid> wrote: > > > > > > > Hi All, > > > > > > > > I'd like to start a discussion to Handle TLS Certificate Renewal > > > > Please provide some feedback on this proposal: > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-523%3A+Handle+TLS+Certificate+Renewal > > > > > > > > Regards, > > > > > > > > Nicolas Fraison > > > > > > > > > >