Hi All, I've read through the concerns/proposals related the watch service and here are my conclusions: * Watch service can watch local file systems only: It's fair to say that certificates must be copied to local FS in order to work (init container cp command or something) * Watch service can send multiple events even for a single directory change: this can be mitigated with a single atomic dirty flag (see my design suggestion) * Polling file modification time: When I hear any kind of polling implemented by us is just something I'm mostly opposing * security.ssl.internal.keystore.reload.duration: Security features must be executed nearly immediately no matter what
As a general remark, not sure how many users are using the watch service based approach in the operator but until now I've not seen any issue related to that. If somebody is having some specifics then please share. For now I would shoot for keystore not to have feature creep. When there is a valid use-case, community interest and the keystore story is already rock stable then we can consider to involve truststore later. Having the mentioned assumption that the operator approach works, here is my high level proposal: * Let's have enable flag for all such watch functionality. If it's false then the currently existing functionality must remain as-is * Let's have a LocalFSWatchService which is a singleton which has no path registrations by default * Add path registration functionality which is synchronised where a callback can be registered * Let's have a ReloadableSSLContext which implements the mentioned callback * Inside the callback set an atomic dirty flag only (this can handle multiple events for the same directory change + event handling in the watch service must be extreme fast) * Inside ReloadableSSLContext all SSLContext actions must be overridden. At the beginning of each function dirty flag must be checked and if dirty then certificates must be reloaded, flag can be set back to false (then original functionality call). It's extremely important that context reload must be synchronised. A synchronised boolean check + possible context reload can consume some time but I wouldn't expect any significant performance drop. My proposal main drivers: * Original code path must run when no watch service asked * Super fast event handling because million events may come in (not expecting but we should be prepared) * Clean separation between dir/file watch and file/dir usage * Well considered synchronisation model * Extensive unit testing because we're intended to touch the heart of Flink Happy to hear other opinions. BR, G On Mon, Apr 28, 2025 at 1:54 PM Nicolas Fraison <nicolas.frai...@datadoghq.com.invalid> wrote: > Thanks all for your feedback and sorry for the late answer (I was on > holiday). > > 1. Indeed it would add 4 threads. > For the non pekko component we can indeed have one watcher service used to > reload SSLContext for those components > For pekko this is a little more challenging as the creation of the pekko > ssl engine is managed by pekko himself. > Flink only generates appropriate config with class to execute to initiate > the pekko ssl engine [1]. This means that I will not be able to provide the > watcher service to this ssl engine. > One solution would be to rely on a singleton instead of a service injected > in each component but I'm not sure this is fine to use such in flink. > WDYT? > > We can also add a specific flink configuration > (security.ssl.internal.keystore.reload.enable) to only add this watcher > mechanism if the config is enabled to avoid adding those threads if this is > not needed. > > > 2. I'm fine with the LocalFSWatchService naming. > > 3. Also agree that some e2e tests must be added. > > > @doguscan namal, Thanks for challenging the proposal. > FYI, we are planning to rely on certificates with really short validity. > > D1. It seems that the proposal to rely on a reload period will still face > potential issues: > - receiving events while file content updates are still in progress: > There is no guarantee that we will not load the certificate while file > content updates are still in progress > - while it will not be affected by multiple notifications, we can reach a > point where only the truststore is updated when the reload happens. > Which means that if the keystore is updated just after, it will not be > taken in account before next run of the reload mechanism > > I think that with WatchService and appropriate reload grace period > mechanism we should be able to mitigate those 2 issues (ensuring minimum > reload even with multiple notify) > > From KIP-1119 [2] it looks like the same kind of requirements is under > discussion for Kafka to also rely on the WatchService Java API (SpringBoot > seems to also rely on this API to manage ssl reload [3]). > > D3. Do we have a real use case for this? > > [1] > > https://github.com/apache/flink/blob/b523264ab45d37cd9584a0e8c06f1ef6bd1aaed7/flink-rpc/flink-rpc-akka/src/main/java/org/apache/flink/runtime/rpc/pekko/PekkoUtils.java#L372 > > [2] > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1119%3A+Add+support+for+SSL+auto+reload > > [3] > > https://docs.spring.io/spring-boot/reference/features/ssl.html#features.ssl.reloading > > Regards, > Nicolas > > On Wed, Apr 23, 2025 at 12:14 PM Doğuşcan Namal <namal.dogus...@gmail.com> > wrote: > > > Hi Nicolas, thanks for the FLIP. > > > > I am fully supportive of the motivation and we should be supporting this > > feature. Here are couple of comments from my side: > > > > D1) > > Since you shared the implementation details on the FLIP as well, I would > > like to discuss whether using Java's WatchService is the best choice > here. > > > > I believe that the certificate renewal is not a frequent operation. Even > > once a day certificate renewals are not realistic(except for test cases > > maybe) but let's assume that this covers up the p99.99 of the use cases. > I > > am confident on this estimation since there hasn't been a request from > the > > community for this feature so far, which confirms that people were okay > > with infrequent cluster restarts. Following that it is infrequent, I > > believe that spawning up a thread that watches the file modification > > operations is not the best use of the limited resources on a cluster. > > > > There are some known limitations of the WatchService as well such as > > receiving multiple modification events for the same occurence [1], > > inotify's (WatchService's underlying mechanism in Linux environments) > > problems on containerized environments due to remote file systems [2] or > > receiving events while file content updates are still in progress. [3]. I > > do not know if these limitations are addressed in the newer versions but > > regardless of that it is clear that we may face with some ugly edge cases > > due to that. > > > > Given these complications, I would recommend just creating a new > SSLContext > > after a configured duration is expired. We could record the timestamp > when > > the previous SSLContext is created and update it after a configured > > duration is passed. This will be much easier to test and reason about > when > > it is running on production. This will eliminate the necessity to reason > > about the file modification operations as well. > > > > I briefly skimmed through the classes that need to be modified and it > > looked feasible for me. Let me know what are your comments on these. > > > > Note that this is already used in the Kafka world where a new SSLContext > is > > created after 12 hours. > > > > D2) We could provide a configuration to the user, such as > > "security.ssl.internal.keystore.reload.duration" so they could decide how > > often the new certificates should be loaded. > > > > D3) > > On the other hand, I wonder whether we should also handle supporting > > updating the file paths of the truststores and keystores under this FLIP > as > > well. Since the name of the FLIP is "Handle TLS Certificate Renewal" I > > think we could bring that into scope too :) > > > > Thanks > > > > [1] > > > > > https://stackoverflow.com/questions/16777869/java-7-watchservice-ignoring-multiple-occurrences-of-the-same-event > > > > [2] https://blog.arkey.fr/2019/09/13/watchservice-and-bind-mount/ > > > > [3] > https://surajatreyac.github.io/2014-07-29/reactive_file_handling.html > > > > [4] See also Platform Dependencies - > > > > > https://docs.oracle.com/javase/8/docs/api/index.html?java/nio/file/WatchService.html > > > > > > On Fri, 18 Apr 2025 at 18:25, Gabor Somogyi <gabor.g.somo...@gmail.com> > > wrote: > > > > > Hi Robert, > > > > > > Since I've added the same feature to the operator I'll take a look at > it. > > > Though it won't be lightning fast since I'm having several weeks off. > > > > > > Your questions are valid especially considering the fact that this > > feature > > > touches the hearth of the authentication so this must be rock solid > > > in order to avoid grey hair :) > > > > > > (1) I would vote on a single service which is heavily unit tested with > > > all the possible combinations including threading. > > > > > > Some standalone app could be added to really play with it (that would > > help > > > review). > > > I mean, create X files, start Y threads, and make assertions. > > > The reason why I'm suggesting it is the fact that AFAIR the watch > service > > > is quite sensitive even in single thread. If we could do this in a > finite > > > time > > > consuming unit test then it's even better. > > > > > > (2) +1 on that name to avoid confusion > > > > > > (3) I agree that some e2e is must, however this can be easily and > deeply > > > unit > > > tested so that part is also essential. One key test here is when > > > certificates > > > are not changing then no action must be performed (not to break the > whole > > > system apart). > > > > > > Purely personal opinion but such feature developments are slow by > nature > > > because > > > of edge case / stress testing. > > > > > > BR, > > > G > > > > > > > > > On Fri, Apr 18, 2025 at 4:53 PM Robert Metzger <rmetz...@apache.org> > > > wrote: > > > > > > > Hi Nicolas, > > > > > > > > This looks like a nice improvement, thanks for the write up. > > > > Are you in touch with any committer who's willing to review / merge > > this? > > > > > > > > Some random questions on the FLIP: > > > > (1) "Each service that depends on TLS certificates will initialize a > > > > FileSytemWatchService" > > > > > > > > It seems that there are 4 components using SSL, does this mean there > > will > > > > be 4 additional threads running, watching the same set of files? > > > > Wouldn't it be better to introduce a central file watching service, > and > > > SSL > > > > users can subscribe to updates, to reduce the number of threads? > > > > If this makes the whole effort 4x more complicated, I wouldn't > consider > > > it, > > > > but if its roughly the same effort, we should :) > > > > > > > > (2) "FileSytemWatchService" > > > > When I read this name, I was wondering, whether this is somehow > related > > > to > > > > the Flink "FileSystem" classes. Which I think its' not. > > > > Maybe a different name, that makes this separation more explicit, > would > > > > make sense. Maybe "LocalFSWatchService"? > > > > (I'm sorry to bring up naming stuff -- its very subjective, and > > > difficult) > > > > > > > > (3) For the test plan: There seem to be some SSL related e2e tests: > > > > > > > > > > > > > > https://github.com/apache/flink/blob/master/flink-end-to-end-tests/test-scripts/common_ssl.sh > > > > It would be nice to extend them to cover this feature as well. I > would > > > hate > > > > for this feature to slowly break by future changes, so good e2e test > > > > coverage is key, in particular bc so many components are involved. > > > > > > > > Best, > > > > Robert > > > > > > > > On Wed, Apr 16, 2025 at 11:55 AM Nicolas Fraison > > > > <nicolas.frai...@datadoghq.com.invalid> wrote: > > > > > > > > > Hi All, > > > > > > > > > > I'd like to start a discussion to Handle TLS Certificate Renewal > > > > > Please provide some feedback on this proposal: > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-523%3A+Handle+TLS+Certificate+Renewal > > > > > > > > > > Regards, > > > > > > > > > > Nicolas Fraison > > > > > > > > > > > > > > >