Hi, This remaining part of the doc has been updated
On Thu, Jun 19, 2025 at 9:37 AM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote: > I think in this SSLContextLoader class call() function is not correct and > must be onFileOrDirectoryModified(...), right? > > [image: Screenshot 2025-06-19 at 9.35.14.png] > > On Thu, Jun 19, 2025 at 8:14 AM Nicolas Fraison < > nicolas.frai...@datadoghq.com> wrote: > >> Hi, >> >> Which part of the doc are you referring to? >> >> On Wed, Jun 18, 2025 at 7:08 PM Gabor Somogyi <gabor.g.somo...@gmail.com> >> wrote: >> >>> There is still a call function there which is left there from the >>> previous >>> design... >>> >>> On Tue, Jun 17, 2025 at 9:48 AM Nicolas Fraison >>> <nicolas.frai...@datadoghq.com.invalid> wrote: >>> >>> > Done >>> > >>> > On Mon, Jun 16, 2025 at 2:46 PM Gabor Somogyi < >>> gabor.g.somo...@gmail.com> >>> > wrote: >>> > >>> > > I still see 3 Callable occurrence in the FLIP... >>> > > >>> > > On Thu, Jun 12, 2025 at 2:16 PM Nicolas Fraison >>> > > <nicolas.frai...@datadoghq.com.invalid> wrote: >>> > > >>> > > > Hi, FLIP has been updated with last comment >>> > > > >>> > > > On Thu, Jun 5, 2025 at 6:35 PM Gabor Somogyi < >>> > gabor.g.somo...@gmail.com> >>> > > > wrote: >>> > > > >>> > > > > The interface has been added. Do we intend to pass an >>> > > > > instance registerWatchedPath function as we discussed? >>> > > > > If yes then the FLIP needs further adjustments all places where >>> now >>> > > > > Callable provided. >>> > > > > >>> > > > > G >>> > > > > >>> > > > > >>> > > > > On Thu, Jun 5, 2025 at 2:27 PM Nicolas Fraison >>> > > > > <nicolas.frai...@datadoghq.com.invalid> wrote: >>> > > > > >>> > > > > > Hi, >>> > > > > > >>> > > > > > It indeed make sense, FLIP has been updated >>> > > > > > >>> > > > > > Nicolas >>> > > > > > >>> > > > > > On Wed, Jun 4, 2025 at 12:10 PM Gabor Somogyi < >>> > > > gabor.g.somo...@gmail.com >>> > > > > > >>> > > > > > wrote: >>> > > > > > >>> > > > > > > Hi, >>> > > > > > > >>> > > > > > > I've read it through. Basically looks good with one comment. >>> > > > > > > *registerWatchedPath function* has a *Callable* as callback. >>> For >>> > > the >>> > > > > > > current implementation that would be enough, >>> > > > > > > but when somebody would like to use that for any other >>> use-case >>> > > then >>> > > > it >>> > > > > > > would be hard. Examples: >>> > > > > > > * A single *call* function tells the user nothing what kind >>> of >>> > > event >>> > > > is >>> > > > > > > this >>> > > > > > > * the watch service supports 3 events (create, modify, >>> delete), >>> > now >>> > > > > when >>> > > > > > I >>> > > > > > > register I can get only updates (I presume) >>> > > > > > > * 1+ dir watch with the same callback is not possible >>> > > > > > > >>> > > > > > > My suggestion would be to create a proper callback with all >>> the >>> > > event >>> > > > > > type >>> > > > > > > functions and no-op default behavior with the following >>> names[1]. >>> > > > > > > This is ~30 lines addition but will increase the readability >>> > > heavily >>> > > > + >>> > > > > no >>> > > > > > > need to touch this code later. >>> > > > > > > >>> > > > > > > BR, >>> > > > > > > G >>> > > > > > > >>> > > > > > > [1] >>> > > > > > > >>> > > > > > > >>> > > > > > >>> > > > > >>> > > > >>> > > >>> > >>> https://github.com/apache/flink-kubernetes-operator/commit/e5a325c48965a50d61d0aa29e61ba79e97f27082#diff-a30b3ed9b8c53e998b15d7da7ad2e54374c98ffc3c920f76a70bce3fb37a9b2eR87-R93 >>> > > > > > > >>> > > > > > > >>> > > > > > > On Tue, Jun 3, 2025 at 8:53 AM Nicolas Fraison >>> > > > > > > <nicolas.frai...@datadoghq.com.invalid> wrote: >>> > > > > > > >>> > > > > > > > Hi, >>> > > > > > > > >>> > > > > > > > The FLIP has been updated. >>> > > > > > > > Let me know if you have some other comments. >>> > > > > > > > >>> > > > > > > > Nicolas >>> > > > > > > > >>> > > > > > > > On Tue, May 20, 2025 at 12:01 PM Nicolas Fraison < >>> > > > > > > > nicolas.frai...@datadoghq.com> wrote: >>> > > > > > > > >>> > > > > > > > > Hi Gabor, >>> > > > > > > > > >>> > > > > > > > > Thanks for the feedback. >>> > > > > > > > > The overall proposal with atomic dirty flag being set by >>> the >>> > > > > callback >>> > > > > > > > will >>> > > > > > > > > indeed work with any kind of implementation. >>> > > > > > > > > >>> > > > > > > > > Will see to update the FLIP in a week or 2 if there are >>> no >>> > > other >>> > > > > > > > comments >>> > > > > > > > > on it >>> > > > > > > > > >>> > > > > > > > > Nicolas >>> > > > > > > > > >>> > > > > > > > > >>> > > > > > > > > On Tue, May 20, 2025 at 11:08 AM Gabor Somogyi < >>> > > > > > > > gabor.g.somo...@gmail.com> >>> > > > > > > > > wrote: >>> > > > > > > > > >>> > > > > > > > >> Hi Nicolas, >>> > > > > > > > >> >>> > > > > > > > >> Related SSLContext I've not gone through all the cases >>> where >>> > > we >>> > > > > need >>> > > > > > > to >>> > > > > > > > >> reload so instead I'm sharing the concept. >>> > > > > > > > >> The main intention is that we must control all execution >>> > paths >>> > > > > which >>> > > > > > > > >> decide >>> > > > > > > > >> which certificate used for authentication. >>> > > > > > > > >> Creating an SSLContext decorator which checks reload >>> first >>> > and >>> > > > > then >>> > > > > > > > >> forwards all calls to the original (wrapped) context >>> > > > > > > > >> is one way to achieve that. If there are different >>> > > > implementations >>> > > > > > > which >>> > > > > > > > >> end up in similar behavior then it's fine. >>> > > > > > > > >> >>> > > > > > > > >> BR, >>> > > > > > > > >> G >>> > > > > > > > >> >>> > > > > > > > >> >>> > > > > > > > >> On Tue, May 20, 2025 at 10:15 AM Nicolas Fraison >>> > > > > > > > >> <nicolas.frai...@datadoghq.com.invalid> wrote: >>> > > > > > > > >> >>> > > > > > > > >> > Hi, >>> > > > > > > > >> > >>> > > > > > > > >> > Overall your proposal looks great. >>> > > > > > > > >> > The event handling must indeed be super fast and we >>> must >>> > > also >>> > > > > not >>> > > > > > > > change >>> > > > > > > > >> > original code path if reload not needed >>> > > > > > > > >> > >>> > > > > > > > >> > I have some concerns around the ReloadableSSLContext >>> > > > > implementing >>> > > > > > > all >>> > > > > > > > >> > SSLContext >>> > > > > > > > >> > Do you really mean SSLContext (java SSLContext one) >>> or do >>> > > you >>> > > > > > refer >>> > > > > > > to >>> > > > > > > > >> the >>> > > > > > > > >> > SslContext from netty? >>> > > > > > > > >> > >>> > > > > > > > >> > If java SSLContext >>> > > > > > > > >> > - I'm not sure how this will manage reload from the >>> > > BlobServer >>> > > > > > > > >> > BlobServer relies on creation of an >>> SSLServerSocketFactory >>> > > > from >>> > > > > > the >>> > > > > > > > >> > SSLContext. >>> > > > > > > > >> > But from my current understanding the >>> > SSLServerSocketFactory >>> > > > > does >>> > > > > > > not >>> > > > > > > > >> have >>> > > > > > > > >> > any connection with the SSLContext. >>> > > > > > > > >> > It only has some with an SSLContextImpl extends >>> > > SSLContextSpi. >>> > > > > > > > >> > I think we will need a callback here to enforce >>> recreation >>> > > of >>> > > > > the >>> > > > > > > > >> > BlobServer socket. >>> > > > > > > > >> > - I'm also don't see how to attach this to the >>> SslContext >>> > > from >>> > > > > > netty >>> > > > > > > > >> > >>> > > > > > > > >> > If netty SslContext >>> > > > > > > > >> > - we would still need to have the callback to >>> recreate the >>> > > > > > > BlobServer >>> > > > > > > > >> > socket >>> > > > > > > > >> > - for netty and pekko we should be able to rely on it >>> > > > > > > > >> > >>> > > > > > > > >> > Nicolas >>> > > > > > > > >> > >>> > > > > > > > >> > On Wed, May 7, 2025 at 12:40 PM Gabor Somogyi < >>> > > > > > > > >> gabor.g.somo...@gmail.com> >>> > > > > > > > >> > wrote: >>> > > > > > > > >> > >>> > > > > > > > >> > > Hi All, >>> > > > > > > > >> > > >>> > > > > > > > >> > > I've read through the concerns/proposals related the >>> > watch >>> > > > > > service >>> > > > > > > > and >>> > > > > > > > >> > here >>> > > > > > > > >> > > are my conclusions: >>> > > > > > > > >> > > * Watch service can watch local file systems only: >>> It's >>> > > fair >>> > > > > to >>> > > > > > > say >>> > > > > > > > >> that >>> > > > > > > > >> > > certificates must be copied to local FS in order to >>> work >>> > > > (init >>> > > > > > > > >> container >>> > > > > > > > >> > cp >>> > > > > > > > >> > > command or something) >>> > > > > > > > >> > > * Watch service can send multiple events even for a >>> > single >>> > > > > > > directory >>> > > > > > > > >> > > change: this can be mitigated with a single atomic >>> dirty >>> > > > flag >>> > > > > > (see >>> > > > > > > > my >>> > > > > > > > >> > > design suggestion) >>> > > > > > > > >> > > * Polling file modification time: When I hear any >>> kind >>> > of >>> > > > > > polling >>> > > > > > > > >> > > implemented by us is just something I'm mostly >>> opposing >>> > > > > > > > >> > > * security.ssl.internal.keystore.reload.duration: >>> > Security >>> > > > > > > features >>> > > > > > > > >> must >>> > > > > > > > >> > be >>> > > > > > > > >> > > executed nearly immediately no matter what >>> > > > > > > > >> > > >>> > > > > > > > >> > > As a general remark, not sure how many users are >>> using >>> > the >>> > > > > watch >>> > > > > > > > >> service >>> > > > > > > > >> > > based approach in the operator but until now I've >>> not >>> > seen >>> > > > any >>> > > > > > > issue >>> > > > > > > > >> > > related to that. >>> > > > > > > > >> > > If somebody is having some specifics then please >>> share. >>> > > > > > > > >> > > >>> > > > > > > > >> > > For now I would shoot for keystore not to have >>> feature >>> > > > creep. >>> > > > > > When >>> > > > > > > > >> there >>> > > > > > > > >> > is >>> > > > > > > > >> > > a valid use-case, community interest and the >>> keystore >>> > > story >>> > > > is >>> > > > > > > > already >>> > > > > > > > >> > rock >>> > > > > > > > >> > > stable >>> > > > > > > > >> > > then we can consider to involve truststore later. >>> > > > > > > > >> > > >>> > > > > > > > >> > > Having the mentioned assumption that the operator >>> > approach >>> > > > > > works, >>> > > > > > > > >> here is >>> > > > > > > > >> > > my high level proposal: >>> > > > > > > > >> > > * Let's have enable flag for all such watch >>> > functionality. >>> > > > If >>> > > > > > it's >>> > > > > > > > >> false >>> > > > > > > > >> > > then the currently existing functionality must >>> remain >>> > > as-is >>> > > > > > > > >> > > * Let's have a LocalFSWatchService which is a >>> singleton >>> > > > which >>> > > > > > has >>> > > > > > > no >>> > > > > > > > >> path >>> > > > > > > > >> > > registrations by default >>> > > > > > > > >> > > * Add path registration functionality which is >>> > > synchronised >>> > > > > > where >>> > > > > > > a >>> > > > > > > > >> > > callback can be registered >>> > > > > > > > >> > > * Let's have a ReloadableSSLContext which >>> implements the >>> > > > > > mentioned >>> > > > > > > > >> > callback >>> > > > > > > > >> > > * Inside the callback set an atomic dirty flag only >>> > (this >>> > > > can >>> > > > > > > handle >>> > > > > > > > >> > > multiple events for the same directory change + >>> event >>> > > > handling >>> > > > > > in >>> > > > > > > > the >>> > > > > > > > >> > watch >>> > > > > > > > >> > > service must be extreme fast) >>> > > > > > > > >> > > * Inside ReloadableSSLContext all SSLContext actions >>> > must >>> > > be >>> > > > > > > > >> overridden. >>> > > > > > > > >> > At >>> > > > > > > > >> > > the beginning of each function dirty flag must be >>> > checked >>> > > > > > > > >> > > and if dirty then certificates must be reloaded, >>> flag >>> > can >>> > > be >>> > > > > set >>> > > > > > > > back >>> > > > > > > > >> to >>> > > > > > > > >> > > false (then original functionality call). It's >>> extremely >>> > > > > > important >>> > > > > > > > >> that >>> > > > > > > > >> > > context reload must be synchronised. >>> > > > > > > > >> > > A synchronised boolean check + possible context >>> reload >>> > can >>> > > > > > consume >>> > > > > > > > >> some >>> > > > > > > > >> > > time but I wouldn't expect any significant >>> performance >>> > > drop. >>> > > > > > > > >> > > >>> > > > > > > > >> > > My proposal main drivers: >>> > > > > > > > >> > > * Original code path must run when no watch service >>> > asked >>> > > > > > > > >> > > * Super fast event handling because million events >>> may >>> > > come >>> > > > in >>> > > > > > > (not >>> > > > > > > > >> > > expecting but we should be prepared) >>> > > > > > > > >> > > * Clean separation between dir/file watch and >>> file/dir >>> > > usage >>> > > > > > > > >> > > * Well considered synchronisation model >>> > > > > > > > >> > > * Extensive unit testing because we're intended to >>> touch >>> > > the >>> > > > > > heart >>> > > > > > > > of >>> > > > > > > > >> > Flink >>> > > > > > > > >> > > >>> > > > > > > > >> > > Happy to hear other opinions. >>> > > > > > > > >> > > >>> > > > > > > > >> > > BR, >>> > > > > > > > >> > > G >>> > > > > > > > >> > > >>> > > > > > > > >> > > >>> > > > > > > > >> > > On Mon, Apr 28, 2025 at 1:54 PM Nicolas Fraison >>> > > > > > > > >> > > <nicolas.frai...@datadoghq.com.invalid> wrote: >>> > > > > > > > >> > > >>> > > > > > > > >> > > > Thanks all for your feedback and sorry for the >>> late >>> > > answer >>> > > > > (I >>> > > > > > > was >>> > > > > > > > on >>> > > > > > > > >> > > > holiday). >>> > > > > > > > >> > > > >>> > > > > > > > >> > > > 1. Indeed it would add 4 threads. >>> > > > > > > > >> > > > For the non pekko component we can indeed have one >>> > > watcher >>> > > > > > > service >>> > > > > > > > >> used >>> > > > > > > > >> > > to >>> > > > > > > > >> > > > reload SSLContext for those components >>> > > > > > > > >> > > > For pekko this is a little more challenging as the >>> > > > creation >>> > > > > of >>> > > > > > > the >>> > > > > > > > >> > pekko >>> > > > > > > > >> > > > ssl engine is managed by pekko himself. >>> > > > > > > > >> > > > Flink only generates appropriate config with >>> class to >>> > > > > execute >>> > > > > > to >>> > > > > > > > >> > initiate >>> > > > > > > > >> > > > the pekko ssl engine [1]. This means that I will >>> not >>> > be >>> > > > able >>> > > > > > to >>> > > > > > > > >> provide >>> > > > > > > > >> > > the >>> > > > > > > > >> > > > watcher service to this ssl engine. >>> > > > > > > > >> > > > One solution would be to rely on a singleton >>> instead >>> > of >>> > > a >>> > > > > > > service >>> > > > > > > > >> > > injected >>> > > > > > > > >> > > > in each component but I'm not sure this is fine >>> to use >>> > > > such >>> > > > > in >>> > > > > > > > >> flink. >>> > > > > > > > >> > > > WDYT? >>> > > > > > > > >> > > > >>> > > > > > > > >> > > > We can also add a specific flink configuration >>> > > > > > > > >> > > > (security.ssl.internal.keystore.reload.enable) to >>> only >>> > > add >>> > > > > > this >>> > > > > > > > >> watcher >>> > > > > > > > >> > > > mechanism if the config is enabled to avoid adding >>> > those >>> > > > > > threads >>> > > > > > > > if >>> > > > > > > > >> > this >>> > > > > > > > >> > > is >>> > > > > > > > >> > > > not needed. >>> > > > > > > > >> > > > >>> > > > > > > > >> > > > >>> > > > > > > > >> > > > 2. I'm fine with the LocalFSWatchService naming. >>> > > > > > > > >> > > > >>> > > > > > > > >> > > > 3. Also agree that some e2e tests must be added. >>> > > > > > > > >> > > > >>> > > > > > > > >> > > > >>> > > > > > > > >> > > > @doguscan namal, Thanks for challenging the >>> proposal. >>> > > > > > > > >> > > > FYI, we are planning to rely on certificates with >>> > really >>> > > > > short >>> > > > > > > > >> > validity. >>> > > > > > > > >> > > > >>> > > > > > > > >> > > > D1. It seems that the proposal to rely on a reload >>> > > period >>> > > > > will >>> > > > > > > > still >>> > > > > > > > >> > face >>> > > > > > > > >> > > > potential issues: >>> > > > > > > > >> > > > - receiving events while file content updates are >>> > still >>> > > in >>> > > > > > > > progress: >>> > > > > > > > >> > > > There is no guarantee that we will not load the >>> > > > certificate >>> > > > > > > while >>> > > > > > > > >> file >>> > > > > > > > >> > > > content updates are still in progress >>> > > > > > > > >> > > > - while it will not be affected by multiple >>> > > notifications, >>> > > > > we >>> > > > > > > can >>> > > > > > > > >> > reach a >>> > > > > > > > >> > > > point where only the truststore is updated when >>> the >>> > > reload >>> > > > > > > > happens. >>> > > > > > > > >> > > > Which means that if the keystore is updated just >>> > after, >>> > > it >>> > > > > > will >>> > > > > > > > not >>> > > > > > > > >> be >>> > > > > > > > >> > > > taken in account before next run of the reload >>> > mechanism >>> > > > > > > > >> > > > >>> > > > > > > > >> > > > I think that with WatchService and appropriate >>> reload >>> > > > grace >>> > > > > > > period >>> > > > > > > > >> > > > mechanism we should be able to mitigate those 2 >>> issues >>> > > > > > (ensuring >>> > > > > > > > >> > minimum >>> > > > > > > > >> > > > reload even with multiple notify) >>> > > > > > > > >> > > > >>> > > > > > > > >> > > > From KIP-1119 [2] it looks like the same kind of >>> > > > > requirements >>> > > > > > is >>> > > > > > > > >> under >>> > > > > > > > >> > > > discussion for Kafka to also rely on the >>> WatchService >>> > > Java >>> > > > > API >>> > > > > > > > >> > > (SpringBoot >>> > > > > > > > >> > > > seems to also rely on this API to manage ssl >>> reload >>> > > [3]). >>> > > > > > > > >> > > > >>> > > > > > > > >> > > > D3. Do we have a real use case for this? >>> > > > > > > > >> > > > >>> > > > > > > > >> > > > [1] >>> > > > > > > > >> > > > >>> > > > > > > > >> > > > >>> > > > > > > > >> > > >>> > > > > > > > >> > >>> > > > > > > > >> >>> > > > > > > > >>> > > > > > > >>> > > > > > >>> > > > > >>> > > > >>> > > >>> > >>> https://github.com/apache/flink/blob/b523264ab45d37cd9584a0e8c06f1ef6bd1aaed7/flink-rpc/flink-rpc-akka/src/main/java/org/apache/flink/runtime/rpc/pekko/PekkoUtils.java#L372 >>> > > > > > > > >> > > > >>> > > > > > > > >> > > > [2] >>> > > > > > > > >> > > > >>> > > > > > > > >> > > > >>> > > > > > > > >> > > >>> > > > > > > > >> > >>> > > > > > > > >> >>> > > > > > > > >>> > > > > > > >>> > > > > > >>> > > > > >>> > > > >>> > > >>> > >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1119%3A+Add+support+for+SSL+auto+reload >>> > > > > > > > >> > > > >>> > > > > > > > >> > > > [3] >>> > > > > > > > >> > > > >>> > > > > > > > >> > > > >>> > > > > > > > >> > > >>> > > > > > > > >> > >>> > > > > > > > >> >>> > > > > > > > >>> > > > > > > >>> > > > > > >>> > > > > >>> > > > >>> > > >>> > >>> https://docs.spring.io/spring-boot/reference/features/ssl.html#features.ssl.reloading >>> > > > > > > > >> > > > >>> > > > > > > > >> > > > Regards, >>> > > > > > > > >> > > > Nicolas >>> > > > > > > > >> > > > >>> > > > > > > > >> > > > On Wed, Apr 23, 2025 at 12:14 PM Doğuşcan Namal < >>> > > > > > > > >> > > namal.dogus...@gmail.com> >>> > > > > > > > >> > > > wrote: >>> > > > > > > > >> > > > >>> > > > > > > > >> > > > > Hi Nicolas, thanks for the FLIP. >>> > > > > > > > >> > > > > >>> > > > > > > > >> > > > > I am fully supportive of the motivation and we >>> > should >>> > > be >>> > > > > > > > >> supporting >>> > > > > > > > >> > > this >>> > > > > > > > >> > > > > feature. Here are couple of comments from my >>> side: >>> > > > > > > > >> > > > > >>> > > > > > > > >> > > > > D1) >>> > > > > > > > >> > > > > Since you shared the implementation details on >>> the >>> > > FLIP >>> > > > as >>> > > > > > > > well, I >>> > > > > > > > >> > > would >>> > > > > > > > >> > > > > like to discuss whether using Java's >>> WatchService is >>> > > the >>> > > > > > best >>> > > > > > > > >> choice >>> > > > > > > > >> > > > here. >>> > > > > > > > >> > > > > >>> > > > > > > > >> > > > > I believe that the certificate renewal is not a >>> > > frequent >>> > > > > > > > >> operation. >>> > > > > > > > >> > > Even >>> > > > > > > > >> > > > > once a day certificate renewals are not >>> > > realistic(except >>> > > > > for >>> > > > > > > > test >>> > > > > > > > >> > cases >>> > > > > > > > >> > > > > maybe) but let's assume that this covers up the >>> > p99.99 >>> > > > of >>> > > > > > the >>> > > > > > > > use >>> > > > > > > > >> > > cases. >>> > > > > > > > >> > > > I >>> > > > > > > > >> > > > > am confident on this estimation since there >>> hasn't >>> > > been >>> > > > a >>> > > > > > > > request >>> > > > > > > > >> > from >>> > > > > > > > >> > > > the >>> > > > > > > > >> > > > > community for this feature so far, which >>> confirms >>> > that >>> > > > > > people >>> > > > > > > > were >>> > > > > > > > >> > okay >>> > > > > > > > >> > > > > with infrequent cluster restarts. Following >>> that it >>> > is >>> > > > > > > > >> infrequent, I >>> > > > > > > > >> > > > > believe that spawning up a thread that watches >>> the >>> > > file >>> > > > > > > > >> modification >>> > > > > > > > >> > > > > operations is not the best use of the limited >>> > > resources >>> > > > > on a >>> > > > > > > > >> cluster. >>> > > > > > > > >> > > > > >>> > > > > > > > >> > > > > There are some known limitations of the >>> WatchService >>> > > as >>> > > > > well >>> > > > > > > > such >>> > > > > > > > >> as >>> > > > > > > > >> > > > > receiving multiple modification events for the >>> same >>> > > > > > occurence >>> > > > > > > > [1], >>> > > > > > > > >> > > > > inotify's (WatchService's underlying mechanism >>> in >>> > > Linux >>> > > > > > > > >> environments) >>> > > > > > > > >> > > > > problems on containerized environments due to >>> remote >>> > > > file >>> > > > > > > > systems >>> > > > > > > > >> [2] >>> > > > > > > > >> > > or >>> > > > > > > > >> > > > > receiving events while file content updates are >>> > still >>> > > in >>> > > > > > > > progress. >>> > > > > > > > >> > > [3]. I >>> > > > > > > > >> > > > > do not know if these limitations are addressed >>> in >>> > the >>> > > > > newer >>> > > > > > > > >> versions >>> > > > > > > > >> > > but >>> > > > > > > > >> > > > > regardless of that it is clear that we may face >>> with >>> > > > some >>> > > > > > ugly >>> > > > > > > > >> edge >>> > > > > > > > >> > > cases >>> > > > > > > > >> > > > > due to that. >>> > > > > > > > >> > > > > >>> > > > > > > > >> > > > > Given these complications, I would recommend >>> just >>> > > > > creating a >>> > > > > > > new >>> > > > > > > > >> > > > SSLContext >>> > > > > > > > >> > > > > after a configured duration is expired. We could >>> > > record >>> > > > > the >>> > > > > > > > >> timestamp >>> > > > > > > > >> > > > when >>> > > > > > > > >> > > > > the previous SSLContext is created and update it >>> > > after a >>> > > > > > > > >> configured >>> > > > > > > > >> > > > > duration is passed. This will be much easier to >>> test >>> > > and >>> > > > > > > reason >>> > > > > > > > >> about >>> > > > > > > > >> > > > when >>> > > > > > > > >> > > > > it is running on production. This will >>> eliminate the >>> > > > > > necessity >>> > > > > > > > to >>> > > > > > > > >> > > reason >>> > > > > > > > >> > > > > about the file modification operations as well. >>> > > > > > > > >> > > > > >>> > > > > > > > >> > > > > I briefly skimmed through the classes that need >>> to >>> > be >>> > > > > > modified >>> > > > > > > > >> and it >>> > > > > > > > >> > > > > looked feasible for me. Let me know what are >>> your >>> > > > comments >>> > > > > > on >>> > > > > > > > >> these. >>> > > > > > > > >> > > > > >>> > > > > > > > >> > > > > Note that this is already used in the Kafka >>> world >>> > > where >>> > > > a >>> > > > > > new >>> > > > > > > > >> > > SSLContext >>> > > > > > > > >> > > > is >>> > > > > > > > >> > > > > created after 12 hours. >>> > > > > > > > >> > > > > >>> > > > > > > > >> > > > > D2) We could provide a configuration to the >>> user, >>> > such >>> > > > as >>> > > > > > > > >> > > > > >>> "security.ssl.internal.keystore.reload.duration" so >>> > > they >>> > > > > > could >>> > > > > > > > >> decide >>> > > > > > > > >> > > how >>> > > > > > > > >> > > > > often the new certificates should be loaded. >>> > > > > > > > >> > > > > >>> > > > > > > > >> > > > > D3) >>> > > > > > > > >> > > > > On the other hand, I wonder whether we should >>> also >>> > > > handle >>> > > > > > > > >> supporting >>> > > > > > > > >> > > > > updating the file paths of the truststores and >>> > > keystores >>> > > > > > under >>> > > > > > > > >> this >>> > > > > > > > >> > > FLIP >>> > > > > > > > >> > > > as >>> > > > > > > > >> > > > > well. Since the name of the FLIP is "Handle TLS >>> > > > > Certificate >>> > > > > > > > >> Renewal" >>> > > > > > > > >> > I >>> > > > > > > > >> > > > > think we could bring that into scope too :) >>> > > > > > > > >> > > > > >>> > > > > > > > >> > > > > Thanks >>> > > > > > > > >> > > > > >>> > > > > > > > >> > > > > [1] >>> > > > > > > > >> > > > > >>> > > > > > > > >> > > > > >>> > > > > > > > >> > > > >>> > > > > > > > >> > > >>> > > > > > > > >> > >>> > > > > > > > >> >>> > > > > > > > >>> > > > > > > >>> > > > > > >>> > > > > >>> > > > >>> > > >>> > >>> https://stackoverflow.com/questions/16777869/java-7-watchservice-ignoring-multiple-occurrences-of-the-same-event >>> > > > > > > > >> > > > > >>> > > > > > > > >> > > > > [2] >>> > > > > > > > >>> https://blog.arkey.fr/2019/09/13/watchservice-and-bind-mount/ >>> > > > > > > > >> > > > > >>> > > > > > > > >> > > > > [3] >>> > > > > > > > >> > > > >>> > > > > > > > >> >>> > > > > > >>> > > >>> https://surajatreyac.github.io/2014-07-29/reactive_file_handling.html >>> > > > > > > > >> > > > > >>> > > > > > > > >> > > > > [4] See also Platform Dependencies - >>> > > > > > > > >> > > > > >>> > > > > > > > >> > > > > >>> > > > > > > > >> > > > >>> > > > > > > > >> > > >>> > > > > > > > >> > >>> > > > > > > > >> >>> > > > > > > > >>> > > > > > > >>> > > > > > >>> > > > > >>> > > > >>> > > >>> > >>> https://docs.oracle.com/javase/8/docs/api/index.html?java/nio/file/WatchService.html >>> > > > > > > > >> > > > > >>> > > > > > > > >> > > > > >>> > > > > > > > >> > > > > On Fri, 18 Apr 2025 at 18:25, Gabor Somogyi < >>> > > > > > > > >> > gabor.g.somo...@gmail.com >>> > > > > > > > >> > > > >>> > > > > > > > >> > > > > wrote: >>> > > > > > > > >> > > > > >>> > > > > > > > >> > > > > > Hi Robert, >>> > > > > > > > >> > > > > > >>> > > > > > > > >> > > > > > Since I've added the same feature to the >>> operator >>> > > I'll >>> > > > > > take >>> > > > > > > a >>> > > > > > > > >> look >>> > > > > > > > >> > at >>> > > > > > > > >> > > > it. >>> > > > > > > > >> > > > > > Though it won't be lightning fast since I'm >>> having >>> > > > > several >>> > > > > > > > weeks >>> > > > > > > > >> > off. >>> > > > > > > > >> > > > > > >>> > > > > > > > >> > > > > > Your questions are valid especially >>> considering >>> > the >>> > > > fact >>> > > > > > > that >>> > > > > > > > >> this >>> > > > > > > > >> > > > > feature >>> > > > > > > > >> > > > > > touches the hearth of the authentication so >>> this >>> > > must >>> > > > be >>> > > > > > > rock >>> > > > > > > > >> solid >>> > > > > > > > >> > > > > > in order to avoid grey hair :) >>> > > > > > > > >> > > > > > >>> > > > > > > > >> > > > > > (1) I would vote on a single service which is >>> > > heavily >>> > > > > unit >>> > > > > > > > >> tested >>> > > > > > > > >> > > with >>> > > > > > > > >> > > > > > all the possible combinations including >>> threading. >>> > > > > > > > >> > > > > > >>> > > > > > > > >> > > > > > Some standalone app could be added to really >>> play >>> > > with >>> > > > > it >>> > > > > > > > (that >>> > > > > > > > >> > would >>> > > > > > > > >> > > > > help >>> > > > > > > > >> > > > > > review). >>> > > > > > > > >> > > > > > I mean, create X files, start Y threads, and >>> make >>> > > > > > > assertions. >>> > > > > > > > >> > > > > > The reason why I'm suggesting it is the fact >>> that >>> > > > AFAIR >>> > > > > > the >>> > > > > > > > >> watch >>> > > > > > > > >> > > > service >>> > > > > > > > >> > > > > > is quite sensitive even in single thread. If >>> we >>> > > could >>> > > > do >>> > > > > > > this >>> > > > > > > > >> in a >>> > > > > > > > >> > > > finite >>> > > > > > > > >> > > > > > time >>> > > > > > > > >> > > > > > consuming unit test then it's even better. >>> > > > > > > > >> > > > > > >>> > > > > > > > >> > > > > > (2) +1 on that name to avoid confusion >>> > > > > > > > >> > > > > > >>> > > > > > > > >> > > > > > (3) I agree that some e2e is must, however >>> this >>> > can >>> > > be >>> > > > > > > easily >>> > > > > > > > >> and >>> > > > > > > > >> > > > deeply >>> > > > > > > > >> > > > > > unit >>> > > > > > > > >> > > > > > tested so that part is also essential. One key >>> > test >>> > > > here >>> > > > > > is >>> > > > > > > > when >>> > > > > > > > >> > > > > > certificates >>> > > > > > > > >> > > > > > are not changing then no action must be >>> performed >>> > > (not >>> > > > > to >>> > > > > > > > break >>> > > > > > > > >> the >>> > > > > > > > >> > > > whole >>> > > > > > > > >> > > > > > system apart). >>> > > > > > > > >> > > > > > >>> > > > > > > > >> > > > > > Purely personal opinion but such feature >>> > > developments >>> > > > > are >>> > > > > > > slow >>> > > > > > > > >> by >>> > > > > > > > >> > > > nature >>> > > > > > > > >> > > > > > because >>> > > > > > > > >> > > > > > of edge case / stress testing. >>> > > > > > > > >> > > > > > >>> > > > > > > > >> > > > > > BR, >>> > > > > > > > >> > > > > > G >>> > > > > > > > >> > > > > > >>> > > > > > > > >> > > > > > >>> > > > > > > > >> > > > > > On Fri, Apr 18, 2025 at 4:53 PM Robert >>> Metzger < >>> > > > > > > > >> > rmetz...@apache.org> >>> > > > > > > > >> > > > > > wrote: >>> > > > > > > > >> > > > > > >>> > > > > > > > >> > > > > > > Hi Nicolas, >>> > > > > > > > >> > > > > > > >>> > > > > > > > >> > > > > > > This looks like a nice improvement, thanks >>> for >>> > the >>> > > > > write >>> > > > > > > up. >>> > > > > > > > >> > > > > > > Are you in touch with any committer who's >>> > willing >>> > > to >>> > > > > > > review >>> > > > > > > > / >>> > > > > > > > >> > merge >>> > > > > > > > >> > > > > this? >>> > > > > > > > >> > > > > > > >>> > > > > > > > >> > > > > > > Some random questions on the FLIP: >>> > > > > > > > >> > > > > > > (1) "Each service that depends on TLS >>> > > certificates >>> > > > > will >>> > > > > > > > >> > > initialize a >>> > > > > > > > >> > > > > > > FileSytemWatchService" >>> > > > > > > > >> > > > > > > >>> > > > > > > > >> > > > > > > It seems that there are 4 components using >>> SSL, >>> > > does >>> > > > > > this >>> > > > > > > > mean >>> > > > > > > > >> > > there >>> > > > > > > > >> > > > > will >>> > > > > > > > >> > > > > > > be 4 additional threads running, watching >>> the >>> > same >>> > > > set >>> > > > > > of >>> > > > > > > > >> files? >>> > > > > > > > >> > > > > > > Wouldn't it be better to introduce a central >>> > file >>> > > > > > watching >>> > > > > > > > >> > service, >>> > > > > > > > >> > > > and >>> > > > > > > > >> > > > > > SSL >>> > > > > > > > >> > > > > > > users can subscribe to updates, to reduce >>> the >>> > > number >>> > > > > of >>> > > > > > > > >> threads? >>> > > > > > > > >> > > > > > > If this makes the whole effort 4x more >>> > > complicated, >>> > > > I >>> > > > > > > > wouldn't >>> > > > > > > > >> > > > consider >>> > > > > > > > >> > > > > > it, >>> > > > > > > > >> > > > > > > but if its roughly the same effort, we >>> should :) >>> > > > > > > > >> > > > > > > >>> > > > > > > > >> > > > > > > (2) "FileSytemWatchService" >>> > > > > > > > >> > > > > > > When I read this name, I was wondering, >>> whether >>> > > this >>> > > > > is >>> > > > > > > > >> somehow >>> > > > > > > > >> > > > related >>> > > > > > > > >> > > > > > to >>> > > > > > > > >> > > > > > > the Flink "FileSystem" classes. Which I >>> think >>> > its' >>> > > > > not. >>> > > > > > > > >> > > > > > > Maybe a different name, that makes this >>> > separation >>> > > > > more >>> > > > > > > > >> explicit, >>> > > > > > > > >> > > > would >>> > > > > > > > >> > > > > > > make sense. Maybe "LocalFSWatchService"? >>> > > > > > > > >> > > > > > > (I'm sorry to bring up naming stuff -- its >>> very >>> > > > > > > subjective, >>> > > > > > > > >> and >>> > > > > > > > >> > > > > > difficult) >>> > > > > > > > >> > > > > > > >>> > > > > > > > >> > > > > > > (3) For the test plan: There seem to be >>> some SSL >>> > > > > related >>> > > > > > > e2e >>> > > > > > > > >> > tests: >>> > > > > > > > >> > > > > > > >>> > > > > > > > >> > > > > > > >>> > > > > > > > >> > > > > > >>> > > > > > > > >> > > > > >>> > > > > > > > >> > > > >>> > > > > > > > >> > > >>> > > > > > > > >> > >>> > > > > > > > >> >>> > > > > > > > >>> > > > > > > >>> > > > > > >>> > > > > >>> > > > >>> > > >>> > >>> https://github.com/apache/flink/blob/master/flink-end-to-end-tests/test-scripts/common_ssl.sh >>> > > > > > > > >> > > > > > > It would be nice to extend them to cover >>> this >>> > > > feature >>> > > > > as >>> > > > > > > > >> well. I >>> > > > > > > > >> > > > would >>> > > > > > > > >> > > > > > hate >>> > > > > > > > >> > > > > > > for this feature to slowly break by future >>> > > changes, >>> > > > so >>> > > > > > > good >>> > > > > > > > >> e2e >>> > > > > > > > >> > > test >>> > > > > > > > >> > > > > > > coverage is key, in particular bc so many >>> > > components >>> > > > > are >>> > > > > > > > >> > involved. >>> > > > > > > > >> > > > > > > >>> > > > > > > > >> > > > > > > Best, >>> > > > > > > > >> > > > > > > Robert >>> > > > > > > > >> > > > > > > >>> > > > > > > > >> > > > > > > On Wed, Apr 16, 2025 at 11:55 AM Nicolas >>> Fraison >>> > > > > > > > >> > > > > > > <nicolas.frai...@datadoghq.com.invalid> >>> wrote: >>> > > > > > > > >> > > > > > > >>> > > > > > > > >> > > > > > > > Hi All, >>> > > > > > > > >> > > > > > > > >>> > > > > > > > >> > > > > > > > I'd like to start a discussion to Handle >>> TLS >>> > > > > > Certificate >>> > > > > > > > >> > Renewal >>> > > > > > > > >> > > > > > > > Please provide some feedback on this >>> proposal: >>> > > > > > > > >> > > > > > > > >>> > > > > > > > >> > > > > > > > >>> > > > > > > > >> > > > > > > >>> > > > > > > > >> > > > > > >>> > > > > > > > >> > > > > >>> > > > > > > > >> > > > >>> > > > > > > > >> > > >>> > > > > > > > >> > >>> > > > > > > > >> >>> > > > > > > > >>> > > > > > > >>> > > > > > >>> > > > > >>> > > > >>> > > >>> > >>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-523%3A+Handle+TLS+Certificate+Renewal >>> > > > > > > > >> > > > > > > > >>> > > > > > > > >> > > > > > > > Regards, >>> > > > > > > > >> > > > > > > > >>> > > > > > > > >> > > > > > > > Nicolas Fraison >>> > > > > > > > >> > > > > > > > >>> > > > > > > > >> > > > > > > >>> > > > > > > > >> > > > > > >>> > > > > > > > >> > > > > >>> > > > > > > > >> > > > >>> > > > > > > > >> > > >>> > > > > > > > >> > >>> > > > > > > > >> >>> > > > > > > > > >>> > > > > > > > >>> > > > > > > >>> > > > > > >>> > > > > >>> > > > >>> > > >>> > >>> >>