On Mon, May 15, 2023 at 8:38 AM Moritz Mack <mm...@talend.com> wrote:
>
> Hi all,
>
> I was just looking into an old issue again, SerializablePipelineOptions
> calling FileSystems.setDefaultPipelineOptions on deserialization [1]. This
> applies to various runners including Flink and Spark, but not Dataflow as far
> as I know.
>
> Problem:
>
> Current initialization of FileSystems through
> FileSystems.setDefaultPipelineOptions is rather problematic and prone to race
> conditions, especially when triggered on deserialization of
> SerializablePipelineOptions (see [1], [2], [3]).
>
> Even further, there’s also an inherent risk of leaking resources that way:
> Without a well-defined lifecycle for file systems, existing ones are just
> silently replaced on every invocation of
> FileSystems.setDefaultPipelineOptions without adequately closing attached
> resources. Particularly with S3FileSystem this is troublesome as it might
> leak threads [4].
>
> Possible solutions:
>
> In the best-case pipeline options would be read only as soon as a pipeline is
> running, making it simple and safe to initialize file systems just once per
> running pipeline on each worker. Though, that’s likely not the case for some
> user pipelines and even runners do mutate pipeline options.

Would it be possible to validate this? E.g. what if we made it illegal to
call this more than once (or, at least, without a corresponding unset
operation that could be used for tests)?

> Also, as far as I can see, removing FileSystems.setDefaultPipelineOptions
> from deserialization in SerializablePipelineOptions is unlikely to happen any
> time soon as it requires a coordinated push across various runners and it’s
> not obvious where and when initialization is supposed to happen for each
> runner.
>
> With the above in mind, it would be at least possible to safely limit
> repeated initialization of file systems to cases when necessary if tracking
> the revision of pipeline options (using monotonically increasing revision
> numbers for every update), see this draft PR [5].

If the above is too hard to untangle, this could be a reasonable workaround.

> Happy to hear your thoughts or alternative approaches on this.

Thanks for taking this on.

> [1] https://github.com/apache/beam/issues/18430
>
> [2] https://issues.apache.org/jira/browse/BEAM-14465
>
> [3] https://issues.apache.org/jira/browse/BEAM-14355
>
> [4] https://github.com/apache/beam/issues/26321
>
> [5] https://github.com/apache/beam/pull/26694
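
To make the revision-tracking idea a bit more concrete, here is a rough sketch of how such a guard could look. This is only an illustration under my own assumptions, not the code in the draft PR: the class RevisionGuardedInit, its methods initIfNewer and resetForTest, and the idea of passing the revision as a plain long are all hypothetical names, not Beam APIs. The init callback stands in for whatever would actually (re)register the file systems.

```java
/** Hypothetical sketch; none of these names are Beam APIs. */
final class RevisionGuardedInit {
  // Highest options revision applied so far; -1 means "never initialized".
  private static long appliedRevision = -1;

  private RevisionGuardedInit() {}

  /**
   * Runs init only if revision is strictly newer than any revision applied
   * before. Synchronized so that concurrent deserializations cannot
   * interleave two initializations. Returns true if init ran.
   */
  static synchronized boolean initIfNewer(long revision, Runnable init) {
    if (revision <= appliedRevision) {
      return false; // same or stale options: keep what is already set up
    }
    init.run(); // stand-in for the real file system (re)initialization
    appliedRevision = revision;
    return true;
  }

  /** The "unset" operation, intended for tests only. */
  static synchronized void resetForTest() {
    appliedRevision = -1;
  }

  public static void main(String[] args) {
    Runnable init = () -> System.out.println("initializing file systems");
    System.out.println(initIfNewer(1, init)); // first revision seen
    System.out.println(initIfNewer(1, init)); // same revision again
    System.out.println(initIfNewer(2, init)); // options were updated
  }
}
```

With a guard like this, repeated deserialization of unchanged options would be a no-op, and only a genuinely newer revision would trigger re-initialization. It does not by itself close resources held by the file systems being replaced, so the leak described in [4] would still need separate handling.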