Hi all,

I was just looking into an old issue again: SerializablePipelineOptions calling FileSystems.setDefaultPipelineOptions on deserialization [1]. This applies to various runners, including Flink and Spark, but not Dataflow as far as I know.

Problem: The current initialization of FileSystems through FileSystems.setDefaultPipelineOptions is problematic and prone to race conditions, especially when triggered on deserialization of SerializablePipelineOptions (see [1], [2], [3]). Beyond that, there is an inherent risk of leaking resources: without a well-defined lifecycle for file systems, existing ones are silently replaced on every invocation of FileSystems.setDefaultPipelineOptions without properly closing their attached resources. This is particularly troublesome with S3FileSystem, as it might leak threads [4].

Possible solutions: In the best case, pipeline options would be read-only once a pipeline is running, making it simple and safe to initialize file systems just once per running pipeline on each worker. However, that is likely not the case for some user pipelines, and even runners mutate pipeline options. Also, as far as I can see, removing FileSystems.setDefaultPipelineOptions from deserialization in SerializablePipelineOptions is unlikely to happen any time soon, as it requires a coordinated push across various runners and it is not obvious where and when initialization is supposed to happen for each runner.

With the above in mind, it would at least be possible to safely limit repeated initialization of file systems to the cases where it is actually necessary by tracking the revision of pipeline options (using a monotonically increasing revision number that is incremented on every update); see this draft PR [5].

Happy to hear your thoughts or alternative approaches on this.

Best,
Moritz

[1] https://github.com/apache/beam/issues/18430
[2] https://issues.apache.org/jira/browse/BEAM-14465
[3] https://issues.apache.org/jira/browse/BEAM-14355
[4] https://github.com/apache/beam/issues/26321
[5] https://github.com/apache/beam/pull/26694
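
P.S. To make the revision idea above a bit more concrete, here is a minimal sketch of the general technique. It is not taken from the draft PR and uses no Beam APIs: the class RevisionGuard and its methods are hypothetical names chosen for illustration, and the Runnable stands in for whatever actually (re)creates the file systems. The assumption is simply that every update of the options carries a monotonically increasing revision number.

```java
/**
 * Hypothetical sketch (not Beam code): run an expensive initialization only
 * when a strictly newer options revision is seen.
 */
final class RevisionGuard {
  // Highest revision applied so far; -1 means "never initialized".
  private long appliedRevision = -1;

  /**
   * Runs {@code init} only if {@code revision} is newer than the last applied
   * one. Synchronized so that concurrent deserializations cannot race.
   *
   * @return true if initialization ran, false if the call was skipped
   */
  synchronized boolean initializeIfNewer(long revision, Runnable init) {
    if (revision <= appliedRevision) {
      // Stale or repeated revision: keep what is already initialized.
      return false;
    }
    // If init throws, appliedRevision is left unchanged so a retry is possible.
    init.run();
    appliedRevision = revision;
    return true;
  }

  synchronized long appliedRevision() {
    return appliedRevision;
  }
}
```

With such a guard, deserializing the same options many times on a worker would trigger initialization once per revision rather than once per deserialization. It does not by itself address closing the resources of the replaced file systems.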