Hi all,

I was just looking into an old issue again, SerializablePipelineOptions calling 
FileSystems.setDefaultPipelineOptions on deserialization [1]. This applies to 
various runners including Flink and Spark, but not Dataflow as far as I know.

Problem:

Current initialization of FileSystems through 
FileSystems.setDefaultPipelineOptions is rather problematic and prone to race 
conditions, especially when triggered on deserialization of 
SerializablePipelineOptions (see [1], [2], [3]).

There’s also an inherent risk of leaking resources: without a well-defined
lifecycle for file systems, existing ones are silently replaced on every
invocation of FileSystems.setDefaultPipelineOptions without closing their
attached resources. This is particularly troublesome with S3FileSystem, as it
might leak threads [4].

Possible solutions:

In the best case, pipeline options would be read-only once a pipeline is
running, making it simple and safe to initialize file systems just once per
running pipeline on each worker. However, that’s likely not the case for some
user pipelines, and even runners mutate pipeline options.

Also, as far as I can see, removing FileSystems.setDefaultPipelineOptions from
deserialization in SerializablePipelineOptions is unlikely to happen any time
soon, as it requires a coordinated push across various runners, and it’s not
obvious where and when initialization is supposed to happen for each runner.

With the above in mind, it would at least be possible to safely limit repeated
initialization of file systems to the cases where it is actually necessary by
tracking the revision of pipeline options (a monotonically increasing revision
number, incremented on every update); see this draft PR [5].
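
To make the idea a bit more concrete, here is a minimal sketch of revision
tracking. This is not the code from the draft PR and not Beam API: the names
RevisionedOptions, RevisionGuard and initializeIfNewer are hypothetical and
only illustrate the principle, assuming a single options instance per worker.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for pipeline options; not a Beam class. */
final class RevisionedOptions {
  private final Map<String, String> values = new HashMap<>();
  private long revision = 0; // bumped on every update, never decreases

  synchronized void set(String key, String value) {
    values.put(key, value);
    revision++;
  }

  synchronized long revision() {
    return revision;
  }
}

/** Hypothetical guard that skips re-initialization for known revisions. */
final class RevisionGuard {
  private long appliedRevision = -1; // highest revision initialized so far

  /** Runs init only if the given revision is newer than the applied one. */
  synchronized boolean initializeIfNewer(long revision, Runnable init) {
    if (revision <= appliedRevision) {
      return false; // same or stale options: keep the existing file systems
    }
    init.run(); // assumption: this is where file systems would be (re)created
    appliedRevision = revision;
    return true;
  }
}
```

Repeated deserialization of unchanged options would then be a no-op, and only
an actual update would trigger re-initialization. A real implementation would
likely also have to tell different options instances apart, which this sketch
ignores.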

Happy to hear your thoughts or alternative approaches on this.

Best,
Moritz

[1] https://github.com/apache/beam/issues/18430
[2] https://issues.apache.org/jira/browse/BEAM-14465
[3] https://issues.apache.org/jira/browse/BEAM-14355
[4] https://github.com/apache/beam/issues/26321
[5] https://github.com/apache/beam/pull/26694
