Hi all,

We have large data sets that we would like to mount over NFS within Dataflow. As far as I know, this is not possible. Has anything changed there?
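Concretely, what we are after is the moral equivalent of the following on each worker (the hostname, export path, and image tag below are made up for illustration, and the harness arguments are elided):

    # NFS export mounted on the worker host:
    sudo mount -t nfs nfs.example.com:/export/datasets /mnt/datasets

    # ...and the SDK harness container launched with that path bind-mounted in,
    # so the pipeline can read the data like a local filesystem:
    docker run --mount type=bind,source=/mnt/datasets,target=/datasets,readonly \
        apache/beam_python3.10_sdk ...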
The data sets that our Beam pipelines operate on are too large (about a petabyte) to put into a custom container or to sync from one location to another, which is why we need the ability to mount a volume, e.g. using docker run --mount.

In the past we used Flink with a custom build of Beam that added control over docker run options; we opened a PR for that here: https://github.com/apache/beam/pull/8982/files

It was not merged into Beam due to "a concern about exposing all docker run options since it would likely be incompatible with cluster managers that use docker". The discussion is here: https://lists.apache.org/thread.html/fa3fede5d2967daab676d3869d2fc94f505517504c3a849db789ef54%40%3Cdev.beam.apache.org%3E

I'm happy to revisit this feature with a pared-down set of options, including --volume and --mount, but we would need Dataflow to support it as well: IIRC, the docker run invocation in Dataflow is part of a black box (i.e. not part of Beam), so as Dataflow users we would rely on Google to make those changes.

Is this a feature that already exists, or one that we can hope to get in Dataflow? Or do we need to go back to Flink to support this?

thanks,
-chad