Could the dataset be accessed from Cloud Storage instead? Does it specifically need to be NFS?
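Beam's built-in IO connectors can read gs:// paths (including globs) directly, so no mount would be needed. A minimal sketch, with the bucket and path as illustrative placeholders:

    import apache_beam as beam

    # Read the dataset straight from a GCS bucket instead of an NFS mount.
    # "gs://example-bucket/datasets/part-*" is a placeholder path.
    with beam.Pipeline() as p:
        records = p | "ReadFromGCS" >> beam.io.ReadFromText(
            "gs://example-bucket/datasets/part-*"
        )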
On Thu, 26 Jan 2023 at 18:16, Chad Dombrova <chad...@gmail.com> wrote:
> Hi all,
> We have large data sets which we would like to mount over NFS within
> Dataflow. As far as I know, this is not possible. Has anything changed
> there?
>
> The data sets that our Beam pipelines operate on are too large (about a
> petabyte) to put into a custom container or to sync from one location to
> another, which is why we need the ability to mount a volume, i.e. using
> docker run --mount.
>
> In the past we used Flink and a custom build of Beam that added control
> over docker run options; we opened a PR for that here:
> https://github.com/apache/beam/pull/8982/files
>
> It was not merged into Beam due to "a concern about exposing all docker
> run options since it would likely be incompatible with cluster managers
> that use docker". The discussion was here:
> https://lists.apache.org/thread.html/fa3fede5d2967daab676d3869d2fc94f505517504c3a849db789ef54%40%3Cdev.beam.apache.org%3E
>
> I'm happy to revisit this feature with a pruned-down set of options,
> including --volume and --mount, but we would need Dataflow to support
> this as well because, IIRC, the docker run invocation is part of a black
> box in Dataflow (i.e. not part of Beam), so as Dataflow users we rely on
> those changes coming from Google.
>
> Is this a feature that already exists, or one we can hope to get in
> Dataflow? Or do we need to go back to Flink to support this?
>
> thanks,
> -chad
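For reference, the kind of invocation the quoted request is asking the runner to issue would look roughly like this. This is only a sketch using Docker's local volume driver with an NFS backend; the NFS server address, export path, volume name, and SDK image are illustrative placeholders:

    # Mount an NFS export into the SDK harness container at /mnt/datasets.
    # 10.0.0.10 and /exports/datasets are placeholders for the real server
    # and export; the image name is likewise illustrative.
    docker run --rm \
      --mount 'type=volume,src=datasets,dst=/mnt/datasets,volume-driver=local,volume-opt=type=nfs,"volume-opt=o=addr=10.0.0.10,ro",volume-opt=device=:/exports/datasets' \
      apache/beam_python3.10_sdk

The sticking point, as the quoted mail notes, is that the runner, not the user, constructs the docker run command line, so there is currently no supported way to inject --mount or --volume arguments like these on Dataflow.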