Hi all,
We have large data sets which we would like to mount over NFS within
Dataflow.  As far as I know, this is not possible.  Has anything changed
there?

The data sets that our Beam pipelines operate on are too large -- about a
petabyte -- to bake into a custom container or to sync from one location to
another, which is why we need the ability to mount a volume, e.g. using
docker run --mount.
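For concreteness, the kind of invocation we'd want to be able to produce is
something like the following (the NFS server address, export path, and image
name are placeholders; this uses docker's stock local volume driver, which
can mount NFS via volume-opt):

```shell
# Sketch: run an SDK harness container with a petabyte-scale NFS export
# mounted read-only at /data, using docker's built-in NFS volume support.
docker run --rm \
  --mount 'type=volume,dst=/data,volume-driver=local,volume-opt=type=nfs,volume-opt=device=:/export/datasets,"volume-opt=o=addr=nfs.example.com,ro"' \
  example-beam-sdk-image

# Alternatively, if the host already has the NFS export mounted,
# a plain bind mount would suffice:
docker run --rm \
  --mount type=bind,source=/mnt/datasets,target=/data,readonly \
  example-beam-sdk-image
```

Either form would work for us; the point is just that some pass-through for
--mount/--volume is needed at the layer that invokes docker run.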

In the past we used Flink with a custom build of Beam that added control
over docker run options; we opened a PR for that here:
https://github.com/apache/beam/pull/8982/files

It was not merged into Beam due to "a concern about exposing all docker run
options since it would likely be incompatible with cluster managers that
use docker".  The discussion is here:
https://lists.apache.org/thread.html/fa3fede5d2967daab676d3869d2fc94f505517504c3a849db789ef54%40%3Cdev.beam.apache.org%3E

I'm happy to revisit this feature with a pared-down set of options,
including --volume and --mount, but we would need Dataflow to support it as
well: IIRC, the docker run invocation happens inside a black box in
Dataflow (i.e. not part of Beam), so as Dataflow users we would rely on
Google to make those changes.

Is this a feature that already exists or that we can hope to get in
Dataflow?  Or do we need to go back to Flink to support this?

thanks,
-chad
