Could the dataset be accessed from Cloud Storage instead? Does it specifically need to be NFS?
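Beam's built-in IO connectors can read gs:// paths (including globs) directly, so no mount would be needed. A minimal sketch, with the bucket and path as illustrative placeholders:

    import apache_beam as beam

    # Read the dataset straight from a GCS bucket instead of an NFS mount.
    # "gs://example-bucket/datasets/part-*" is a placeholder path.
    with beam.Pipeline() as p:
        records = p | "ReadFromGCS" >> beam.io.ReadFromText(
            "gs://example-bucket/datasets/part-*"
        )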
On Thu, 26 Jan 2023 at 18:16, Chad Dombrova <chad...@gmail.com> wrote:
> Hi all,
> We have large data sets which we would like to mount over NFS within
> Dataflow. As far as I know, this is not possible. Has anything changed
> there?
>
> The data sets that our Beam pipelines operate on are too large (about a
> petabyte) to put into a custom container or to sync from one location to
> another, which is why we need the ability to mount a volume, i.e. using
> docker run --mount.
>
> In the past we used Flink and a custom build of Beam that added control
> over docker run options; we opened a PR for that here:
> https://github.com/apache/beam/pull/8982/files
>
> It was not merged into Beam due to "a concern about exposing all docker
> run options since it would likely be incompatible with cluster managers
> that use docker". The discussion was here:
> https://lists.apache.org/thread.html/fa3fede5d2967daab676d3869d2fc94f505517504c3a849db789ef54%40%3Cdev.beam.apache.org%3E
>
> I'm happy to revisit this feature with a pruned-down set of options,
> including --volume and --mount, but we would need Dataflow to support
> this as well because, IIRC, the docker run invocation is part of a black
> box in Dataflow (i.e. not part of Beam), so as Dataflow users we rely on
> those changes coming from Google.
>
> Is this a feature that already exists, or one we can hope to get in
> Dataflow? Or do we need to go back to Flink to support this?
>
> thanks,
> -chad
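For reference, the kind of invocation the quoted request is asking the runner to issue would look roughly like this. This is only a sketch using Docker's local volume driver with an NFS backend; the NFS server address, export path, volume name, and SDK image are illustrative placeholders:

    # Mount an NFS export into the SDK harness container at /mnt/datasets.
    # 10.0.0.10 and /exports/datasets are placeholders for the real server
    # and export; the image name is likewise illustrative.
    docker run --rm \
      --mount 'type=volume,src=datasets,dst=/mnt/datasets,volume-driver=local,volume-opt=type=nfs,"volume-opt=o=addr=10.0.0.10,ro",volume-opt=device=:/exports/datasets' \
      apache/beam_python3.10_sdk

The sticking point, as the quoted mail notes, is that the runner, not the user, constructs the docker run command line, so there is currently no supported way to inject --mount or --volume arguments like these on Dataflow.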