Hi there,

I imagine the answer to this question might depend on the underlying
runner, but simply put: can I write temporary files to local disk from
within a pipeline? I'm currently using the DataflowRunner, if that's a
helpful detail.
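
For concreteness, here is roughly the kind of thing I have in mind (Python
SDK; the DoFn and its contents are just made up for illustration):

import tempfile

import apache_beam as beam


class WriteScratchFile(beam.DoFn):
    # Writes each incoming string to a throwaway file on the worker's
    # local disk and emits the local path for downstream steps.
    def process(self, element):
        with tempfile.NamedTemporaryFile(
                mode="w", suffix=".txt", delete=False) as tmp:
            tmp.write(element)
        yield tmp.name

Is it safe to assume a writable local filesystem is available on the
workers for this kind of scratch space?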

Relatedly, how does Beam handle large files? Say that my pipeline reads
files from a distributed file system like AWS S3 or GCP Cloud Storage. If a
file is 10 GB and I read its contents, will the full 10 GB be held in
memory?
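
In other words, if I do something like the following (the path handling is
made up, and I'm assuming FileSystems.open is the right way to open a
remote file from inside a DoFn):

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems


class ReadWholeFile(beam.DoFn):
    # Takes a path like "gs://my-bucket/big-file" and emits its full
    # contents as a single element.
    def process(self, file_path):
        with FileSystems.open(file_path) as f:
            yield f.read()

is f.read() here going to buffer everything on the worker, or should I be
reading in chunks instead?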

As a somewhat contrived example, what would be the recommended approach if
I wanted to read a set of large files, tar them, and upload them elsewhere?
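
Is a pattern like the sketch below reasonable, or is there a more
Beam-native approach? (The class name and output path are invented, and
I'm assuming FileSystems.open / FileSystems.create are appropriate calls
for streaming from and to GCS or S3.)

import shutil
import tarfile
import tempfile

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems


class TarAndUpload(beam.DoFn):
    # Receives a list of input paths, bundles them into a single tar on
    # local disk, then streams that tar to output_path.
    def __init__(self, output_path):
        self.output_path = output_path  # e.g. "gs://my-bucket/archive.tar"

    def process(self, file_paths):
        with tempfile.NamedTemporaryFile(suffix=".tar") as local_tar:
            with tarfile.open(fileobj=local_tar, mode="w") as tar:
                for path in file_paths:
                    # Stage each remote file locally so tarfile can add it.
                    with FileSystems.open(path) as src, \
                            tempfile.NamedTemporaryFile() as staged:
                        shutil.copyfileobj(src, staged)
                        staged.flush()
                        tar.add(staged.name, arcname=path.split("/")[-1])
            local_tar.flush()
            local_tar.seek(0)
            dst = FileSystems.create(self.output_path)
            shutil.copyfileobj(local_tar, dst)
            dst.close()
        yield self.output_path

In particular, I'm wondering whether staging everything to local disk like
this is acceptable on Dataflow workers, or whether it should stay
streaming end to end.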

Thanks!
Evan
