Hi there,

I imagine the answer to this question might depend on the underlying
runner, but simply put: can I write temporary files to local disk from
within a pipeline? I'm currently using the DataflowRunner, if that's a
helpful detail.
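
For concreteness, here is roughly the kind of thing I have in mind (Python
SDK; the DoFn and its contents are just made up for illustration):

import tempfile

import apache_beam as beam


class WriteScratchFile(beam.DoFn):
    # Writes each incoming string to a throwaway file on the worker's
    # local disk and emits the local path for downstream steps.
    def process(self, element):
        with tempfile.NamedTemporaryFile(
                mode="w", suffix=".txt", delete=False) as tmp:
            tmp.write(element)
        yield tmp.name

Is it safe to assume a writable local filesystem is available on the
workers for this kind of scratch space?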

Relatedly, how does Beam handle large files? Say that my pipeline reads
files from a distributed file system like AWS S3 or GCP Cloud Storage. If a
file is 10 GB and I read its contents, will the full 10 GB be held in
memory?
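
In other words, if I do something like the following (the path handling is
made up, and I'm assuming FileSystems.open is the right way to open a
remote file from inside a DoFn):

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems


class ReadWholeFile(beam.DoFn):
    # Takes a path like "gs://my-bucket/big-file" and emits its full
    # contents as a single element.
    def process(self, file_path):
        with FileSystems.open(file_path) as f:
            yield f.read()

is f.read() here going to buffer everything on the worker, or should I be
reading in chunks instead?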

As a somewhat contrived example, what would be the recommended approach if
I wanted to read a set of large files, tar them, and upload them elsewhere?
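
Is a pattern like the sketch below reasonable, or is there a more
Beam-native approach? (The class name and output path are invented, and
I'm assuming FileSystems.open / FileSystems.create are appropriate calls
for streaming from and to GCS or S3.)

import shutil
import tarfile
import tempfile

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems


class TarAndUpload(beam.DoFn):
    # Receives a list of input paths, bundles them into a single tar on
    # local disk, then streams that tar to output_path.
    def __init__(self, output_path):
        self.output_path = output_path  # e.g. "gs://my-bucket/archive.tar"

    def process(self, file_paths):
        with tempfile.NamedTemporaryFile(suffix=".tar") as local_tar:
            with tarfile.open(fileobj=local_tar, mode="w") as tar:
                for path in file_paths:
                    # Stage each remote file locally so tarfile can add it.
                    with FileSystems.open(path) as src, \
                            tempfile.NamedTemporaryFile() as staged:
                        shutil.copyfileobj(src, staged)
                        staged.flush()
                        tar.add(staged.name, arcname=path.split("/")[-1])
            local_tar.flush()
            local_tar.seek(0)
            dst = FileSystems.create(self.output_path)
            shutil.copyfileobj(local_tar, dst)
            dst.close()
        yield self.output_path

In particular, I'm wondering whether staging everything to local disk like
this is acceptable on Dataflow workers, or whether it should stay
streaming end to end.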

Thanks!
Evan
