Not all data is kept in memory. However, if you have too few shards and
writing the files is slow, data has to stay in memory while the file write is
in progress.
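
For example, something like this with the Java SDK's BigQueryIO (just a
sketch; the table name, shard count, and triggering frequency below are
placeholders, not recommendations -- adjust for your SDK and workload):

    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
    import org.joda.time.Duration;

    // rows is a PCollection<TableRow> produced upstream of the sink.
    rows.apply("WriteViaFileLoads",
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table")                  // placeholder table
            .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
            .withTriggeringFrequency(Duration.standardMinutes(5))  // how often load jobs fire
            .withNumFileShards(200)                                // files written in parallel per pane
            .withCreateDisposition(CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(WriteDisposition.WRITE_APPEND));

Dropping withNumFileShards and calling withAutoSharding() instead lets the
runner pick the shard count; the point either way is that with too few shards
each worker buffers more rows while its file is still being written.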

On Wed, Oct 2, 2024 at 11:16 AM [email protected] <[email protected]> wrote:

> We are trying to process over 150 TB of data (streaming, unbounded) per day
> and save it to BQ, and it looks like the Storage API is not economical
> enough for us. I tried to use file upload but somehow it doesn't work, and
> there is not much documentation for the file upload method online. I have a
> few questions regarding the file_upload method in streaming mode.
> 1. How do I decide numOfFileShards? Can I still rely on autosharding?
> 2. I noticed the fileloads method requires much more memory. I'm not sure
> whether the Dataflow runner keeps all the data in memory before writing it
> to file; if so, even one minute of data is too much to keep in memory, and
> less than one minute would exceed the API quota. Is there a way to cap the
> memory usage, such as writing data to files before triggering the file load
> job?
> 3. I also noticed that if a file upload job fails, I don't get the error
> message, so what can I do to handle the error? What is the best practice
> for error handling with the file_upload method?
>
> Thanks!
> Regards,
> Siyuan
>
