Thanks — so autosharding is recommended for the file-loads method as well?
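For context, here is a minimal sketch of the kind of configuration being discussed: a streaming BigQuery write using FILE_LOADS with autosharding instead of a fixed shard count. The table name and triggering frequency are illustrative placeholders, not recommendations, and the surrounding pipeline setup is assumed to exist.

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class FileLoadsSketch {
  static void write(PCollection<TableRow> rows) {
    rows.apply(
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table") // hypothetical table spec
            // Assumes the table already exists, so no schema is supplied here.
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
            // In streaming mode, FILE_LOADS requires a triggering frequency;
            // each trigger starts a new load job, so choose it with BigQuery's
            // load-job quota in mind.
            .withTriggeringFrequency(Duration.standardMinutes(5))
            // Let the runner pick and adjust the shard count dynamically,
            // instead of fixing it with withNumFileShards(n).
            .withAutoSharding());
  }
}
```

With a fixed `withNumFileShards(n)`, too few shards can make file writes slow and force data to be buffered in memory in the meantime, which is the memory pressure described below; autosharding lets the runner rebalance this.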

On Wed, Oct 2, 2024 at 12:20 PM Reuven Lax via user <[email protected]>
wrote:

> Not all data is kept in memory. However, if you have too few shards and
> writing the files is slow, data has to be buffered in memory while the file
> write is in progress.
>
> On Wed, Oct 2, 2024 at 11:16 AM [email protected] <[email protected]> wrote:
>
>> We are trying to process over 150 TB of data (unbounded streaming) per day
>> and save it to BigQuery, and it looks like the Storage Write API is not
>> economical enough for us. I tried to use file loads, but I couldn't get it
>> to work, and there is little documentation on the file-loads method online.
>> I have a few questions about the file-loads method in streaming mode.
>> 1. How do I decide numFileShards? Can I still rely on autosharding?
>> 2. I noticed the file-loads method requires much more memory. Does the
>> Dataflow runner keep all the data in memory before writing it to files? If
>> so, even one minute of data is too much to keep in memory, and a triggering
>> frequency of less than one minute would exceed the API quota. Is there a
>> way to cap memory usage, e.g. write data to files before triggering the
>> load job?
>> 3. I also noticed that when a load job fails, I don't get the error
>> message. How can I handle such errors, and what is the best practice for
>> error handling with the file-loads method?
>>
>> Thanks!
>> Regards,
>> Siyuan
>>
>
