We are trying to process over 150 TB of data per day (streaming, unbounded) and
save it to BigQuery, and it looks like the Storage Write API is not economical
enough for us. I tried to use the file loads method instead, but somehow it
doesn't work, and there isn't much documentation about it online. I have a few
questions about using file loads in streaming mode (a rough sketch of my
current configuration is at the end of this message).
1. How do I decide numFileShards? Can I still rely on auto-sharding?
2. I noticed that the file loads method requires much more memory. I'm not sure
whether the Dataflow runner keeps all the data in memory before writing it to
files? If so, even one minute of data is too much to keep in memory, and a
triggering frequency of less than one minute would exceed the load job API
quota. Is there a way to cap memory usage, for example by writing the data to
files before the load job is triggered?
3. I also noticed that when a load job fails, I don't get the error message.
What can I do to handle the failure, and what is the best practice for error
handling with the file loads method?

Thanks!
Regards,
Siyuan
