Are there any best practices for error handling for a file upload job?

On Wed, Oct 2, 2024 at 1:04 PM [email protected] <[email protected]> wrote:

> STORAGE_API_AT_LEAST_ONCE only saves Dataflow engine cost, but the Storage
> API cost alone is too high for us. That's why we want to switch to file
> upload.
>
> On Wed, Oct 2, 2024 at 12:08 PM XQ Hu via user <[email protected]>
> wrote:
>
>> Have you checked
>> https://cloud.google.com/dataflow/docs/guides/write-to-bigquery?
>>
>> Autosharding is generally recommended. If cost is the concern, have you
>> checked STORAGE_API_AT_LEAST_ONCE?
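>>
>> With the Java SDK, that setting looks roughly like this (the table name
>> is a placeholder, and "rows" is assumed to be your unbounded
>> PCollection<TableRow>):
>>
>>   rows.apply(
>>       "WriteAtLeastOnce",
>>       BigQueryIO.writeTableRows()
>>           .to("my-project:my_dataset.my_table")  // placeholder table
>>           // At-least-once Storage Write API: no triggering frequency or
>>           // shard count is needed for this method.
>>           .withMethod(BigQueryIO.Write.Method.STORAGE_API_AT_LEAST_ONCE)
>>           .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
>>           .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));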
>>
>> On Wed, Oct 2, 2024 at 2:16 PM [email protected] <[email protected]> wrote:
>>
>>> We are trying to process over 150 TB of data per day (streaming,
>>> unbounded) and save it to BQ, and it looks like the Storage API is not
>>> economical enough for us.  I tried to use file upload, but somehow it
>>> doesn't work, and there is not much documentation online for the file
>>> upload method. I have a few questions about the FILE_LOADS method in
>>> streaming mode.
>>> 1. How do I decide numFileShards? Can I still rely on autosharding?
>>> (See the sketch after these questions.)
>>> 2. I noticed the FILE_LOADS method requires much more memory. I'm not
>>> sure whether the Dataflow runner keeps all the data in memory before
>>> writing it to files. If so, even one minute of data is too much to keep
>>> in memory, and triggering more often than once a minute would exceed the
>>> API quota. Is there a way to cap the memory usage, e.g. write data to
>>> files continuously before triggering the load job?
>>> 3. I also noticed that when a file upload (load) job fails, I don't get
>>> the error message. What can I do to handle the error, and what is the
>>> best practice for error handling with the FILE_LOADS method?
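>>>
>>> For reference, the settings these questions refer to are roughly the
>>> following (Java SDK; the table name, shard count, and triggering
>>> frequency are placeholders, and "rows" is assumed to be the unbounded
>>> PCollection<TableRow> being written):
>>>
>>>   rows.apply(
>>>       "WriteViaFileLoads",
>>>       BigQueryIO.writeTableRows()
>>>           .to("my-project:my_dataset.my_table")  // placeholder table
>>>           .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
>>>           // Streaming FILE_LOADS needs a triggering frequency, which
>>>           // controls how often load jobs are issued (org.joda.time.Duration).
>>>           .withTriggeringFrequency(Duration.standardMinutes(5))
>>>           // Either a fixed shard count or autosharding must be chosen:
>>>           .withNumFileShards(100)
>>>           // .withAutoSharding()
>>>           .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
>>>           .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));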
>>>
>>> Thanks!
>>> Regards,
>>> Siyuan
>>>
>>
