Our requirement is to process 150 TB of data every day, so I don't think the AVRO format alone would make a huge difference? Also, is there any example of how to debug when the COPY/LOAD job fails? I don't get any log detail to know what went wrong in the job; it simply tells me the job failed, and when I try bq show -j or look in the UI, I don't even see those jobs. And if the job fails, how does Dataflow clean up those temp tables?
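For reference, a minimal sketch of one way to enumerate recent jobs in the target project and surface their error results, using the google-cloud-bigquery Java client. It assumes Application Default Credentials pointing at the project the pipeline writes to; the class name is just illustrative.

  import com.google.api.gax.paging.Page;
  import com.google.cloud.bigquery.BigQuery;
  import com.google.cloud.bigquery.BigQueryOptions;
  import com.google.cloud.bigquery.Job;

  public class ListFailedBqJobs {
    public static void main(String[] args) {
      // Uses Application Default Credentials and the default project.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      // Dataflow submits the load/copy jobs as the worker service account,
      // so list jobs from all users in the project, not just your own.
      Page<Job> jobs = bigquery.listJobs(
          BigQuery.JobListOption.allUsers(),
          BigQuery.JobListOption.pageSize(100));

      for (Job job : jobs.iterateAll()) {
        if (job.getStatus() != null && job.getStatus().getError() != null) {
          System.out.printf("%s failed: %s%n",
              job.getJobId().getJob(),
              job.getStatus().getError().getMessage());
        }
      }
    }
  }

The same point may explain why the jobs don't show up on the CLI: because they are owned by the worker service account, they typically only appear with bq ls -j -a (all users) in the project the pipeline writes to, and then bq show -j <job_id> on the IDs found there.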
On Sun, Oct 6, 2024 at 12:35 PM Reuven Lax <[email protected]> wrote:

> By default the file is in JSON format. You can provide a formatter to
> allow it to be in AVRO format instead, which will be more efficient.
>
> The temp tables are only created if file sizes are too large for a single
> load into BQ (if you use an AVRO formatter you might be able to reduce file
> sizes enough to avoid this). In this case, Beam will issue a copy job to
> copy all the temp tables to the final table.
>
> On Wed, Oct 2, 2024 at 2:42 PM [email protected] <[email protected]> wrote:
>
>> @Reuven Lax <[email protected]> I do see the file_upload method create tons
>> of temp tables, but when does BQ load the temp tables into the final table?
>>
>> On Wed, Oct 2, 2024 at 1:17 PM Reuven Lax via user <[email protected]> wrote:
>>
>>> File loads do not return per-row errors (unlike the Storage API, which
>>> does). Dataflow will generally retry the entire file load on error
>>> (indefinitely for streaming and up to 3 times for batch). You can look at
>>> the logs to find the specific error, however it can be tricky to associate
>>> it with a specific row.
>>>
>>> Reuven
>>>
>>> On Wed, Oct 2, 2024 at 1:08 PM [email protected] <[email protected]> wrote:
>>>
>>>> Any best practices for error handling for the file upload job?
>>>>
>>>> On Wed, Oct 2, 2024 at 1:04 PM [email protected] <[email protected]> wrote:
>>>>
>>>>> STORAGE_API_AT_LEAST_ONCE only saves Dataflow engine cost, but the
>>>>> Storage API cost alone is too high for us; that's why we want to switch
>>>>> to file upload.
>>>>>
>>>>> On Wed, Oct 2, 2024 at 12:08 PM XQ Hu via user <[email protected]> wrote:
>>>>>
>>>>>> Have you checked
>>>>>> https://cloud.google.com/dataflow/docs/guides/write-to-bigquery?
>>>>>>
>>>>>> Autosharding is generally recommended. If cost is the concern,
>>>>>> have you checked STORAGE_API_AT_LEAST_ONCE?
>>>>>>
>>>>>> On Wed, Oct 2, 2024 at 2:16 PM [email protected] <[email protected]> wrote:
>>>>>>
>>>>>>> We are trying to process over 150 TB of data (streaming, unbounded)
>>>>>>> per day and save it to BQ, and it looks like the Storage API is not
>>>>>>> economical enough for us. I tried to use file upload but somehow it
>>>>>>> doesn't work, and there are not many documents for the file upload
>>>>>>> method online. I have a few questions regarding the file_upload
>>>>>>> method in streaming mode.
>>>>>>> 1. How do I decide numFileShards? Can I still rely on autosharding?
>>>>>>> 2. I noticed the file loads method requires much more memory. I'm not
>>>>>>> sure if the Dataflow runner keeps all the data in memory before
>>>>>>> writing to files? If so, even one minute of data is too much to keep
>>>>>>> in memory, and less than one minute would exceed the API quota. Is
>>>>>>> there a way to cap the memory usage, like writing data to files
>>>>>>> before triggering the file load job?
>>>>>>> 3. I also noticed that if there is a file upload job failure, I don't
>>>>>>> get the error message, so what can I do to handle the error? What is
>>>>>>> the best practice for error handling in the file_upload method?
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Regards,
>>>>>>> Siyuan
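A rough sketch of the FILE_LOADS configuration discussed above, using the Beam Java SDK. The event type MyEvent, the Avro converter toGenericRecord, and MY_TABLE_SCHEMA are placeholders, and the specific triggering frequency and shard count are assumptions to tune, not recommendations.

  import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
  import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
  import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method;
  import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
  import org.apache.beam.sdk.values.PCollection;
  import org.joda.time.Duration;

  // events is the unbounded PCollection produced upstream; MyEvent,
  // toGenericRecord and MY_TABLE_SCHEMA are placeholders for your own types.
  static void writeToBigQuery(PCollection<MyEvent> events) {
    events.apply("WriteToBigQuery",
        BigQueryIO.<MyEvent>write()
            .to("my-project:my_dataset.my_table")
            .withMethod(Method.FILE_LOADS)
            // In streaming, FILE_LOADS buffers records to files and flushes
            // them into load jobs on this cadence; larger values mean fewer,
            // bigger load jobs (there is a daily load-job quota per table).
            .withTriggeringFrequency(Duration.standardMinutes(5))
            // Pick one: an explicit shard count or .withAutoSharding().
            .withNumFileShards(100)
            // Stage Avro files instead of the default JSON to shrink file sizes.
            .withAvroFormatFunction(
                req -> toGenericRecord(req.getElement(), req.getSchema()))
            .withSchema(MY_TABLE_SCHEMA)
            .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(WriteDisposition.WRITE_APPEND));
  }

Whether Avro staging keeps the per-trigger files small enough to avoid the temp-table plus copy-job path mentioned above depends on the data volume per trigger, so it is something to verify rather than assume.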
