It is a streaming job.

On Tue, Oct 8, 2024 at 10:40 AM Reuven Lax <[email protected]> wrote:

> Is this a batch or streaming job?
>
> On Tue, Oct 8, 2024 at 10:25 AM [email protected] <[email protected]> wrote:
>
>> It looks like the COPY job failed because the TEMP table was removed. @Reuven
>> Lax <[email protected]>  Is that possible? Is there a way to avoid
>> that? Or, even better, is there a way to force writing to the destination
>> table directly? Thanks!
>>
>> On Sun, Oct 6, 2024 at 12:35 PM Reuven Lax <[email protected]> wrote:
>>
>>> By default the files are written in JSON format. You can provide a format
>>> function to write Avro instead, which is more efficient.
>>>
>>> Temp tables are only created when the files are too large for a single
>>> load job into BQ (if you use an Avro format function, you might be able to
>>> reduce file sizes enough to avoid this). In that case, Beam issues a copy
>>> job to copy all the temp tables into the final table.
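>>>
>>> For reference, a minimal sketch of plugging in an Avro format function via
>>> withAvroFormatFunction (MyRow, toGenericRecord, tableSchema, and rows are
>>> placeholders, not real Beam names):
>>>
>>>   import org.apache.beam.sdk.io.gcp.bigquery.AvroWriteRequest;
>>>   import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
>>>   import org.joda.time.Duration;
>>>
>>>   rows.apply(
>>>       BigQueryIO.<MyRow>write()
>>>           .to("my-project:my_dataset.my_table")
>>>           .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
>>>           .withTriggeringFrequency(Duration.standardMinutes(5))
>>>           .withAutoSharding()
>>>           .withSchema(tableSchema)  // lets Beam derive the Avro schema
>>>           // Write Avro files instead of the default JSON:
>>>           .withAvroFormatFunction(
>>>               (AvroWriteRequest<MyRow> req) ->
>>>                   toGenericRecord(req.getElement(), req.getSchema())));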
>>>
>>> On Wed, Oct 2, 2024 at 2:42 PM [email protected] <[email protected]>
>>> wrote:
>>>
>>>> @Reuven Lax <[email protected]>  I do see the file loads method creating
>>>> tons of temp tables, but when does BQ copy the temp tables into the final
>>>> table?
>>>>
>>>> On Wed, Oct 2, 2024 at 1:17 PM Reuven Lax via user <
>>>> [email protected]> wrote:
>>>>
>>>>> File loads do not return per-row errors (unlike the Storage API, which
>>>>> does). Dataflow will generally retry the entire file load on error
>>>>> (indefinitely for streaming and up to 3 times for batch). You can look at
>>>>> the logs to find the specific error; however, it can be tricky to
>>>>> associate it with a specific row.
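>>>>>
>>>>> As a minimal sketch, the retry cap can also be set explicitly (assuming
>>>>> withMaxRetryJobs is available in your Beam version; rows and the table
>>>>> name are placeholders):
>>>>>
>>>>>   import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
>>>>>
>>>>>   rows.apply(
>>>>>       BigQueryIO.writeTableRows()
>>>>>           .to("my-project:my_dataset.my_table")
>>>>>           .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
>>>>>           // Cap load-job retries; per the behavior above, streaming
>>>>>           // otherwise retries indefinitely:
>>>>>           .withMaxRetryJobs(3));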
>>>>>
>>>>> Reuven
>>>>>
>>>>> On Wed, Oct 2, 2024 at 1:08 PM [email protected] <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Any best practices for error handling with file load jobs?
>>>>>>
>>>>>> On Wed, Oct 2, 2024 at 1:04 PM [email protected] <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> STORAGE_API_AT_LEAST_ONCE only saves Dataflow engine cost, but the
>>>>>>> Storage API cost alone is too high for us; that's why we want to switch
>>>>>>> to file loads.
>>>>>>>
>>>>>>> On Wed, Oct 2, 2024 at 12:08 PM XQ Hu via user <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Have you checked
>>>>>>>> https://cloud.google.com/dataflow/docs/guides/write-to-bigquery?
>>>>>>>>
>>>>>>>> Autosharding is generally recommended. If cost is the concern,
>>>>>>>> have you checked STORAGE_API_AT_LEAST_ONCE?
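>>>>>>>>
>>>>>>>> A minimal sketch of the at-least-once Storage Write API mode (rows
>>>>>>>> and the table name are placeholders):
>>>>>>>>
>>>>>>>>   import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
>>>>>>>>
>>>>>>>>   rows.apply(
>>>>>>>>       BigQueryIO.writeTableRows()
>>>>>>>>           .to("my-project:my_dataset.my_table")
>>>>>>>>           .withMethod(
>>>>>>>>               BigQueryIO.Write.Method.STORAGE_API_AT_LEAST_ONCE)
>>>>>>>>           // CREATE_NEVER: table already exists, so no schema needed
>>>>>>>>           .withCreateDisposition(
>>>>>>>>               BigQueryIO.Write.CreateDisposition.CREATE_NEVER));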
>>>>>>>>
>>>>>>>> On Wed, Oct 2, 2024 at 2:16 PM [email protected] <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> We are trying to process over 150 TB of data per day (streaming,
>>>>>>>>> unbounded) and save it to BQ, and it looks like the Storage API is
>>>>>>>>> not economical enough for us. I tried to use file loads, but somehow
>>>>>>>>> it doesn't work, and there is not much documentation online for the
>>>>>>>>> file loads method. I have a few questions about using it in streaming
>>>>>>>>> mode.
>>>>>>>>> 1. How do I decide numFileShards? Can I still rely on autosharding?
>>>>>>>>> (See the rough sketch of these knobs after question 3 below.)
>>>>>>>>> 2. I noticed the file loads method requires much more memory. Does
>>>>>>>>> the Dataflow runner keep all the data in memory before writing it to
>>>>>>>>> files? If so, even one minute of data is too much to hold in memory,
>>>>>>>>> and triggering more often than once a minute would exceed the API
>>>>>>>>> quota. Is there a way to cap memory usage, e.g. by writing data to
>>>>>>>>> files incrementally before triggering the load job?
>>>>>>>>> 3. I also noticed that when a file load job fails, I don't get the
>>>>>>>>> error message. What can I do to handle the error? What is the best
>>>>>>>>> practice for error handling with the file loads method?
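>>>>>>>>>
>>>>>>>>> For context, a rough sketch of the knobs from question 1 (values are
>>>>>>>>> illustrative only; rows and the table name are placeholders):
>>>>>>>>>
>>>>>>>>>   import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
>>>>>>>>>   import org.joda.time.Duration;
>>>>>>>>>
>>>>>>>>>   rows.apply(
>>>>>>>>>       BigQueryIO.writeTableRows()
>>>>>>>>>           .to("my-project:my_dataset.my_table")
>>>>>>>>>           .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
>>>>>>>>>           // How often load jobs fire:
>>>>>>>>>           .withTriggeringFrequency(Duration.standardMinutes(5))
>>>>>>>>>           // Explicit sharding, or .withAutoSharding() instead:
>>>>>>>>>           .withNumFileShards(100));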
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>> Regards,
>>>>>>>>> Siyuan
>>>>>>>>>
>>>>>>>>
