No. To use AVRO you need to use either withAvroFormatFunction or
withAvroWriter.
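
A minimal sketch of the first option with FILE_LOADS (MyEvent, tableSchema,
and the toGenericRecord helper are hypothetical placeholders for your own
element type, table schema, and Avro conversion):

    events.apply(
        BigQueryIO.<MyEvent>write()
            .to("my-project:my_dataset.my_table")
            .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
            .withSchema(tableSchema)
            // Write the load files as Avro instead of the default JSON.
            .withAvroFormatFunction(
                request ->
                    // request is an AvroWriteRequest<MyEvent>; build a
                    // GenericRecord that matches the request's Avro schema.
                    toGenericRecord(request.getElement(), request.getSchema())));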

On Tue, Oct 8, 2024 at 11:09 AM [email protected] <[email protected]> wrote:

> Btw, file_load doesn't support proto directly?
>
> On Tue, Oct 8, 2024 at 11:02 AM Reuven Lax <[email protected]> wrote:
>
>> I would try to use AVRO if possible - it tends to decrease the file size
>> by quite a lot, and might get you under the limit for a single load job,
>> which is 11TB or 10,000 files (depending on the frequency at which you are
>> triggering the loads). JSON tends to blow up the data size quite a bit.
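>>
>> The triggering frequency is set on the write transform itself; a rough
>> sketch with illustrative values, not recommendations (Duration here is
>> org.joda.time.Duration, and the shard count is hypothetical):
>>
>>     write
>>         .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
>>         // Less frequent triggering batches more data into each load job,
>>         // but each job must stay under BigQuery's per-load limits.
>>         .withTriggeringFrequency(Duration.standardMinutes(10))
>>         // Streaming FILE_LOADS needs an explicit shard count unless
>>         // auto-sharding is enabled via withAutoSharding() instead.
>>         .withNumFileShards(100);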
>>
>> BTW - is this using the Dataflow runner? If so, Beam should never delete
>> the temp tables until the copy job is completed.
>>
>> On Tue, Oct 8, 2024 at 10:49 AM [email protected] <[email protected]>
>> wrote:
>>
>>> It is a streaming job
>>>
>>> On Tue, Oct 8, 2024 at 10:40 AM Reuven Lax <[email protected]> wrote:
>>>
>>>> Is this a batch or streaming job?
>>>>
>>>> On Tue, Oct 8, 2024 at 10:25 AM [email protected] <[email protected]>
>>>> wrote:
>>>>
>>>>> It looks like the COPY job failed because the TEMP table was removed.
>>>>> @Reuven Lax <[email protected]> Is that possible? Is there a way to
>>>>> avoid that? Or even better, is there a way to force writing to the
>>>>> destination table directly? Thanks!
>>>>>
>>>>> On Sun, Oct 6, 2024 at 12:35 PM Reuven Lax <[email protected]> wrote:
>>>>>
>>>>>> By default the files are written in JSON format. You can provide a
>>>>>> formatter to write them in AVRO format instead, which will be more
>>>>>> efficient.
>>>>>>
>>>>>> The temp tables are only created if the file sizes are too large for a
>>>>>> single load into BQ (if you use an AVRO formatter you might be able to
>>>>>> reduce the file size enough to avoid this). In that case, Beam will
>>>>>> issue a copy job to copy all the temp tables to the final table.
>>>>>>
>>>>>> On Wed, Oct 2, 2024 at 2:42 PM [email protected] <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> @Reuven Lax <[email protected]> I do see the file_upload method create
>>>>>>> tons of temp tables, but when does BQ copy the temp tables into the
>>>>>>> final table?
>>>>>>>
>>>>>>> On Wed, Oct 2, 2024 at 1:17 PM Reuven Lax via user <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> File loads do not return per-row errors (unlike the Storage API,
>>>>>>>> which does). Dataflow will generally retry the entire file load on
>>>>>>>> error (indefinitely for streaming and up to 3 times for batch). You
>>>>>>>> can look at the logs to find the specific error; however, it can be
>>>>>>>> tricky to associate it with a specific row.
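>>>>>>>>
>>>>>>>> One way to work around that (a sketch, not part of the file loads
>>>>>>>> API itself): validate rows yourself before the write and route bad
>>>>>>>> ones to a dead-letter output, so load jobs only ever see clean data.
>>>>>>>> isValid below is a hypothetical check against your table schema:
>>>>>>>>
>>>>>>>>     final TupleTag<TableRow> validTag = new TupleTag<TableRow>() {};
>>>>>>>>     final TupleTag<TableRow> deadLetterTag = new TupleTag<TableRow>() {};
>>>>>>>>     PCollectionTuple split =
>>>>>>>>         rows.apply(
>>>>>>>>             ParDo.of(
>>>>>>>>                     new DoFn<TableRow, TableRow>() {
>>>>>>>>                       @ProcessElement
>>>>>>>>                       public void process(ProcessContext c) {
>>>>>>>>                         // Good rows go to the main output (validTag),
>>>>>>>>                         // bad rows to the dead-letter side output.
>>>>>>>>                         if (isValid(c.element())) {
>>>>>>>>                           c.output(c.element());
>>>>>>>>                         } else {
>>>>>>>>                           c.output(deadLetterTag, c.element());
>>>>>>>>                         }
>>>>>>>>                       }
>>>>>>>>                     })
>>>>>>>>                 .withOutputTags(validTag, TupleTagList.of(deadLetterTag)));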
>>>>>>>>
>>>>>>>> Reuven
>>>>>>>>
>>>>>>>> On Wed, Oct 2, 2024 at 1:08 PM [email protected] <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Any best practices for error handling for file upload jobs?
>>>>>>>>>
>>>>>>>>> On Wed, Oct 2, 2024 at 1:04 PM [email protected] <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> STORAGE_API_AT_LEAST_ONCE only saves on Dataflow engine cost, but
>>>>>>>>>> the Storage API cost alone is too high for us; that's why we want
>>>>>>>>>> to switch to file upload.
>>>>>>>>>>
>>>>>>>>>> On Wed, Oct 2, 2024 at 12:08 PM XQ Hu via user <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Have you checked
>>>>>>>>>>> https://cloud.google.com/dataflow/docs/guides/write-to-bigquery?
>>>>>>>>>>>
>>>>>>>>>>> Autosharding is generally recommended. If cost is the concern,
>>>>>>>>>>> have you checked STORAGE_API_AT_LEAST_ONCE?
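>>>>>>>>>>>
>>>>>>>>>>> For reference, switching is a one-line change on an existing
>>>>>>>>>>> write transform (a sketch; write is your configured
>>>>>>>>>>> BigQueryIO.Write):
>>>>>>>>>>>
>>>>>>>>>>>     write.withMethod(
>>>>>>>>>>>         BigQueryIO.Write.Method.STORAGE_API_AT_LEAST_ONCE);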
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Oct 2, 2024 at 2:16 PM [email protected] <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> We are trying to process over 150TB of data (unbounded streaming)
>>>>>>>>>>>> per day and save it to BQ, and it looks like the Storage API is
>>>>>>>>>>>> not economical enough for us. I tried to use file upload but
>>>>>>>>>>>> somehow it doesn't work, and there is not much documentation for
>>>>>>>>>>>> the file upload method online. I have a few questions regarding
>>>>>>>>>>>> the file_upload method in streaming mode.
>>>>>>>>>>>> 1. How do I decide numOfFileShards? Can I still rely on
>>>>>>>>>>>> autosharding?
>>>>>>>>>>>> 2. I noticed the file loads method requires much more memory. I'm
>>>>>>>>>>>> not sure whether the Dataflow runner keeps all the data in memory
>>>>>>>>>>>> before writing it to files. If so, even one minute of data is too
>>>>>>>>>>>> much to keep in memory, and triggering more often than once a
>>>>>>>>>>>> minute would exceed the API quota. Is there a way to cap the
>>>>>>>>>>>> memory usage, e.g. by writing the data to files before triggering
>>>>>>>>>>>> the file load job?
>>>>>>>>>>>> 3. I also noticed that if a file upload job fails, I don't get
>>>>>>>>>>>> the error message. What can I do to handle the error, and what is
>>>>>>>>>>>> the best practice for error handling with the file_upload method?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks!
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Siyuan
>>>>>>>>>>>>
>>>>>>>>>>>
