No. To use AVRO you need to use either withAvroFormatFunction or withAvroWriter.
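A minimal sketch of that setup with streaming FILE_LOADS - the ClickEvent
class (standing in for your proto message), the field names, and the
destination table are hypothetical stand-ins, not anything from this thread:

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Arrays;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.beam.sdk.io.gcp.bigquery.AvroWriteRequest;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class AvroFileLoadsExample {

  // Hypothetical element type; substitute your own proto/POJO.
  public static class ClickEvent {
    public String userId;
    public long eventTs;
  }

  // Beam derives the Avro schema from the table schema supplied via
  // withSchema(...) and passes it in on each AvroWriteRequest.
  static GenericRecord toAvro(AvroWriteRequest<ClickEvent> request) {
    ClickEvent e = request.getElement();
    return new GenericRecordBuilder(request.getSchema())
        .set("user_id", e.userId)
        .set("event_ts", e.eventTs)
        .build();
  }

  static void writeToBigQuery(PCollection<ClickEvent> events) {
    TableSchema schema =
        new TableSchema()
            .setFields(
                Arrays.asList(
                    new TableFieldSchema().setName("user_id").setType("STRING"),
                    new TableFieldSchema().setName("event_ts").setType("INT64")));

    events.apply(
        BigQueryIO.<ClickEvent>write()
            .to("my-project:my_dataset.clicks") // hypothetical destination
            .withSchema(schema)
            .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
            // Streaming FILE_LOADS needs a triggering frequency plus either
            // withNumFileShards(n) or withAutoSharding().
            .withTriggeringFrequency(Duration.standardMinutes(5))
            .withAutoSharding()
            // Write Avro files instead of the default JSON.
            .withAvroFormatFunction(AvroFileLoadsExample::toAvro));
  }
}

(withAvroWriter is the alternative: instead of a per-element format
function, it takes a factory that builds an Avro DatumWriter for the
element type.)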
On Tue, Oct 8, 2024 at 11:09 AM [email protected] <[email protected]> wrote:

> Btw, file_load doesn't support proto directly?
>
> On Tue, Oct 8, 2024 at 11:02 AM Reuven Lax <[email protected]> wrote:
>
>> I would try to use AVRO if possible - it tends to decrease the file
>> size by quite a lot, and might get you under the limit for a single
>> load job, which is 11 TB or 10,000 files (depending on the frequency
>> at which you are triggering the loads). JSON tends to blow up the
>> data size quite a bit.
>>
>> BTW - is this using the Dataflow runner? If so, Beam should never
>> delete the temp tables until the copy job is completed.
>>
>> On Tue, Oct 8, 2024 at 10:49 AM [email protected] <[email protected]> wrote:
>>
>>> It is a streaming job.
>>>
>>> On Tue, Oct 8, 2024 at 10:40 AM Reuven Lax <[email protected]> wrote:
>>>
>>>> Is this a batch or streaming job?
>>>>
>>>> On Tue, Oct 8, 2024 at 10:25 AM [email protected] <[email protected]> wrote:
>>>>
>>>>> It looks like the COPY job failed because the TEMP table was
>>>>> removed. @Reuven Lax <[email protected]> Is that possible? Is there
>>>>> a way to avoid that? Or even better, is there a way to force
>>>>> writing to the destination table directly? Thanks!
>>>>>
>>>>> On Sun, Oct 6, 2024 at 12:35 PM Reuven Lax <[email protected]> wrote:
>>>>>
>>>>>> By default the file is in JSON format. You can provide a
>>>>>> formatter to allow it to be in AVRO format instead, which will
>>>>>> be more efficient.
>>>>>>
>>>>>> The temp tables are only created if file sizes are too large for
>>>>>> a single load into BQ (if you use an AVRO formatter you might be
>>>>>> able to reduce file size enough to avoid this). In this case,
>>>>>> Beam will issue a copy job to copy all the temp tables to the
>>>>>> final table.
>>>>>>
>>>>>> On Wed, Oct 2, 2024 at 2:42 PM [email protected] <[email protected]> wrote:
>>>>>>
>>>>>>> @Reuven Lax <[email protected]> I do see file_upload create tons
>>>>>>> of temp tables, but when does BQ load the temp tables into the
>>>>>>> final table?
>>>>>>>
>>>>>>> On Wed, Oct 2, 2024 at 1:17 PM Reuven Lax via user
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> File load does not return per-row errors (unlike the Storage
>>>>>>>> API, which does). Dataflow will generally retry the entire
>>>>>>>> file load on error (indefinitely for streaming, and up to 3
>>>>>>>> times for batch). You can look at the logs to find the
>>>>>>>> specific error; however, it can be tricky to associate it with
>>>>>>>> a specific row.
>>>>>>>>
>>>>>>>> Reuven
>>>>>>>>
>>>>>>>> On Wed, Oct 2, 2024 at 1:08 PM [email protected] <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Any best practice for error handling for a file upload job?
>>>>>>>>>
>>>>>>>>> On Wed, Oct 2, 2024 at 1:04 PM [email protected] <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> STORAGE_API_AT_LEAST_ONCE only saves Dataflow engine cost,
>>>>>>>>>> but the Storage API cost alone is too high for us; that's
>>>>>>>>>> why we want to switch to file upload.
>>>>>>>>>>
>>>>>>>>>> On Wed, Oct 2, 2024 at 12:08 PM XQ Hu via user
>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Have you checked
>>>>>>>>>>> https://cloud.google.com/dataflow/docs/guides/write-to-bigquery?
>>>>>>>>>>>
>>>>>>>>>>> Autosharding is generally recommended. If cost is the
>>>>>>>>>>> concern, have you checked STORAGE_API_AT_LEAST_ONCE?
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Oct 2, 2024 at 2:16 PM [email protected] <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> We are trying to process over 150 TB of data (streaming,
>>>>>>>>>>>> unbounded) per day and save it to BQ, and it looks like
>>>>>>>>>>>> the Storage API is not economical enough for us. I tried
>>>>>>>>>>>> to use file upload, but somehow it doesn't work, and there
>>>>>>>>>>>> is not much documentation for the file upload method
>>>>>>>>>>>> online. I have a few questions regarding the file_upload
>>>>>>>>>>>> method in streaming mode.
>>>>>>>>>>>>
>>>>>>>>>>>> 1. How do I decide numOfFileShards? Can I still rely on
>>>>>>>>>>>> autosharding?
>>>>>>>>>>>> 2. I noticed the file loads method requires much more
>>>>>>>>>>>> memory. I'm not sure if the Dataflow runner keeps all the
>>>>>>>>>>>> data in memory before writing it to files? If so, even one
>>>>>>>>>>>> minute of data is too much to keep in memory, and less
>>>>>>>>>>>> than one minute would exceed the API quota. Is there a way
>>>>>>>>>>>> to cap the memory usage, e.g. write the data to files
>>>>>>>>>>>> before triggering the file load job?
>>>>>>>>>>>> 3. I also noticed that if there is a file upload job
>>>>>>>>>>>> failure, I don't get the error message. So what can I do
>>>>>>>>>>>> to handle the error? What is the best practice in terms of
>>>>>>>>>>>> error handling in the file_upload method?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks!
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Siyuan
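For comparison, a minimal sketch of the STORAGE_API_AT_LEAST_ONCE path
mentioned above, including the per-row error handling that file loads do
not provide; the destination table is hypothetical, and this assumes a
recent Beam SDK that exposes getFailedStorageApiInserts:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryStorageApiInsertError;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.values.PCollection;

public class StorageApiExample {

  static void writeViaStorageApi(PCollection<TableRow> rows) {
    WriteResult result =
        rows.apply(
            BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.clicks") // hypothetical destination
                .withMethod(BigQueryIO.Write.Method.STORAGE_API_AT_LEAST_ONCE));

    // Unlike FILE_LOADS, the Storage Write API surfaces per-row failures,
    // so bad rows can be routed to a dead-letter sink instead of being
    // discoverable only through job logs.
    PCollection<BigQueryStorageApiInsertError> failedRows =
        result.getFailedStorageApiInserts();
    // e.g. failedRows.apply(...) to log them or write them out for inspection.
  }
}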
