I would try to use AVRO if possible - it tends to decrease the file size by quite a lot, and might get you under the limit for a single load job, which is 11TB or 10,000 files (depending on the frequency at which you are triggering the loads). JSON tends to blow up the data size quite a bit.
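In code, the switch is roughly the untested sketch below - MyEvent, toGenericRecord, the table reference, and the triggering frequency are placeholders, not anything from your pipeline:

import com.google.api.services.bigquery.model.TableSchema;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.io.gcp.bigquery.AvroWriteRequest;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

class AvroFileLoadsSketch {
  // Placeholder element type; use your pipeline's own.
  static class MyEvent {}

  static void writeWithAvroFileLoads(PCollection<MyEvent> events, TableSchema tableSchema) {
    events.apply(
        "WriteToBQ",
        BigQueryIO.<MyEvent>write()
            .to("my-project:my_dataset.my_table")                  // placeholder destination
            .withSchema(tableSchema)
            .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
            .withTriggeringFrequency(Duration.standardMinutes(10)) // how often load jobs fire
            .withAutoSharding()                                    // or .withNumFileShards(n)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
            // Stage Avro files instead of the default JSON:
            .withAvroFormatFunction(
                (AvroWriteRequest<MyEvent> req) ->
                    toGenericRecord(req.getElement(), req.getSchema())));
  }

  // Hypothetical converter from your element type to a GenericRecord that
  // matches the Avro schema Beam derives from the table schema.
  static GenericRecord toGenericRecord(MyEvent event, Schema avroSchema) {
    throw new UnsupportedOperationException("map your fields here");
  }
}

Whether you pin withNumFileShards or rely on withAutoSharding, the triggering frequency is what bounds how much data each load job picks up, so it's the first knob to adjust if you're bumping into the per-load limit.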
BTW - is this using the Dataflow runner? If so, Beam should never delete the temp tables until the copy job is completed.

On Tue, Oct 8, 2024 at 10:49 AM [email protected] <[email protected]> wrote:

> It is a streaming job.
>
> On Tue, Oct 8, 2024 at 10:40 AM Reuven Lax <[email protected]> wrote:
>
>> Is this a batch or streaming job?
>>
>> On Tue, Oct 8, 2024 at 10:25 AM [email protected] <[email protected]> wrote:
>>
>>> It looks like the COPY job failed because the TEMP table was removed. @Reuven Lax <[email protected]> Is that possible? Is there a way to avoid that? Or, even better, is there a way to force writing to the destination table directly? Thanks!
>>>
>>> On Sun, Oct 6, 2024 at 12:35 PM Reuven Lax <[email protected]> wrote:
>>>
>>>> By default the file is in JSON format. You can provide a formatter to allow it to be in AVRO format instead, which will be more efficient.
>>>>
>>>> The temp tables are only created if file sizes are too large for a single load into BQ (if you use an AVRO formatter you might be able to reduce file size enough to avoid this). In this case, Beam will issue a copy job to copy all the temp tables to the final table.
>>>>
>>>> On Wed, Oct 2, 2024 at 2:42 PM [email protected] <[email protected]> wrote:
>>>>
>>>>> @Reuven Lax <[email protected]> I do see file_upload create tons of temp tables, but when does BQ load the temp tables into the final table?
>>>>>
>>>>> On Wed, Oct 2, 2024 at 1:17 PM Reuven Lax via user <[email protected]> wrote:
>>>>>
>>>>>> File load does not return per-row errors (unlike the storage API, which does). Dataflow will generally retry the entire file load on error (indefinitely for streaming and up to 3 times for batch). You can look at the logs to find the specific error; however, it can be tricky to associate it with a specific row.
>>>>>>
>>>>>> Reuven
>>>>>>
>>>>>> On Wed, Oct 2, 2024 at 1:08 PM [email protected] <[email protected]> wrote:
>>>>>>
>>>>>>> Any best practice for error handling for a file upload job?
>>>>>>>
>>>>>>> On Wed, Oct 2, 2024 at 1:04 PM [email protected] <[email protected]> wrote:
>>>>>>>
>>>>>>>> STORAGE_API_AT_LEAST_ONCE only saves Dataflow engine cost, but the storage api cost alone is too high for us; that's why we want to switch to file upload.
>>>>>>>>
>>>>>>>> On Wed, Oct 2, 2024 at 12:08 PM XQ Hu via user <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Have you checked https://cloud.google.com/dataflow/docs/guides/write-to-bigquery?
>>>>>>>>>
>>>>>>>>> autosharding is generally recommended. If the cost is the concern, have you checked STORAGE_API_AT_LEAST_ONCE?
>>>>>>>>>
>>>>>>>>> On Wed, Oct 2, 2024 at 2:16 PM [email protected] <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> We are trying to process over 150TB of data (streaming, unbounded) per day and save it to BQ, and it looks like the storage api is not economical enough for us. I tried to use file upload, but somehow it doesn't work, and there are not many documents for the file upload method online. I have a few questions regarding the file_upload method in streaming mode.
>>>>>>>>>> 1. How do I decide numOfFileShards? Can I still rely on autosharding?
>>>>>>>>>> 2. I noticed the fileloads method requires much more memory. I'm not sure if the dataflow runner keeps all the data in memory before writing to file? If so, even one minute of data is too much to keep in memory, and less than one minute would exceed the api quota. Is there a way to cap the memory usage, like writing data to files before triggering the file load job?
>>>>>>>>>> 3. I also noticed that if there is a file upload job failure, I don't get the error message, so what can I do to handle the error? What is the best practice for error handling with the file_upload method?
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>> Regards,
>>>>>>>>>> Siyuan
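PS for anyone else landing on this thread: a rough, untested sketch of the STORAGE_API_AT_LEAST_ONCE configuration XQ Hu mentions above, in case you want to benchmark its cost against FILE_LOADS. The element type, format function, and table/project names are placeholders, not anything from this thread.

import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.values.PCollection;

class AtLeastOnceSketch {
  // Placeholder element type; use your pipeline's own.
  static class MyEvent {}

  static void writeWithStorageApiAtLeastOnce(PCollection<MyEvent> events, TableSchema tableSchema) {
    events.apply(
        "WriteToBQ",
        BigQueryIO.<MyEvent>write()
            .to("my-project:my_dataset.my_table")        // placeholder destination
            .withSchema(tableSchema)
            .withFormatFunction(AtLeastOnceSketch::toTableRow)
            // At-least-once Storage Write API: no staged files, load jobs, or
            // temp-table copy step, but ingestion is billed through the
            // Storage Write API rather than running as free batch loads.
            .withMethod(BigQueryIO.Write.Method.STORAGE_API_AT_LEAST_ONCE));
  }

  // Hypothetical converter from your element type to a TableRow.
  static TableRow toTableRow(MyEvent event) {
    return new TableRow(); // map your fields here
  }
}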
