Our requirement is to process 150 TB of data every day, so I don't think the AVRO format alone would make a huge difference? Also, is there any example of how to debug when the COPY/LOAD job fails? I don't get any log detail to know what went wrong in the job; it simply tells me the job failed, and when I try bq show -j or look in the UI, I don't even see those jobs. And if the job fails, how does Dataflow clean up those temp tables?
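For reference, a minimal sketch of one way to enumerate recent jobs in the target project and surface their error results, using the google-cloud-bigquery Java client. It assumes Application Default Credentials pointing at the project the pipeline writes to; the class name is just illustrative.

  import com.google.api.gax.paging.Page;
  import com.google.cloud.bigquery.BigQuery;
  import com.google.cloud.bigquery.BigQueryOptions;
  import com.google.cloud.bigquery.Job;

  public class ListFailedBqJobs {
    public static void main(String[] args) {
      // Uses Application Default Credentials and the default project.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      // Dataflow submits the load/copy jobs as the worker service account,
      // so list jobs from all users in the project, not just your own.
      Page<Job> jobs = bigquery.listJobs(
          BigQuery.JobListOption.allUsers(),
          BigQuery.JobListOption.pageSize(100));

      for (Job job : jobs.iterateAll()) {
        if (job.getStatus() != null && job.getStatus().getError() != null) {
          System.out.printf("%s failed: %s%n",
              job.getJobId().getJob(),
              job.getStatus().getError().getMessage());
        }
      }
    }
  }

The same point may explain why the jobs don't show up on the CLI: because they are owned by the worker service account, they typically only appear with bq ls -j -a (all users) in the project the pipeline writes to, and then bq show -j <job_id> on the IDs found there.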
On Sun, Oct 6, 2024 at 12:35 PM Reuven Lax <[email protected]> wrote:

> By default the file is in JSON format. You can provide a formatter to
> allow it to be in AVRO format instead, which will be more efficient.
>
> The temp tables are only created if file sizes are too large for a single
> load into BQ (if you use an AVRO formatter you might be able to reduce file
> sizes enough to avoid this). In this case, Beam will issue a copy job to
> copy all the temp tables to the final table.
>
> On Wed, Oct 2, 2024 at 2:42 PM [email protected] <[email protected]> wrote:
>
>> @Reuven Lax <[email protected]> I do see the file_upload method create tons
>> of temp tables, but when does BQ load the temp tables into the final table?
>>
>> On Wed, Oct 2, 2024 at 1:17 PM Reuven Lax via user <[email protected]> wrote:
>>
>>> File loads do not return per-row errors (unlike the Storage API, which
>>> does). Dataflow will generally retry the entire file load on error
>>> (indefinitely for streaming and up to 3 times for batch). You can look at
>>> the logs to find the specific error, however it can be tricky to associate
>>> it with a specific row.
>>>
>>> Reuven
>>>
>>> On Wed, Oct 2, 2024 at 1:08 PM [email protected] <[email protected]> wrote:
>>>
>>>> Any best practices for error handling for the file upload job?
>>>>
>>>> On Wed, Oct 2, 2024 at 1:04 PM [email protected] <[email protected]> wrote:
>>>>
>>>>> STORAGE_API_AT_LEAST_ONCE only saves Dataflow engine cost, but the
>>>>> Storage API cost alone is too high for us; that's why we want to switch
>>>>> to file upload.
>>>>>
>>>>> On Wed, Oct 2, 2024 at 12:08 PM XQ Hu via user <[email protected]> wrote:
>>>>>
>>>>>> Have you checked
>>>>>> https://cloud.google.com/dataflow/docs/guides/write-to-bigquery?
>>>>>>
>>>>>> Autosharding is generally recommended. If cost is the concern,
>>>>>> have you checked STORAGE_API_AT_LEAST_ONCE?
>>>>>>
>>>>>> On Wed, Oct 2, 2024 at 2:16 PM [email protected] <[email protected]> wrote:
>>>>>>
>>>>>>> We are trying to process over 150 TB of data (streaming, unbounded)
>>>>>>> per day and save it to BQ, and it looks like the Storage API is not
>>>>>>> economical enough for us. I tried to use file upload but somehow it
>>>>>>> doesn't work, and there are not many documents for the file upload
>>>>>>> method online. I have a few questions regarding the file_upload
>>>>>>> method in streaming mode.
>>>>>>> 1. How do I decide numFileShards? Can I still rely on autosharding?
>>>>>>> 2. I noticed the file loads method requires much more memory. I'm not
>>>>>>> sure if the Dataflow runner keeps all the data in memory before
>>>>>>> writing to files? If so, even one minute of data is too much to keep
>>>>>>> in memory, and less than one minute would exceed the API quota. Is
>>>>>>> there a way to cap the memory usage, like writing data to files
>>>>>>> before triggering the file load job?
>>>>>>> 3. I also noticed that if there is a file upload job failure, I don't
>>>>>>> get the error message, so what can I do to handle the error? What is
>>>>>>> the best practice for error handling in the file_upload method?
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Regards,
>>>>>>> Siyuan
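A rough sketch of the FILE_LOADS configuration discussed above, using the Beam Java SDK. The event type MyEvent, the Avro converter toGenericRecord, and MY_TABLE_SCHEMA are placeholders, and the specific triggering frequency and shard count are assumptions to tune, not recommendations.

  import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
  import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
  import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method;
  import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
  import org.apache.beam.sdk.values.PCollection;
  import org.joda.time.Duration;

  // events is the unbounded PCollection produced upstream; MyEvent,
  // toGenericRecord and MY_TABLE_SCHEMA are placeholders for your own types.
  static void writeToBigQuery(PCollection<MyEvent> events) {
    events.apply("WriteToBigQuery",
        BigQueryIO.<MyEvent>write()
            .to("my-project:my_dataset.my_table")
            .withMethod(Method.FILE_LOADS)
            // In streaming, FILE_LOADS buffers records to files and flushes
            // them into load jobs on this cadence; larger values mean fewer,
            // bigger load jobs (there is a daily load-job quota per table).
            .withTriggeringFrequency(Duration.standardMinutes(5))
            // Pick one: an explicit shard count or .withAutoSharding().
            .withNumFileShards(100)
            // Stage Avro files instead of the default JSON to shrink file sizes.
            .withAvroFormatFunction(
                req -> toGenericRecord(req.getElement(), req.getSchema()))
            .withSchema(MY_TABLE_SCHEMA)
            .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(WriteDisposition.WRITE_APPEND));
  }

Whether Avro staging keeps the per-trigger files small enough to avoid the temp-table plus copy-job path mentioned above depends on the data volume per trigger, so it is something to verify rather than assume.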
