It looks like the COPY job failed because the TEMP table was removed. @Reuven Lax <[email protected]>, is that possible? Is there a way to avoid it? Or, even better, is there a way to force writing to the destination table directly? Thanks!
On Sun, Oct 6, 2024 at 12:35 PM Reuven Lax <[email protected]> wrote:

> By default the file is in JSON format. You can provide a formatter to
> allow it to be in Avro format instead, which will be more efficient.
>
> The temp tables are only created if file sizes are too large for a single
> load into BQ (if you use an Avro formatter you might be able to reduce
> file size enough to avoid this). In this case, Beam will issue a copy job
> to copy all the temp tables to the final table.
>
> On Wed, Oct 2, 2024 at 2:42 PM [email protected] <[email protected]> wrote:
>
>> @Reuven Lax <[email protected]> I do see that file_upload creates tons of
>> temp tables, but when does BQ load the temp tables into the final table?
>>
>> On Wed, Oct 2, 2024 at 1:17 PM Reuven Lax via user <[email protected]>
>> wrote:
>>
>>> File load does not return per-row errors (unlike the Storage API, which
>>> does). Dataflow will generally retry the entire file load on error
>>> (indefinitely for streaming, and up to 3 times for batch). You can look
>>> at the logs to find the specific error; however, it can be tricky to
>>> associate it with a specific row.
>>>
>>> Reuven
>>>
>>> On Wed, Oct 2, 2024 at 1:08 PM [email protected] <[email protected]>
>>> wrote:
>>>
>>>> Are there any best practices for error handling for a file upload job?
>>>>
>>>> On Wed, Oct 2, 2024 at 1:04 PM [email protected] <[email protected]>
>>>> wrote:
>>>>
>>>>> STORAGE_API_AT_LEAST_ONCE only saves Dataflow engine cost, but the
>>>>> Storage API cost alone is too high for us; that's why we want to
>>>>> switch to file upload.
>>>>>
>>>>> On Wed, Oct 2, 2024 at 12:08 PM XQ Hu via user <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Have you checked
>>>>>> https://cloud.google.com/dataflow/docs/guides/write-to-bigquery?
>>>>>>
>>>>>> Autosharding is generally recommended. If cost is the concern,
>>>>>> have you checked STORAGE_API_AT_LEAST_ONCE?
>>>>>>
>>>>>> On Wed, Oct 2, 2024 at 2:16 PM [email protected] <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> We are trying to process over 150 TB of data per day (streaming,
>>>>>>> unbounded) and save it to BQ, and it looks like the Storage API is
>>>>>>> not economical enough for us. I tried to use file upload, but somehow
>>>>>>> it doesn't work, and there isn't much documentation for the file
>>>>>>> upload method online. I have a few questions about the file_upload
>>>>>>> method in streaming mode.
>>>>>>> 1. How do I decide numFileShards? Can I still rely on autosharding?
>>>>>>> 2. I noticed the file loads method requires much more memory. I'm
>>>>>>> not sure whether the Dataflow runner keeps all the data in memory
>>>>>>> before writing it to files. If so, even one minute of data is too
>>>>>>> much to keep in memory, and less than one minute would exceed the
>>>>>>> API quota. Is there a way to cap memory usage, e.g. write data to
>>>>>>> files before triggering the file load job?
>>>>>>> 3. I also noticed that if a file upload job fails, I don't get the
>>>>>>> error message. What can I do to handle the error, and what is the
>>>>>>> best practice for error handling with the file_upload method?
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Regards,
>>>>>>> Siyuan
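
For anyone following the thread, here is a minimal sketch of the file-loads setup being discussed, using the Beam Java SDK: FILE_LOADS with an Avro format function, a triggering frequency, and autosharding. The table name, schema, MyEvent type, field mapping, triggering frequency, and shard settings are placeholders of mine, not the original poster's pipeline:

  import com.google.api.services.bigquery.model.TableFieldSchema;
  import com.google.api.services.bigquery.model.TableSchema;
  import java.util.Collections;
  import org.apache.avro.generic.GenericData;
  import org.apache.avro.generic.GenericRecord;
  import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
  import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
  import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method;
  import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
  import org.apache.beam.sdk.values.PCollection;
  import org.joda.time.Duration;

  public class FileLoadsSketch {

    // Hypothetical element type, for illustration only.
    public static class MyEvent implements java.io.Serializable {
      public String message;
    }

    public static void writeToBigQuery(PCollection<MyEvent> events) {
      // Hypothetical destination schema.
      TableSchema schema = new TableSchema().setFields(Collections.singletonList(
          new TableFieldSchema().setName("message").setType("STRING")));

      events.apply("WriteToBigQuery",
          BigQueryIO.<MyEvent>write()
              .to("my-project:my_dataset.my_table")
              .withSchema(schema)
              // Use load jobs instead of the Storage Write API.
              .withMethod(Method.FILE_LOADS)
              // Write the temp files as Avro rather than JSON; smaller files make it
              // more likely each load fits in a single job (fewer temp tables / copy jobs).
              .withAvroFormatFunction(request -> {
                GenericRecord record = new GenericData.Record(request.getSchema());
                record.put("message", request.getElement().message);
                return record;
              })
              // In streaming mode, how often load jobs are issued.
              .withTriggeringFrequency(Duration.standardMinutes(5))
              // Let the runner choose the number of file shards...
              .withAutoSharding()
              // ...or set it explicitly instead: .withNumFileShards(100)
              .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
              .withWriteDisposition(WriteDisposition.WRITE_APPEND));
    }
  }

As noted earlier in the thread, autosharding is generally recommended; an explicit withNumFileShards is the alternative when you want to control the shard count yourself.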
