Is this a batch or streaming job?

On Tue, Oct 8, 2024 at 10:25 AM [email protected] <[email protected]> wrote:

> It looks like the COPY job failed because the TEMP table was removed.
> @Reuven Lax <[email protected]> Is that possible? Is there a way to avoid
> that? Or, even better, is there a way to force writing to the destination
> table directly? Thanks!
>
> On Sun, Oct 6, 2024 at 12:35 PM Reuven Lax <[email protected]> wrote:
>
>> By default the files are in JSON format. You can provide a formatter to
>> write them in Avro format instead, which will be more efficient.
>>
>> The temp tables are only created if file sizes are too large for a
>> single load into BQ (if you use an Avro formatter you might be able to
>> reduce file sizes enough to avoid this). In that case, Beam will issue a
>> copy job to copy all the temp tables to the final table.
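
A minimal sketch of supplying an Avro formatter with the Java SDK's
BigQueryIO.Write, as described above. The MyEvent element type, its fields,
and the destination table are hypothetical; events is assumed to be an
existing PCollection<MyEvent> and tableSchema the destination TableSchema.

    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.io.gcp.bigquery.AvroWriteRequest;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;

    events.apply("WriteToBQ",
        BigQueryIO.<MyEvent>write()
            .to("my-project:my_dataset.my_table")            // hypothetical table
            .withSchema(tableSchema)
            .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
            // Stage the load files as Avro instead of the default JSON.
            .withAvroFormatFunction(
                (AvroWriteRequest<MyEvent> request) -> {
                  // Beam derives the Avro schema from the table schema.
                  GenericRecord record = new GenericData.Record(request.getSchema());
                  MyEvent e = request.getElement();
                  record.put("user_id", e.getUserId());      // hypothetical fields
                  record.put("ts", e.getTimestampMillis());
                  return record;
                }));

Smaller Avro files may also keep each load under the single-load size limit,
avoiding the temp-table/copy path mentioned above.
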
>> On Wed, Oct 2, 2024 at 2:42 PM [email protected] <[email protected]> wrote:
>>
>>> @Reuven Lax <[email protected]> I do see that the file_upload method
>>> creates tons of temp tables, but when does BQ load the temp tables into
>>> the final table?
>>>
>>> On Wed, Oct 2, 2024 at 1:17 PM Reuven Lax via user <[email protected]>
>>> wrote:
>>>
>>>> File loads do not return per-row errors (unlike the Storage API, which
>>>> does). Dataflow will generally retry the entire file load on error
>>>> (indefinitely for streaming and up to 3 times for batch). You can look
>>>> at the logs to find the specific error; however, it can be tricky to
>>>> associate it with a specific row.
>>>>
>>>> Reuven
>>>>
>>>> On Wed, Oct 2, 2024 at 1:08 PM [email protected] <[email protected]>
>>>> wrote:
>>>>
>>>>> Any best practice for error handling for file upload jobs?
>>>>>
>>>>> On Wed, Oct 2, 2024 at 1:04 PM [email protected] <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> STORAGE_API_AT_LEAST_ONCE only saves Dataflow engine cost, but the
>>>>>> Storage API cost alone is too high for us; that's why we want to
>>>>>> switch to file upload.
>>>>>>
>>>>>> On Wed, Oct 2, 2024 at 12:08 PM XQ Hu via user <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Have you checked
>>>>>>> https://cloud.google.com/dataflow/docs/guides/write-to-bigquery?
>>>>>>>
>>>>>>> Autosharding is generally recommended. If cost is the concern, have
>>>>>>> you checked STORAGE_API_AT_LEAST_ONCE?
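
For reference, the at-least-once Storage Write API option referred to above
is just a method change on the same write. Same hypothetical MyEvent, events,
and tableSchema as in the earlier sketch.

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;

    events.apply("WriteViaStorageApi",
        BigQueryIO.<MyEvent>write()
            .to("my-project:my_dataset.my_table")
            .withSchema(tableSchema)
            .withFormatFunction(
                e -> new TableRow()
                    .set("user_id", e.getUserId())           // hypothetical fields
                    .set("ts", e.getTimestampMillis()))
            // At-least-once Storage Write API: saves Dataflow-side cost
            // relative to exactly-once STORAGE_WRITE_API and, unlike
            // FILE_LOADS, needs no triggering frequency or file-shard tuning.
            .withMethod(BigQueryIO.Write.Method.STORAGE_API_AT_LEAST_ONCE));
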
>>>>>>> On Wed, Oct 2, 2024 at 2:16 PM [email protected] <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> We are trying to process over 150 TB of data per day (streaming,
>>>>>>>> unbounded) and save it to BQ, and it looks like the Storage API is
>>>>>>>> not economical enough for us. I tried to use file upload but
>>>>>>>> somehow it doesn't work, and there is not much documentation for
>>>>>>>> the file upload method online. I have a few questions regarding
>>>>>>>> the file_upload method in streaming mode.
>>>>>>>> 1. How do I decide numFileShards? Can I still rely on autosharding?
>>>>>>>> 2. I noticed the file loads method requires much more memory. I'm
>>>>>>>> not sure whether the Dataflow runner keeps all the data in memory
>>>>>>>> before writing it to files. If so, even one minute of data is too
>>>>>>>> much to keep in memory, and less than one minute would exceed the
>>>>>>>> API quota. Is there a way to cap the memory usage, such as writing
>>>>>>>> data to files before triggering the file load job?
>>>>>>>> 3. I also noticed that if a file upload job fails, I don't get the
>>>>>>>> error message, so what can I do to handle the error? What is the
>>>>>>>> best practice for error handling with the file_upload method?
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> Regards,
>>>>>>>> Siyuan
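
A minimal sketch of a streaming FILE_LOADS configuration touching questions
1 and 2 above, with the same hypothetical MyEvent, events, and tableSchema as
in the sketches earlier in the thread; the triggering frequency and shard
count are placeholder values to tune, not recommendations.

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.joda.time.Duration;

    events.apply("WriteViaFileLoads",
        BigQueryIO.<MyEvent>write()
            .to("my-project:my_dataset.my_table")            // hypothetical table
            .withSchema(tableSchema)
            .withFormatFunction(
                e -> new TableRow()
                    .set("user_id", e.getUserId())           // hypothetical fields
                    .set("ts", e.getTimestampMillis()))
            .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
            // Required for FILE_LOADS on an unbounded input: how often a load
            // job is kicked off for the files staged since the previous one.
            .withTriggeringFrequency(Duration.standardMinutes(5))
            // Staged files are written under the pipeline's GCS temp location
            // (override with .withCustomGcsTempLocation(...) if needed).
            // Fixed shard count; alternatively drop this line and call
            // .withAutoSharding() to let the runner choose.
            .withNumFileShards(100));

Whether .withAutoSharding() is available for FILE_LOADS depends on the runner
and Beam version, so verify it against the write-to-bigquery guide linked
above.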
