It is a streaming job.

On Tue, Oct 8, 2024 at 10:40 AM Reuven Lax <[email protected]> wrote:
> Is this a batch or streaming job?

On Tue, Oct 8, 2024 at 10:25 AM [email protected] <[email protected]> wrote:
> It looks like the COPY job failed because the TEMP table was removed. @Reuven Lax <[email protected]> Is that possible? Is there a way to avoid that? Or, even better, is there a way to force writing to the destination table directly? Thanks!

On Sun, Oct 6, 2024 at 12:35 PM Reuven Lax <[email protected]> wrote:
> By default the files are in JSON format. You can provide a formatter to write them in Avro format instead, which will be more efficient.
>
> The temp tables are only created if the file sizes are too large for a single load into BQ (if you use an Avro formatter you might be able to reduce file sizes enough to avoid this). In that case, Beam will issue a copy job to copy all the temp tables into the final table.
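As a concrete illustration of the Avro suggestion, here is a minimal sketch using the Beam Java SDK. MyEvent, its fields, and the table spec are hypothetical placeholders rather than anything from this thread, and in streaming mode a FILE_LOADS write like this still needs a triggering frequency and sharding settings (shown further down the thread).

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableSchema;
import java.io.Serializable;
import java.util.Arrays;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;

// Hypothetical element type, for illustration only.
class MyEvent implements Serializable {
  String userId;
  long tsMillis;
}

// BigQuery schema matching MyEvent (illustrative).
static TableSchema eventSchema() {
  return new TableSchema().setFields(Arrays.asList(
      new TableFieldSchema().setName("user_id").setType("STRING"),
      new TableFieldSchema().setName("ts_millis").setType("INT64")));
}

// A file-loads write that stages Avro files instead of the default JSON.
static BigQueryIO.Write<MyEvent> avroFileLoadsWrite() {
  return BigQueryIO.<MyEvent>write()
      .to("my-project:my_dataset.my_table")      // illustrative table spec
      .withSchema(eventSchema())
      .withMethod(Method.FILE_LOADS)
      // Only takes effect for FILE_LOADS: format each element as an Avro record.
      .withAvroFormatFunction(request -> {
        MyEvent e = request.getElement();
        // request.getSchema() is the Avro schema Beam derives from the table schema.
        return new GenericRecordBuilder(request.getSchema())
            .set("user_id", e.userId)
            .set("ts_millis", e.tsMillis)
            .build();
      })
      .useAvroLogicalTypes()
      .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
      .withWriteDisposition(WriteDisposition.WRITE_APPEND);
}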
On Wed, Oct 2, 2024 at 2:42 PM [email protected] <[email protected]> wrote:
> @Reuven Lax <[email protected]> I do see the file_upload method create tons of temp tables, but when does BQ load the temp tables into the final table?

On Wed, Oct 2, 2024 at 1:17 PM Reuven Lax via user <[email protected]> wrote:
> File load does not return per-row errors (unlike the Storage API, which does). Dataflow will generally retry the entire file load on error (indefinitely for streaming and up to 3 times for batch). You can look at the logs to find the specific error; however, it can be tricky to associate it with a specific row.
>
> Reuven

On Wed, Oct 2, 2024 at 1:08 PM [email protected] <[email protected]> wrote:
> Any best practice for error handling for a file upload job?

On Wed, Oct 2, 2024 at 1:04 PM [email protected] <[email protected]> wrote:
> STORAGE_API_AT_LEAST_ONCE only saves Dataflow engine cost, but the Storage API cost alone is too high for us; that's why we want to switch to file upload.

On Wed, Oct 2, 2024 at 12:08 PM XQ Hu via user <[email protected]> wrote:
> Have you checked https://cloud.google.com/dataflow/docs/guides/write-to-bigquery?
>
> Autosharding is generally recommended. If cost is the concern, have you checked STORAGE_API_AT_LEAST_ONCE?

On Wed, Oct 2, 2024 at 2:16 PM [email protected] <[email protected]> wrote:
> We are trying to process over 150 TB of data per day (streaming, unbounded) and save it to BQ, and it looks like the Storage API is not economical enough for us. I tried to use file upload, but somehow it doesn't work, and there is not much documentation on the file upload method online. I have a few questions regarding the file_upload method in streaming mode.
>
> 1. How do I decide numOfFileShards? Can I still rely on autosharding?
> 2. I noticed the file loads method requires much more memory. I'm not sure whether the Dataflow runner keeps all the data in memory before writing it to files? If so, even one minute of data is too much to keep in memory, and triggering more often than once a minute would exceed the API quota. Is there a way to cap memory usage, e.g. write the data to files before triggering the file load job?
> 3. I also noticed that if a file upload job fails, I don't get the error message. What can I do to handle the error, and what is the best practice for error handling with the file_upload method?
>
> Thanks!
> Regards,
> Siyuan
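Regarding questions 1 and 2 in the message above, here is a minimal sketch of how the sharding and triggering knobs fit together, building on the avroFileLoadsWrite helper sketched earlier in the thread. The shard count, interval, and names are illustrative assumptions, not a tuned configuration.

import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Streaming FILE_LOADS needs a triggering frequency plus either a fixed
// shard count or auto-sharding (question 1).
static WriteResult writeWithFileLoads(PCollection<MyEvent> events) {
  return events.apply(
      "FileLoadsToBQ",
      avroFileLoadsWrite()
          // Each firing turns the files staged since the last firing into a
          // BigQuery load job; longer intervals mean fewer load jobs counted
          // against BigQuery's load-job quotas, at the cost of latency.
          .withTriggeringFrequency(Duration.standardMinutes(5))
          // Either pin the shard count explicitly (more shards -> more,
          // smaller files per firing)...
          .withNumFileShards(200));
          // ...or replace withNumFileShards(200) with .withAutoSharding() to
          // let the runner choose the sharding dynamically.
}

Whether this keeps worker memory bounded in the way question 2 asks is something to verify on a real job; the sketch only shows where the knobs go, not how the runner buffers data before writing the files.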
