Hi Lina,
Thanks for confirming that you are using WRITE_TRUNCATE - that's a bummer,
as truncating is not possible with the Storage Write API; it is append-only.
Since this is a batch job, I imagine your pipeline could have a step that
truncates the table beforehand to work around this limitation (e.g. via DML).
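A rough, untested sketch of such a pre-write step, using the
google-cloud-bigquery client outside the Beam pipeline (project_id, dataset
and table are placeholders for the values you already pass to WriteToBigQuery):

# placeholders - substitute your real identifiers
project_id, dataset, table = "your-project", "your_dataset", "your_table"

from google.cloud import bigquery

client = bigquery.Client(project=project_id)
# TRUNCATE TABLE removes all rows but keeps the table definition,
# including its partitioning spec, so a following WRITE_APPEND
# effectively replaces the contents.
client.query(f"TRUNCATE TABLE `{project_id}.{dataset}.{table}`").result()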
Keep in mind that the Storage Write API has two delivery modes - at-least-once
and exactly-once. For exactly-once I would recommend controlling the number of
streams (withNumStorageWriteApiStreams), as you may deplete the quota quickly.
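In the Python SDK the sink configuration would look roughly like this - a
sketch only, since use_at_least_once (and the exact knob for the number of
streams) depends on your Beam version, so please check the WriteToBigQuery
docs for the release you are on:

import apache_beam as beam
from apache_beam.io.gcp.bigquery import WriteToBigQuery, BigQueryDisposition

# placeholders - substitute your real identifiers
project_id, dataset, table = "your-project", "your_dataset", "your_table"

with beam.Pipeline() as p:
    (
        p
        | beam.Create([{"id": 1, "name": "example"}])  # stand-in for your real PCollection
        | "Write to BigQuery" >> WriteToBigQuery(
            table=f"{project_id}:{dataset}.{table}",  # pre-created, partitioned table
            schema="id:INTEGER,name:STRING",  # or your schema callable / side input
            method=WriteToBigQuery.Method.STORAGE_WRITE_API,
            # exactly-once semantics is the default; use_at_least_once=True relaxes it
            use_at_least_once=False,
            create_disposition=BigQueryDisposition.CREATE_NEVER,
            write_disposition=BigQueryDisposition.WRITE_APPEND,
        )
    )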
The Storage Write API in the Python SDK uses a cross-language transform - on
Dataflow it will work right away - let me know how you run your pipeline.

Radek

On Wed, Mar 19, 2025 at 6:12 PM Lina Mårtensson <lina@camus.energy> wrote:

> Thanks Radek!
>
> I didn't realize that writing is done with a copy job - then I understand
> why we need to configure partitioning as well. And that all makes sense.
>
> We haven't tried the storage write API - that wasn't available for Python
> yet when we started doing this. I will take a look at it and see if that
> might work for us and get back to you. Would using the storage write API
> allow us to write to a partitioned table without specifying so?
> I should mention as well that streaming the data would be out of the
> question, since we need to use mutating DML statements as well.
>
> Our call looks like this:
> "Write to BigQuery" >> WriteToBigQuery(
>
>     project=self.project_id,
>
>     table=self.get_table_name,
>
>     dataset=self.dataset,
>     schema=self.get_schema,
>
>     schema_side_inputs=(beam.pvalue.AsDict(types_for_schema),),
>
>     create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
>     write_disposition=(
>
>         BigQueryDisposition.WRITE_TRUNCATE
>
>         if self.overwrite
>
>         else BigQueryDisposition.WRITE_APPEND
>
>     ),
>     additional_bq_parameters={
>         "schemaUpdateOptions": ["ALLOW_FIELD_ADDITION"],
>     }
> )
>
> On Wed, Mar 19, 2025 at 3:12 AM Radek Stankiewicz <radosl...@google.com>
> wrote:
>
>> hi Lina,
>> there are multiple reasons why a copy job is used with a temporary table:
>> - you may be using dynamic destinations
>> - you are loading lots of data, probably with truncate
>> This way we ensure atomicity, as we can trigger a copy from multiple temp
>> tables into one final table.
>> Can you confirm or paste a snippet of how you configured
>> apache_beam.io.gcp.bigquery.WriteToBigQuery?
>>
>> BigQuery doesn't allow copying non-partitioned tables into partitioned
>> tables - it's a BQ limitation.
>>
>> Have you tried other loading methods, e.g. the Storage Write API?
>>
>> Radek
>>
>> On Wed, Mar 19, 2025 at 5:42 AM Lina Mårtensson via user <
>> user@beam.apache.org> wrote:
>>
>>> Hi,
>>>
>>> We have, by now, a large set of different Beam jobs all written in
>>> Python that all write to a set of BigQuery tables that more or less behave
>>> the same way in a single dataset. These tables aren't partitioned at all,
>>> but going forward, we need them to be.
>>>
>>> I partitioned a single table to start with, and was very surprised to
>>> find that a Beam job that wrote to it couldn't do so:
>>>
>>> Failed to copy Non partitioned table to Column partitioned table: not
>>> supported.
>>>
>>>
>>> We have a bunch of pre-created tables, and I would've thought I could
>>> just keep writing to those without changing settings even after setting up
>>> partitioning on them. It doesn't seem to matter whether the
>>> create_disposition is CREATE_IF_NEEDED or CREATE_NEVER.
>>> It does work when I set the additional_bq_parameters to add
>>> time_partitioning, but it would be a huge undertaking not only to
>>> update all of our currently running jobs across many projects, but also to
>>> make sure to synchronize these changes with updating the underlying
>>> BigQuery tables. And it doesn't seem like it should be necessary to specify
>>> if we're not creating any new tables?
>>>
>>> Is there any way to just write the data we have to pre-created,
>>> partitioned tables without having to set time_partitioning in
>>> additional_bq_parameters, or potentially if there's some other
>>> recommended way to solve this problem?
>>>
>>> Thanks,
>>> -Lina
>>>
>>
