In my experience writing to BQ via BigQueryIO in the Java SDK, the
bottleneck tends to be disk I/O. The BigQueryIO logic requires several
shuffles that trigger checkpointing even in the streaming-inserts case,
which on Dataflow means writing to disk. I assume the Python logic is
similar, but I don't know for sure.
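
For intuition, an explicit Reshuffle in user code has the same effect: it
forces a fusion break and, on the Dataflow runner, a checkpoint of the
shuffled data. A minimal, illustrative-only sketch (not BigQueryIO's actual
internals):

    import apache_beam as beam

    with beam.Pipeline() as p:
        (p
         | "Create" >> beam.Create([{"id": 1}, {"id": 2}])
         # Reshuffle forces a fusion break; on Dataflow the shuffled
         # elements are checkpointed, i.e. written to worker disk unless
         # Streaming Engine handles the shuffle.
         | "Checkpoint" >> beam.Reshuffle()
         | "Print" >> beam.Map(print))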

If this is the case for you, you may see significantly improved performance
by provisioning SSDs for your workers or by opting into the Dataflow
Streaming Engine, which moves shuffle and state storage off the workers.
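
For example (in Python, since that's your SDK), the options would look
something like the following. The project/zone values are placeholders, and
I'm going from memory on the exact flag names, so double-check them against
your SDK version:

    from apache_beam.options.pipeline_options import PipelineOptions

    # Placeholder project/region/zone values; flag names from memory.
    options = PipelineOptions([
        "--runner=DataflowRunner",
        "--project=my-project",
        "--region=us-central1",
        # Option 1: attach SSDs so local checkpointing is faster.
        "--worker_disk_type=compute.googleapis.com/projects/my-project/"
        "zones/us-central1-a/diskTypes/pd-ssd",
        # Option 2: offload shuffle/state to the Dataflow service instead.
        "--enable_streaming_engine",
    ])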

On Tue, Jul 14, 2020 at 9:50 AM Mark Kelly <[email protected]> wrote:

> We’re currently developing a streaming Dataflow pipeline using the latest
> version of the Python Beam SDK.
>
> The pipeline does a number of transformations/aggregations before
> attempting to write to BigQuery. We're peaking at ~250 elements/sec going
> into the WriteToBigQuery step; however, we're seeing very poor performance
> in the pipeline, needing to scale to a considerable number of workers and
> often seeing the entire pipeline 'freeze', with throughput dropping to
> zero at all stages for ~30-minute periods.
>
> The number of unacked messages keeps growing (so it looks like the
> pipeline could never catch up). The wall time on the WriteToBQ steps is
> considerably higher than on the rest of the stages in the pipeline.
>
> If we run another version of the Dataflow job with the WriteToBigQuery
> step removed, performance is *dramatically* improved. System lag is
> minimal, and approximately a third of the vCPUs are required to keep on
> top of the incoming messages.
>
> Are there known limitations with WriteToBigQuery in the Python SDK? We
> have had our quota raised by Google, so limits on streaming inserts
> shouldn’t be an issue.
>
