Thanks; in this case, however, it looks like the issue may lie elsewhere.
I’ve switched to SSDs and to instance types with more vCPUs, and I’m still
seeing the same behaviour:

A burst of throughput at the start, then all CPUs are maxed out. Looking
at the instance monitoring, disk I/O looks minimal: there’s a spike at the
start, but after that we’re looking at ~500 KB/s with the occasional blip.



-- 
Mark Kelly
Sent with Airmail

On 14 July 2020 at 15:28:13, Jeff Klukas ([email protected]) wrote:

In particular, the GCE docs have a nice reference for how I/O throughput
depends on both vCPU count and disk type/size:

https://cloud.google.com/compute/docs/disks/performance#cpu_count_size

That should help you choose which configurations to test.
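
For example, to test a configuration with more vCPUs and a larger disk,
you could set worker options along these lines (a sketch only; the option
names are from recent Python SDK versions and the values are placeholders,
so double-check against your setup):

from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical values purely for illustration; pick machine and disk
# sizes based on the throughput table in the GCE docs linked above.
options = PipelineOptions(
    runner='DataflowRunner',
    project='<your-project>',
    region='<your-region>',
    streaming=True,
    machine_type='n1-standard-8',  # more vCPUs raise the I/O cap
    disk_size_gb=250,              # larger PDs get higher throughput caps
)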

On Tue, Jul 14, 2020 at 10:18 AM Mark Kelly <[email protected]> wrote:

> Having tested both with the Streaming Engine option and without, I’m not
> seeing any difference in performance.
>
> As it happens, I’m seeing more underlying gRPC errors when using the
> Streaming Engine option, so I’ve avoided it in the last few test runs
> (although I’m not sure whether these errors are problematic).
>
> I’ll definitely give the SSD option a shot.
>
> Thanks.
>
> --
> Mark Kelly
> Sent with Airmail
>
> On 14 July 2020 at 15:12:46, Jeff Klukas ([email protected]) wrote:
>
> In my experience with writing to BQ via BigQueryIO in the Java SDK, the
> bottleneck tends to be disk I/O. The BigQueryIO logic requires several
> shuffles that cause checkpointing even in the case of streaming inserts,
> which in the Dataflow case means writing to disk. I assume the Python logic
> is similar, but don't know for sure.
>
> If this is the case for you, you may see significantly improved
> performance by provisioning SSDs for your workers or by opting to use the
> Dataflow streaming engine.
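>
> Concretely, that would look something like the following in the pipeline
> options (a sketch, not tested against your setup; option names are from
> recent Python SDK versions, and the project/zone values are
> placeholders):
>
> from apache_beam.options.pipeline_options import PipelineOptions
>
> options = PipelineOptions(
>     runner='DataflowRunner',
>     project='<your-project>',
>     region='<your-region>',
>     streaming=True,
>     # Attach SSD persistent disks to the workers:
>     worker_disk_type=(
>         'compute.googleapis.com/projects/<your-project>'
>         '/zones/<your-zone>/diskTypes/pd-ssd'),
>     # Or move shuffle/state off worker disks entirely:
>     enable_streaming_engine=True,
> )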
>
> On Tue, Jul 14, 2020 at 9:50 AM Mark Kelly <[email protected]> wrote:
>
>> We’re currently developing a streaming Dataflow pipeline using the latest
>> version of the Python Beam SDK.
>>
>> The pipeline does a number of transformations/aggregations before
>> attempting to write to BigQuery. We're peaking at ~250 elements/sec going
>> into the WriteToBigQuery step; however, we're seeing very poor
>> performance in the pipeline, needing to scale to a considerable number of
>> workers and often seeing the entire pipeline 'freeze', with throughput
>> dropping to zero at all stages for periods of ~30 minutes.
>>
>> The number of unacked messages keeps growing (so it looks like the
>> pipeline could never catch up). The wall time on the WriteToBQ steps is
>> considerably higher than on the rest of the stages in the pipeline.
>>
>> If we run another version of the Dataflow job with the WriteToBigQuery
>> step removed, performance is *dramatically* improved: system lag is
>> minimal, and approximately one third of the vCPUs are required to keep on
>> top of the incoming messages.
>>
>> Are there known limitations with WriteToBigQuery in the Python SDK? We
>> have had our quota raised by Google, so limits on streaming inserts
>> shouldn’t be an issue.
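>>
>> For context, a minimal sketch of the shape of the pipeline (the names,
>> table spec, and parse step are placeholders, not our actual code):
>>
>> import json
>>
>> import apache_beam as beam
>> from apache_beam.options.pipeline_options import PipelineOptions
>>
>> options = PipelineOptions(streaming=True)
>> with beam.Pipeline(options=options) as p:
>>     (p
>>      | 'Read' >> beam.io.ReadFromPubSub(
>>          subscription='projects/<project>/subscriptions/<sub>')
>>      | 'Parse' >> beam.Map(json.loads)
>>      # ... transformations/aggregations elided ...
>>      | 'WriteToBQ' >> beam.io.WriteToBigQuery(
>>          '<project>:<dataset>.<table>',
>>          create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
>>          write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
>>          method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS))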
>>
>
