In particular, the GCE docs have a nice reference for how I/O throughput depends on both vCPU count and disk type/size:
https://cloud.google.com/compute/docs/disks/performance#cpu_count_size

That should help you choose which configurations to test.

On Tue, Jul 14, 2020 at 10:18 AM Mark Kelly <[email protected]> wrote:

> Having tested with both the streaming engine option and without, I’m not
> seeing any difference in performance.
>
> As it happens, I’m seeing more underlying gRPC errors when using the
> streaming-engine option, so have avoided it in the last few test runs
> (although I’m not sure whether these errors are problematic).
>
> I’ll definitely give the SSD option a shot.
>
> Thanks.
>
> --
> Mark Kelly
> Sent with Airmail
>
> On 14 July 2020 at 15:12:46, Jeff Klukas ([email protected]) wrote:
>
> In my experience with writing to BQ via BigQueryIO in the Java SDK, the
> bottleneck tends to be disk I/O. The BigQueryIO logic requires several
> shuffles that cause checkpointing even in the case of streaming inserts,
> which in the Dataflow case means writing to disk. I assume the Python logic
> is similar, but don't know for sure.
>
> If this is the case for you, you may see significantly improved
> performance by provisioning SSDs for your workers or by opting to use the
> Dataflow streaming engine.
>
> On Tue, Jul 14, 2020 at 9:50 AM Mark Kelly <[email protected]> wrote:
>
>> We’re currently developing a streaming Dataflow pipeline using the latest
>> version of the Python Beam SDK.
>>
>> The pipeline does a number of transformations/aggregations before
>> attempting to write to BigQuery. We're peaking at ~250 elements/sec going
>> into the WriteToBigQuery step; however, we're seeing very poor performance
>> in the pipeline, needing to scale to a considerable number of workers, and
>> often seeing the entire pipeline 'freeze' with throughput dropping to zero
>> at all stages for ~30-minute periods.
>>
>> The number of unacked messages keeps growing (so it looks like the
>> pipeline can never catch up). The wall time on the WriteToBQ steps is
>> considerably higher than the rest of the stages in the pipeline.
>>
>> If we run another version of the Dataflow job with the WriteToBigQuery
>> step removed, performance is *dramatically* improved. System lag is
>> minimal and approximately 1/3 of the number of vCPUs is required to keep
>> on top of the incoming messages.
>>
>> Are there known limitations with WriteToBigQuery in the Python SDK? We
>> have had our quota raised by Google, so limits on streaming inserts
>> shouldn’t be an issue.
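For anyone following along: the SSD and streaming-engine suggestions above are controlled through Dataflow pipeline options at launch time, not through pipeline code. A minimal sketch of the relevant flags, assuming the standard Python SDK worker options; the project/zone in the disk-type URL and the 100 GB size are placeholder values, not tuned recommendations:

```python
import argparse

# Worker options that affect disk I/O on Dataflow (placeholder values;
# flag names assumed to match the Python SDK's worker options).
flags = [
    # SSD persistent disks instead of the default standard PDs:
    "--worker_disk_type=compute.googleapis.com/projects/my-project/"
    "zones/us-central1-a/diskTypes/pd-ssd",
    # Larger persistent disks get higher throughput (see the GCE docs above):
    "--disk_size_gb=100",
    # Alternatively, offload shuffle/state to the service:
    "--enable_streaming_engine",
]

# Parse the flags here only to show the values they carry; in a real job
# they would be passed on the command line when launching the pipeline.
parser = argparse.ArgumentParser()
parser.add_argument("--worker_disk_type")
parser.add_argument("--disk_size_gb", type=int)
parser.add_argument("--enable_streaming_engine", action="store_true")
opts = parser.parse_args(flags)
print(opts.disk_size_gb)  # 100
```

You'd typically append these to the usual `--runner=DataflowRunner --project=... --region=... --streaming` invocation and compare runs with and without each option.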
