Never mind, I found this thread on the user list:
https://lists.apache.org/thread.html/raeb69afbd820fdf32b3cf0a273060b6b149f80fa49c7414a1bb60528%40%3Cuser.beam.apache.org%3E,
which answers my question.
On Mon, Jul 13, 2020 at 4:10 PM Kamil Wasilewski <kamil.wasilew...@polidea.com> wrote:
> I'd like
We're currently developing a streaming Dataflow pipeline using the latest
version of the Python Beam SDK. The pipeline does a number of
transformations/aggregations before attempting to write to BigQuery. We're
peaking at ~250 elements/sec going into the WriteToBigQuery step; however,
we're s…
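For concreteness, here is a minimal sketch of the kind of write step being described, assuming a Pub/Sub source and the streaming-inserts path; the table, schema, subscription, and parsing logic are placeholders, not the original pipeline's:

    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Hypothetical table and schema, for illustration only.
    TABLE = 'my-project:my_dataset.events'
    SCHEMA = 'user:STRING,score:INTEGER,ts:TIMESTAMP'

    options = PipelineOptions(['--streaming'])

    with beam.Pipeline(options=options) as p:
        (p
         | 'Read' >> beam.io.ReadFromPubSub(
             subscription='projects/my-project/subscriptions/events')
         # Pub/Sub delivers bytes; decode into row dicts for BigQuery.
         | 'Parse' >> beam.Map(lambda msg: json.loads(msg.decode('utf-8')))
         | 'Write' >> beam.io.WriteToBigQuery(
             TABLE,
             schema=SCHEMA,
             method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS,
             create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))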
In my experience with writing to BQ via BigQueryIO in the Java SDK, the
bottleneck tends to be disk I/O. The BigQueryIO logic requires several
shuffles that cause checkpointing even in the case of streaming inserts,
which in the Dataflow case means writing to disk. I assume the Python logic
is similar.
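To make the "shuffles that cause checkpointing" point concrete, here is a toy sketch of the pattern (not BigQueryIO's actual code): records are spread over random shard keys and passed through a Reshuffle, which is a fusion break the runner must checkpoint across. On Dataflow without Streaming Engine, that checkpoint lands on the workers' persistent disks.

    import random
    import apache_beam as beam

    NUM_SHARDS = 50  # hypothetical shard count

    with beam.Pipeline() as p:
        (p
         | beam.Create([{'user': 'a', 'score': 1}])
         # Spread rows across random shard keys to bound per-worker batches.
         | beam.Map(lambda row: (random.randint(0, NUM_SHARDS - 1), row))
         # Fusion break: the runner checkpoints elements at this boundary.
         | beam.Reshuffle()
         | beam.Map(lambda kv: kv[1]))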
Having tested both with the Streaming Engine option and without, I'm not
seeing any difference in performance.
As it happens, I'm seeing more underlying gRPC errors when using the
streaming-engine option, so I have avoided it in the last few test runs
(although I'm not sure whether these errors are problematic).
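For anyone reproducing the comparison, the two configurations differ in a single flag; a sketch assuming the `--enable_streaming_engine` option available in recent SDKs (project, region, and bucket values are placeholders):

    from apache_beam.options.pipeline_options import PipelineOptions

    common = [
        '--runner=DataflowRunner',
        '--project=my-project',           # placeholder
        '--region=europe-west1',          # placeholder
        '--temp_location=gs://my-bucket/tmp',
        '--streaming',
    ]

    # Shuffle/state handled by the Dataflow backend service:
    with_engine = PipelineOptions(common + ['--enable_streaming_engine'])

    # Shuffle/state checkpointed on worker persistent disks:
    without_engine = PipelineOptions(common)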
In particular, the GCE docs have a nice reference for how I/O throughput
depends on both vCPU count and disk type/size:
https://cloud.google.com/compute/docs/disks/performance#cpu_count_size
That should help you choose which configurations to test.
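In Dataflow terms, the knobs from that page map onto the worker options; a sketch with placeholder values (the full resource-URL form for the SSD disk type is my assumption from the Dataflow docs):

    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions([
        '--runner=DataflowRunner',
        '--project=my-project',               # placeholder
        '--region=europe-west1',              # placeholder
        '--temp_location=gs://my-bucket/tmp',
        '--streaming',
        # More vCPUs raise the per-instance disk throughput cap.
        '--worker_machine_type=n1-standard-8',
        # Persistent-disk throughput also scales with disk size.
        '--disk_size_gb=250',
        # SSD persistent disks lift the IOPS/throughput limits further.
        '--worker_disk_type=compute.googleapis.com/projects/my-project/'
        'zones/europe-west1-b/diskTypes/pd-ssd',
    ])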
On Tue, Jul 14, 2020 at 10:18 AM Mark Kelly wrote:
Thanks; however, in this case it looks like the issue may be elsewhere.
I've switched to SSD, and to instance types with a greater number of vCPUs,
and I'm still seeing the same behaviour:
A burst of throughput at the start, then all CPUs are maxed. Looking at the
instance monitoring, disk I/O looks fine.
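Since the workers are CPU-bound rather than I/O-bound, it might be worth profiling where the cycles actually go. A sketch using the Python SDK's profiling options (the bucket path is a placeholder):

    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions([
        '--runner=DataflowRunner',
        '--project=my-project',               # placeholder
        '--region=europe-west1',              # placeholder
        '--temp_location=gs://my-bucket/tmp',
        '--streaming',
        # Sample CPU profiles on the workers and write them to GCS.
        '--profile_cpu',
        '--profile_location=gs://my-bucket/profiles',
    ])

The resulting dumps are cProfile data, so they can be pulled down from GCS and inspected locally with `python -m pstats`.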