Re: Python SDK ReadFromKafka: Timeout expired while fetching topic metadata

2020-07-14 Thread Kamil Wasilewski
Never mind, I found this thread on the user list: https://lists.apache.org/thread.html/raeb69afbd820fdf32b3cf0a273060b6b149f80fa49c7414a1bb60528%40%3Cuser.beam.apache.org%3E, which answers my question. On Mon, Jul 13, 2020 at 4:10 PM Kamil Wasilewski < kamil.wasilew...@polidea.com> wrote: > I'd like

WriteToBigQuery - performance issues?

2020-07-14 Thread Mark Kelly
We’re currently developing a streaming Dataflow pipeline using the latest version of the Python Beam SDK. The pipeline does a number of transformations/aggregations before attempting to write to BigQuery. We're peaking at ~250 elements/sec going into the WriteToBigQuery step; however, we're s

Re: WriteToBigQuery - performance issues?

2020-07-14 Thread Jeff Klukas
In my experience with writing to BQ via BigQueryIO in the Java SDK, the bottleneck tends to be disk I/O. The BigQueryIO logic requires several shuffles that cause checkpointing even in the case of streaming inserts, which in the Dataflow case means writing to disk. I assume the Python logic is simi

Re: WriteToBigQuery - performance issues?

2020-07-14 Thread Mark Kelly
Having tested both with and without the streaming engine option, I’m not seeing any difference in performance. As it happens, I’m seeing more underlying gRPC errors when using the streaming-engine option, so I have avoided it in the last few test runs (although not sure if these errors are problem

Re: WriteToBigQuery - performance issues?

2020-07-14 Thread Jeff Klukas
In particular, the GCE docs have a nice reference for how I/O throughput depends on both vCPU count and disk type/size: https://cloud.google.com/compute/docs/disks/performance#cpu_count_size That should help you choose which configurations to test. On Tue, Jul 14, 2020 at 10:18 AM Mark Kelly wr
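
For anyone wanting to compare configurations along those lines, here is a minimal sketch that enumerates worker settings to try. It assumes the Dataflow runner flags `--worker_machine_type`, `--worker_disk_type` and `--disk_size_gb`; the candidate values, project, zone and helper names are hypothetical placeholders, not taken from this thread.

```python
from itertools import product

# Hypothetical candidate values: replace with the machine types and disks
# you actually want to compare.
MACHINE_TYPES = ["n1-standard-4", "n1-standard-8"]
DISK_TYPES = ["pd-standard", "pd-ssd"]
DISK_SIZES_GB = [50, 200]


def disk_type_url(project, zone, disk_type):
    """Build the resource-style value assumed for --worker_disk_type."""
    return (
        f"compute.googleapis.com/projects/{project}"
        f"/zones/{zone}/diskTypes/{disk_type}"
    )


def candidate_flag_sets(project, zone):
    """Yield one list of pipeline flags per machine/disk/size combination."""
    for machine, disk, size in product(MACHINE_TYPES, DISK_TYPES, DISK_SIZES_GB):
        yield [
            f"--worker_machine_type={machine}",
            f"--worker_disk_type={disk_type_url(project, zone, disk)}",
            f"--disk_size_gb={size}",
        ]


if __name__ == "__main__":
    # "my-project" and "us-central1-a" are placeholders.
    for flags in candidate_flag_sets("my-project", "us-central1-a"):
        print(" ".join(flags))
```

Each printed line could be appended to the usual pipeline launch command, with throughput at the BigQuery write step compared across runs.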

Re: WriteToBigQuery - performance issues?

2020-07-14 Thread Mark Kelly
Thanks; however, in this case it looks like the issue may be elsewhere. I’ve switched to SSD and to instance types with more vCPUs, and I’m still seeing the same behaviour: a burst of throughput at the start, then all CPUs are maxed. Looking at the instance monitoring, disk I/O look