Hi,

We have a data pipeline that produces ~400M datapoints each day. If we run it without storing them, it finishes in a little over an hour; if we run it and store the datapoints in a MySQL database, it takes several hours.
We are running on GCP Dataflow, and the MySQL instances are GCP-hosted instances. We are using beam-mysql-connector <https://github.com/esakik/beam-mysql-connector>. The pipeline writes ~5,000 datapoints per second.

A couple of questions:

- Does this throughput sound reasonable, or could it be significantly improved by optimizing the database?
- The pipeline runs several workers to write this out, and because it is a write operation they contend for write access. Is it better to write out through just one worker and one connection?
- Is it actually faster to write from the pipeline to Pub/Sub or Kafka (or similar) and have a client on the other side that then writes to MySQL in bulk?

Thanks for any ideas or pointers (no, I'm by no means an experienced DBA!!!)

Mark
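
P.S. For reference, the write step is essentially the connector's WriteToMySQL transform, wired up roughly like this (a simplified sketch; the host, database, table, credentials, and batch_size below are placeholders, not our real settings):

    import apache_beam as beam

    from beam_mysql.connector.io import WriteToMySQL

    # Placeholder connection settings; batch_size controls how many rows the
    # connector buffers into each batched INSERT.
    write_to_mysql = WriteToMySQL(
        host="10.0.0.1",        # placeholder: MySQL instance on GCP
        database="metrics",     # placeholder database name
        table="datapoints",     # placeholder table name
        user="pipeline",
        password="secret",
        port=3306,
        batch_size=1000,
    )

    with beam.Pipeline() as p:
        (
            p
            | "ProduceDatapoints" >> beam.Create([{"id": 1, "value": 42.0}])  # stand-in for the real source
            | "WriteToMySQL" >> write_to_mysql
        )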
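
P.P.S. To clarify the third question: what I have in mind is publishing each datapoint to Pub/Sub from the pipeline and running a small standalone consumer that pulls messages in batches and writes them to MySQL with multi-row INSERTs, roughly like this (project, subscription, table, and column names are made up, and real code would need timeout/error handling):

    import json

    import mysql.connector
    from google.cloud import pubsub_v1

    subscriber = pubsub_v1.SubscriberClient()
    subscription = subscriber.subscription_path("my-project", "datapoints-sub")  # made-up names

    conn = mysql.connector.connect(
        host="10.0.0.1", database="metrics", user="pipeline", password="secret"
    )
    cursor = conn.cursor()

    while True:
        # Pull up to 1000 messages at a time and insert them as one batch.
        response = subscriber.pull(request={"subscription": subscription, "max_messages": 1000})
        if not response.received_messages:
            continue
        payloads = [json.loads(m.message.data) for m in response.received_messages]
        rows = [(d["id"], d["value"]) for d in payloads]
        cursor.executemany("INSERT INTO datapoints (id, value) VALUES (%s, %s)", rows)
        conn.commit()
        # Only ack after the rows are committed.
        subscriber.acknowledge(
            request={
                "subscription": subscription,
                "ack_ids": [m.ack_id for m in response.received_messages],
            }
        )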