Hi,

We have a data pipeline that produces ~400M datapoints each day. If we run
it without storing the datapoints, it finishes in a little over an hour. If
we run it and store the datapoints in a MySQL database, it takes several
hours.

We are running on GCP Dataflow, and the MySQL instances are hosted on GCP.
We are using the beam-mysql-connector
<https://github.com/esakik/beam-mysql-connector>.
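
For readers unfamiliar with the connector, writing through it looks roughly
like the minimal sketch below (host, credentials, table, element format, and
batch_size are placeholders; the WriteToMySQL usage follows the connector's
README as I understand it):

import apache_beam as beam
from beam_mysql.connector.io import WriteToMySQL

# All connection details below are placeholders.
write_to_mysql = WriteToMySQL(
    host="10.0.0.1",      # private IP of the MySQL instance
    database="metrics",
    table="datapoints",
    user="pipeline",
    password="secret",
    port=3306,
    batch_size=1000,      # rows buffered per insert batch
)

with beam.Pipeline() as p:
    (
        p
        # Stand-in for the real source of ~400M datapoints/day.
        | "Produce datapoints" >> beam.Create([{"id": 1, "value": 0.5}])
        | "Write to MySQL" >> write_to_mysql
    )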

The pipeline writes ~5000 datapoints per second.

A few questions:

   - Does this throughput sound reasonable, or could it be significantly
   improved by optimizing the database?
   - The pipeline runs several workers to write this out, and because it's
   a write operation they contend for write access. Is it better to write
   out through just one worker and one connection?
   - Would it actually be faster to write from the pipeline to Pub/Sub or
   Kafka and have a client on the other side that then writes to MySQL in
   bulk (see the sketch below)?
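
To make the last question concrete, below is a rough sketch of the kind of
consumer I have in mind: a plain synchronous Pub/Sub pull loop that inserts
each pulled batch with a single executemany and one commit. The project,
subscription, table, row format, and MySQL credentials are all placeholders,
and it assumes google-cloud-pubsub and mysql-connector-python:

import json

import mysql.connector
from google.cloud import pubsub_v1

# Placeholders: project, subscription, and MySQL connection details.
subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("my-project", "datapoints-sub")

conn = mysql.connector.connect(
    host="10.0.0.1", user="pipeline", password="secret", database="metrics"
)

INSERT_SQL = "INSERT INTO datapoints (id, value) VALUES (%s, %s)"

while True:
    # Pull up to 1000 messages in one synchronous request.
    response = subscriber.pull(
        request={"subscription": subscription, "max_messages": 1000}
    )
    if not response.received_messages:
        continue

    # Decode each message into a row tuple; the payload format is made up.
    rows = []
    for received in response.received_messages:
        point = json.loads(received.message.data)
        rows.append((point["id"], point["value"]))

    # Insert the whole pulled batch in one executemany, then one commit.
    cursor = conn.cursor()
    cursor.executemany(INSERT_SQL, rows)
    conn.commit()
    cursor.close()

    # Ack only after the rows are committed.
    subscriber.acknowledge(
        request={
            "subscription": subscription,
            "ack_ids": [m.ack_id for m in response.received_messages],
        }
    )

The idea would be that one consumer (or a small number of them) commits
large batches, instead of many Dataflow workers each committing small ones.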

Thanks for any ideas or pointers (no, I'm by no means an experienced DBA!!!)

     Mark
