Hi Mark - we faced a similar problem at my company (we sink our data into BigQuery first from Dataflow and then run a PSQL COPY operation to import in bulk). Moving to bulk can help, but the bottleneck is more likely your MySQL instance itself, so you might have to put your DBA hat on. The I/O itself is not too bad - table constraints and indexes will generally be the main source of slowdowns. This might not be feasible for you, but we actually disable constraints/keys before running a batch import and then reapply them after the insert is complete.

Re: Pub/Sub - yes, you could decouple the fate of the Dataflow job from the MySQL ingest task by writing to Pub/Sub as you suggest. You'll still face the same write-rate ceiling on the database side, but switching to a queue-based approach has its advantages.

A few rough sketches below - all untested, so treat them as starting points rather than drop-in code.
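First, Yomal's batch-size suggestion. This is roughly what the write transform looks like with the connector you linked; I'm going from memory on the argument names (batch_size in particular), so double-check against its README:

    import apache_beam as beam
    from beam_mysql.connector.io import WriteToMySQL

    # Larger batches amortize per-statement overhead. Host/database/table
    # values are placeholders for your setup.
    write_to_mysql = WriteToMySQL(
        host="your-mysql-host",
        database="your_db",
        table="datapoints",
        user="writer",
        password="change-me",
        port=3306,
        batch_size=5000,  # experiment - bigger is not always better
    )

    with beam.Pipeline() as p:
        (p
         | "Source" >> beam.Create([{"id": 1, "value": 0.5}])  # stand-in for your pipeline
         | "WriteToMySQL" >> write_to_mysql)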
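Second, the shape of the disable/re-enable trick, as a one-off maintenance script that wraps the bulk load. PyMySQL is just what I reached for here - any MySQL driver works. One gotcha: ALTER TABLE ... DISABLE KEYS only helps for non-unique indexes on MyISAM tables, so for InnoDB the session flags below are the usual levers:

    import pymysql

    # Placeholder credentials - fill in your own.
    conn = pymysql.connect(host="your-mysql-host", user="writer",
                           password="change-me", database="your_db")
    try:
        with conn.cursor() as cur:
            # Relax integrity checks for the duration of the bulk load.
            cur.execute("SET SESSION foreign_key_checks = 0")
            cur.execute("SET SESSION unique_checks = 0")

            # ... bulk import goes here (LOAD DATA INFILE, batched INSERTs, etc.) ...

            # Reapply the checks once the load is complete.
            cur.execute("SET SESSION unique_checks = 1")
            cur.execute("SET SESSION foreign_key_checks = 1")
        conn.commit()
    finally:
        conn.close()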
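Finally, the publish side of the queue-based approach is small in Beam - the real work moves to the consumer that pulls from the subscription and bulk-inserts into MySQL at whatever rate the DB tolerates. One caveat: WriteToPubSub expects bytes and, as far as I know, is only supported in streaming pipelines on Dataflow, so a pure batch pipeline would need a different publish path. The topic name below is a placeholder:

    import json
    import apache_beam as beam

    with beam.Pipeline() as p:
        (p
         | "Source" >> beam.Create([{"id": 1, "value": 0.5}])  # stand-in for your pipeline
         | "Serialize" >> beam.Map(lambda dp: json.dumps(dp).encode("utf-8"))
         | "Publish" >> beam.io.WriteToPubSub(topic="projects/your-project/topics/datapoints"))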
Best,
-C

On Wed, May 18, 2022 at 9:11 AM Yomal de Silva <yomal.prav...@gmail.com> wrote:

> Hi Mark,
>
> Writing to the DB in bulk would be the first step. Have you looked into
> writing to the DB with a larger batch size? I believe mysql-beam-connector
> also supports this.
>
> On Wed, May 18, 2022 at 2:13 AM Mark Striebeck <mark.strieb...@gmail.com> wrote:
>
>> Hi,
>>
>> We have a data pipeline that produces ~400M datapoints each day. If we
>> run it without storing, it finishes in a little over an hour. If we run it
>> and store the datapoints in a MySQL database, it takes several hours.
>>
>> We are running on GCP Dataflow; the MySQL instances are hosted GCP
>> instances. We are using mysql-beam-connector
>> <https://github.com/esakik/beam-mysql-connector>.
>>
>> The pipeline writes ~5000 datapoints per second.
>>
>> A couple of questions:
>>
>> - Does this throughput sound reasonable, or could it be significantly
>> improved by optimizing the database?
>> - The pipeline runs several workers to write this out - and because it's
>> a write operation, they contend for write access. Is it better to write
>> out through just one worker and one connection?
>> - Is it actually faster to write from the pipeline to Pub/Sub or Kafka
>> or such, and have a client on the other side which then writes in bulk?
>>
>> Thanks for any ideas or pointers (no, I'm by no means an experienced
>> DBA!!!)
>>
>> Mark

--
Christian Battista (he/him)
Senior Engineer II, Tech Lead
BenchSci | www.benchsci.com
E: christ...@benchsci.com