Hi all, I'm working on an Apache Beam streaming pipeline where I need to store, read, and modify values across different windows of data, and have that state survive pipeline restarts.
To handle this, I'm currently using Google Cloud Bigtable as my persistent storage backend. In my implementation:

- I initialize a BigtableDataClient in the @Setup method of a DoFn.
- I use this client within processElement to read from and write to Bigtable.

However, I've noticed that this setup can lead to increased thread and memory usage, especially when many DoFn instances are created in parallel.

I'd really appreciate your input on a few questions:

1. Is using an external store like Bigtable the recommended approach to persisting state across windows (and restarts)?
2. Are there optimizations or best practices for managing Bigtable connections efficiently in this context (e.g., connection pooling, limiting client creation, or Beam-native alternatives for external state)?

Any advice would be greatly appreciated. Thanks in advance!
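For context on the "limiting client creation" part of my question: one pattern I've been considering is sharing a single client per worker JVM via a reference-counted holder, so @Setup acquires and @Teardown releases instead of each DoFn instance creating its own connection. Below is a minimal, self-contained sketch of that idea; the `Client` class here is just a stand-in for the real `BigtableDataClient`, and the class and method names are my own invention, not a Beam or Bigtable API.

```java
// Sketch: share one expensive client across all DoFn instances in a worker JVM,
// instead of creating one per DoFn in @Setup. "Client" is a placeholder for
// com.google.cloud.bigtable.data.v2.BigtableDataClient.
class SharedClient {
    // Stand-in for the real client so the sketch is self-contained.
    static class Client implements AutoCloseable {
        boolean closed = false;
        @Override public void close() { closed = true; }
    }

    private static Client instance;
    private static int refCount = 0;

    // Called from @Setup: creates the client at most once per JVM.
    static synchronized Client acquire() {
        if (instance == null) {
            instance = new Client(); // real code: BigtableDataClient.create(settings)
        }
        refCount++;
        return instance;
    }

    // Called from @Teardown: closes the client only when the last DoFn releases it.
    static synchronized void release() {
        refCount--;
        if (refCount == 0 && instance != null) {
            instance.close();
            instance = null;
        }
    }
}
```

I'm unsure whether this is idiomatic for Beam workers, or whether the Bigtable client's own internal channel pooling already makes per-DoFn clients cheap enough, so corrections welcome.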