Apache Beam provides a built-in mechanism specifically for managing per-key-and-window state that persists across workers and pipeline restarts. Is there any reason you cannot use state and timers? See https://beam.apache.org/documentation/programming-guide/#state-and-timers
On Fri, Apr 25, 2025 at 8:45 AM Shaochen Bai <shaoc...@kisi.io> wrote:

> Hi all,
>
> I’m working on an online Apache Beam streaming pipeline where I need to
> store, read, and modify values across different windowed data, including
> across pipeline restarts.
>
> To handle this, I’m currently using *Google Cloud Bigtable* as my
> persistent storage backend. In my implementation:
>
> - I initialize a BigtableDataClient in the @Setup method of a DoFn
> - I use this client within processElement to read and write to Bigtable
>
> However, I’ve noticed that this setup may lead to increased thread and
> memory usage, especially when many DoFn instances are created in parallel.
>
> I’d really appreciate your input on a few questions:
>
> 1. *Is using an external store like Bigtable the recommended approach to
>    persist state across windows (and restarts)?*
> 2. *Are there optimizations or best practices for managing Bigtable
>    connections efficiently in this context?*
>    - e.g., connection pooling, limiting client creation, or Beam-native
>      alternatives for external state?
>
> Any advice would be greatly appreciated.
>
> Thanks in advance!
>
> ---
> This email is confidential/privileged. If you're not the intended
> recipient, please delete it and notify us immediately; please do not
> copy/use/disclose it for any purpose, to anyone. Thank you!
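
On question 2 (limiting client creation): if you do stay with an external store, one common pattern is to share a single client per worker JVM instead of building one per DoFn instance. Below is a minimal sketch of that idea using only the Java standard library. SharedClient and FakeClient are hypothetical names for illustration, not Beam or Bigtable APIs; FakeClient stands in for a real client such as BigtableDataClient, and the sketch assumes the real client is safe to share across threads.

```java
import java.util.function.Supplier;

/** Hypothetical helper: shares one client per JVM (worker process) with reference counting. */
final class SharedClient<T extends AutoCloseable> {
    private final Supplier<T> factory;
    private T client;      // guarded by "this"
    private int refCount;  // guarded by "this"

    SharedClient(Supplier<T> factory) {
        this.factory = factory;
    }

    /** Intended for @Setup: creates the client on first use, otherwise reuses it. */
    synchronized T acquire() {
        if (refCount == 0) {
            client = factory.get();
        }
        refCount++;
        return client;
    }

    /** Intended for @Teardown: closes the client once the last user has released it. */
    synchronized void release() {
        if (refCount == 0) {
            throw new IllegalStateException("release() without matching acquire()");
        }
        refCount--;
        if (refCount == 0) {
            T toClose = client;
            client = null;
            try {
                toClose.close();
            } catch (Exception e) {
                throw new IllegalStateException("failed to close shared client", e);
            }
        }
    }
}

/** Hypothetical stand-in for a real client; it only counts how many instances were created. */
final class FakeClient implements AutoCloseable {
    static int created = 0;
    boolean closed = false;

    FakeClient() {
        created++;
    }

    @Override
    public void close() {
        closed = true;
    }
}
```

In a DoFn, the holder would live in a static field (static fields are shared by all DoFn instances in the same JVM), with acquire() called from @Setup and release() from @Teardown. Note that Beam treats @Teardown as best-effort, so the final close is not guaranteed to run on every worker shutdown.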