Apache Beam provides a built-in mechanism specifically for managing
per-key-and-window state that persists across workers and pipeline
restarts. Is there any reason you cannot use state and timers
(https://beam.apache.org/documentation/programming-guide/#state-and-timers)?
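
For illustration, here is a minimal sketch of what a stateful DoFn could
look like in the Java SDK. The class name RunningTotalFn, the state id
"total", the helper addToTotal, and the KV<String, Long> element type are
all made up for this example and are not taken from your pipeline. Keep in
mind that state is scoped per key and window, so for a value to carry over
from one window to the next, the stateful step would need to run in the
global window.

```java
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Hypothetical example: keeps a running total per key in Beam-managed state
// instead of reading and writing an external store on every element.
class RunningTotalFn extends DoFn<KV<String, Long>, KV<String, Long>> {

  // Declares one Long cell of state per key and window.
  @StateId("total")
  private final StateSpec<ValueState<Long>> totalSpec =
      StateSpecs.value(VarLongCoder.of());

  // read() returns null until a value has been written for this key and window.
  static long addToTotal(Long current, long delta) {
    return (current == null ? 0L : current) + delta;
  }

  @ProcessElement
  public void processElement(
      @Element KV<String, Long> element,
      @StateId("total") ValueState<Long> total,
      OutputReceiver<KV<String, Long>> out) {
    long updated = addToTotal(total.read(), element.getValue());
    total.write(updated);
    out.output(KV.of(element.getKey(), updated));
  }
}
```

It would be applied with ParDo.of(new RunningTotalFn()) to a keyed
PCollection<KV<String, Long>>; stateful DoFns require KV input.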

On Fri, Apr 25, 2025 at 8:45 AM Shaochen Bai <shaoc...@kisi.io> wrote:

> Hi all,
>
> I’m working on an online Apache Beam streaming pipeline where I need to
> store, read, and modify values across different windowed data — including
> across pipeline restarts.
>
> To handle this, I’m currently using *Google Cloud Bigtable* as my
> persistent storage backend. In my implementation:
>
>    - I initialize a BigtableDataClient in the @Setup method of a DoFn
>    - I use this client within processElement to read and write to Bigtable
>
> However, I’ve noticed that this setup may lead to increased thread and
> memory usage, especially when many DoFn instances are created in parallel.
>
> I’d really appreciate your input on a few questions:
>
>    1. *Is using an external store like Bigtable the recommended approach to
>       persist state across windows (and restarts)?*
>    2. *Are there optimizations or best practices for managing Bigtable
>       connections efficiently in this context?*
>       - e.g., connection pooling, limiting client creation, or Beam-native
>         alternatives for external state?
>
> Any advice would be greatly appreciated.
>
> Thanks in advance!
>
