Re: [Question] Best Practices for Managing Persistent State with Bigtable in Streaming Beam Pipelines

XQ Hu via user Mon, 28 Apr 2025 07:55:00 -0700

Please check
https://cloud.google.com/dataflow/docs/guides/updating-a-pipeline#state-data
and https://cloud.google.com/dataflow/docs/guides/updating-a-pipeline#CCheck


https://beam.apache.org/blog/stateful-processing/ has more details about
how to use stateful DoFn.

The state persists. And you can update your jobs as long as it passes
the compatibility check.

On Mon, Apr 28, 2025 at 5:12 AM Shaochen Bai <shaoc...@kisi.io> wrote:

> Hello,
>
> Thank you for your response. I was not aware that state in Apache Beam
> persists across different jobs — there seem to be very few open resources
> discussing this. Here is one of the few I found.
>
> I do have some concerns regarding state management:
>
> 1. Does the state persist if the pipeline gets stuck and we have to cancel
> or force-cancel the job?
>
> 2. Does the state persist if we modify the structure of the pipeline and
> use the state in a different DoFn?
>
> 3. It appears that we need to specify the persistent disk size when
> deploying the pipeline. Since we may need to scale the disk size as the
> state grows, will all existing state persist correctly after scaling?
>
> Since we do not have a clear understanding of the state persistence
> mechanism and its expected behavior, we are hesitant to adopt it fully. If
> you could point me to any public references or resources on this topic, I
> would greatly appreciate it.
>
> Thank you again for your help.
>
> Best regards,
>
>
> Reference:
> [image: apple-touch-i...@2.png]
>
> Dataflow - State persistence specs
> <https://stackoverflow.com/questions/69835743/dataflow-state-persistence-specs>
> stackoverflow.com
> <https://stackoverflow.com/questions/69835743/dataflow-state-persistence-specs>
>
> <https://stackoverflow.com/questions/69835743/dataflow-state-persistence-specs>
>
>
> On 25 Apr 2025, at 17:12, XQ Hu via user <user@beam.apache.org> wrote:
>
> Apache Beam provides a built-in mechanism specifically for managing
> per-key-and-window state that persists across workers and pipeline
> restarts. Is there anything you can not use
> https://beam.apache.org/documentation/programming-guide/#state-and-timers
> <https://www.google.com/url?q=https://beam.apache.org/documentation/programming-guide/%23state-and-timers&source=gmail-imap&ust=1746198771000000&usg=AOvVaw0zLj4Td0V5wpPSGYig8lTf>
> ?
>
> On Fri, Apr 25, 2025 at 8:45 AM Shaochen Bai <shaoc...@kisi.io> wrote:
>
>> Hi all,
>>
>> I’m working on an online Apache Beam streaming pipeline where I need to
>> store, read, and modify values across different windowed data — including
>> across pipeline restarts.
>>
>> To handle this, I’m currently using *Google Cloud Bigtable* as my
>> persistent storage backend. In my implementation:
>>
>>    -
>>
>>    I initialize a BigtableDataClient in the @Setup method of a DoFn
>>    -
>>
>>    I use this client within processElement to read and write to Bigtable
>>
>> However, I’ve noticed that this setup may lead to increased thread and
>> memory usage, especially when many DoFn instances are created in
>> parallel.
>>
>> I’d really appreciate your input on a few questions:
>>
>>    1.
>>
>>    *Is using an external store like Bigtable the recommended approach to
>>    persist state across windows (and restarts)?*
>>    2.
>>
>>    *Are there optimizations or best practices for managing Bigtable
>>    connections efficiently in this context?*
>>    -
>>
>>       e.g., connection pooling, limiting client creation, or Beam-native
>>       alternatives for external state?
>>
>> Any advice would be greatly appreciated
>>
>> Thanks in advance!
>>
>> ---
>> This email is confidential/privileged. If you're not the intended
>> recipient, please delete it and notify us immediately; please do not
>> copy/use/disclose it for any purpose, to anyone. Thank you!
>>
>
>
> ---
> This email is confidential/privileged. If you're not the intended
> recipient, please delete it and notify us immediately; please do not
> copy/use/disclose it for any purpose, to anyone. Thank you!
>

Re: [Question] Best Practices for Managing Persistent State with Bigtable in Streaming Beam Pipelines

Reply via email to