Hi all,

I’m working on an Apache Beam streaming pipeline where I need to store,
read, and modify values across different windows — including across
pipeline restarts.

To handle this, I’m currently using Google Cloud Bigtable as my persistent 
storage backend. In my implementation:

- I initialize a BigtableDataClient in the @Setup method of a DoFn.

- I use this client within processElement to read from and write to Bigtable.

However, I’ve noticed that this setup can lead to high thread and memory
usage, since each DoFn instance creates its own client (with its own
connections and threads) when many instances run in parallel.
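Roughly, the relevant DoFn looks like this (a simplified sketch of my current setup; the project, instance, table, and column-family IDs are placeholders):

```java
import com.google.cloud.bigtable.data.v2.BigtableDataClient;
import com.google.cloud.bigtable.data.v2.models.Row;
import com.google.cloud.bigtable.data.v2.models.RowMutation;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Simplified sketch of the current setup described above.
public class BigtableBackedFn extends DoFn<KV<String, Long>, Void> {

  private transient BigtableDataClient client;

  @Setup
  public void setup() throws Exception {
    // One client per DoFn instance -- this is where the thread/memory
    // overhead seems to come from when many instances run in parallel.
    client = BigtableDataClient.create("my-project", "my-instance");
  }

  @ProcessElement
  public void processElement(@Element KV<String, Long> element) {
    // Read the prior state for this key, then write the updated value back.
    Row row = client.readRow("state-table", element.getKey());
    client.mutateRow(
        RowMutation.create("state-table", element.getKey())
            .setCell("cf", "value", element.getValue()));
  }

  @Teardown
  public void teardown() {
    if (client != null) {
      client.close();
    }
  }
}
```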

I’d really appreciate your input on a few questions:

1. Is using an external store like Bigtable the recommended approach to
persist state across windows (and restarts)?

2. Are there optimizations or best practices for managing Bigtable
connections efficiently in this context (e.g., connection pooling or
limiting client creation)?

3. Are there Beam-native alternatives for external state?
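For context, the Beam-native mechanism I had in mind is the stateful DoFn. A minimal sketch (names are illustrative; as I understand it, this state is scoped per key and window, and its durability across restarts depends on the runner):

```java
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Sketch of a stateful DoFn: the runner manages per-key-and-window state,
// so no external client is needed for this part.
public class RunningTotalFn extends DoFn<KV<String, Long>, KV<String, Long>> {

  @StateId("total")
  private final StateSpec<ValueState<Long>> totalSpec =
      StateSpecs.value(VarLongCoder.of());

  @ProcessElement
  public void processElement(
      @Element KV<String, Long> element,
      @StateId("total") ValueState<Long> total,
      OutputReceiver<KV<String, Long>> out) {
    Long previous = total.read(); // null on the first element for this key
    long updated = (previous == null ? 0L : previous) + element.getValue();
    total.write(updated);
    out.output(KV.of(element.getKey(), updated));
  }
}
```

My understanding is that this state is not readable across different windows, which is why I reached for Bigtable in the first place — but I may be missing something.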

Any advice would be greatly appreciated.

Thanks in advance!