Hi everyone,

I am currently implementing a custom I/O connector with Apache
Beam, with the goal of running it on Google Cloud Dataflow. Here’s a
quick overview of my workflow:

I receive an event via Pub/Sub that contains a pointer to an Avro file
stored in a GCS bucket. My goal is to process these files as a streaming
pipeline on Dataflow, with autoscaling driven by the GCS/Avro metadata
rather than by the Pub/Sub metadata.
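
To make that concrete, here is a rough sketch of the workflow using only the
built-in connectors (the subscription name, the Avro schema, and the assumption
that each message body is simply the gs:// path of the file are all
illustrative). As far as I understand, with this shape the backlog signal that
drives autoscaling comes from Pub/Sub, which is exactly what I'd like to avoid:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.AvroIO; // org.apache.beam.sdk.extensions.avro.io.AvroIO in newer Beam releases
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class PointerPipelineSketch {
  public static void main(String[] args) {
    StreamingOptions options =
        PipelineOptionsFactory.fromArgs(args).as(StreamingOptions.class);
    options.setStreaming(true);
    Pipeline p = Pipeline.create(options);

    // Illustrative schema of the Avro files referenced by the events.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Event\","
            + "\"fields\":[{\"name\":\"id\",\"type\":\"string\"}]}");

    PCollection<GenericRecord> records = p
        // Illustrative subscription; each event points at one Avro file on GCS.
        .apply("ReadPointerEvents",
            PubsubIO.readStrings().fromSubscription(
                "projects/my-project/subscriptions/avro-pointers"))
        // Placeholder: here I would parse the event payload and extract the
        // gs:// path of the referenced file.
        .apply("ExtractGcsPath",
            MapElements.into(TypeDescriptors.strings())
                .via((String eventPayload) -> eventPayload))
        .apply("MatchFiles", FileIO.matchAll())
        .apply("ReadMatches", FileIO.readMatches())
        .apply("ParseAvro", AvroIO.readFilesGenericRecords(schema));

    // ... downstream processing on `records` ...
    p.run();
  }
}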

My current approach is to create an *Unbounded Pub/Sub Source* that wraps a
*Bounded GCS/Avro Source*. For checkpointing, I intend to combine metadata
from both the Avro file and Pub/Sub.
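
Concretely, the checkpoint state I have in mind looks roughly like this (class
and field names are purely illustrative, not working code):

import java.io.IOException;
import java.io.Serializable;
import java.util.List;
import org.apache.beam.sdk.io.UnboundedSource;

public class PointerCheckpointMark
    implements UnboundedSource.CheckpointMark, Serializable {

  // Pub/Sub side: messages whose Avro files have been fully emitted, but which
  // must not be acked before the runner has durably committed this checkpoint.
  final List<String> ackIdsToFinalize;

  // GCS/Avro side: the file currently being read and the offset of the last
  // fully consumed Avro block, so a restarted reader can sync back to it.
  final String currentGcsPath;
  final long currentBlockOffset;

  PointerCheckpointMark(
      List<String> ackIdsToFinalize, String currentGcsPath, long currentBlockOffset) {
    this.ackIdsToFinalize = ackIdsToFinalize;
    this.currentGcsPath = currentGcsPath;
    this.currentBlockOffset = currentBlockOffset;
  }

  @Override
  public void finalizeCheckpoint() throws IOException {
    // Called by the runner once this checkpoint has been durably persisted;
    // this is where I would ack the Pub/Sub messages listed above.
  }
}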

However, I'm running into difficulties implementing the checkpointing
mechanism, and I'm also unsure how best to structure the project to get
started.
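
For what it's worth, this is the source skeleton I'm starting from (names are
placeholders; the actual UnboundedReader, which is the part I'm stuck on, is
left out):

import java.io.IOException;
import java.util.Collections;
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.coders.AvroCoder; // extensions.avro.coders in newer Beam releases
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.coders.SerializableCoder;
import org.apache.beam.sdk.io.UnboundedSource;
import org.apache.beam.sdk.options.PipelineOptions;

public class PubSubWrappedAvroSource
    extends UnboundedSource<GenericRecord, PointerCheckpointMark> {

  private final String subscription;
  private final String schemaJson; // Avro schema kept as JSON so the source stays serializable.

  public PubSubWrappedAvroSource(String subscription, String schemaJson) {
    this.subscription = subscription;
    this.schemaJson = schemaJson;
  }

  @Override
  public List<PubSubWrappedAvroSource> split(int desiredNumSplits, PipelineOptions options) {
    // Pub/Sub load-balances messages across subscribers, so returning N
    // identical copies of this source is a common way to get N parallel readers.
    return Collections.nCopies(desiredNumSplits, this);
  }

  @Override
  public UnboundedReader<GenericRecord> createReader(
      PipelineOptions options, PointerCheckpointMark checkpoint) throws IOException {
    // On a fresh start `checkpoint` is null; on resume it carries the un-acked
    // Pub/Sub ids and the file/offset to seek back to. This reader is the part
    // I still need to design, so it is omitted from the sketch.
    throw new UnsupportedOperationException("Reader not implemented yet");
  }

  @Override
  public Coder<PointerCheckpointMark> getCheckpointMarkCoder() {
    return SerializableCoder.of(PointerCheckpointMark.class);
  }

  @Override
  public Coder<GenericRecord> getOutputCoder() {
    return AvroCoder.of(new Schema.Parser().parse(schemaJson));
  }
}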

Would anyone be able to provide guidance, suggestions, or examples? I’d
really appreciate any help you can offer.

Thank you very much!

Best regards,
Francesco
