Hi everyone, I am currently working on implementing a custom I/O connector using Apache Beam, specifically aiming to run it on Google Cloud Dataflow. Here’s a quick overview of my workflow:
I receive an event via Pub/Sub that carries a pointer to an Avro file stored in a GCS bucket. My goal is to process these files in a streaming pipeline on Dataflow, while having autoscaling react to the backlog of GCS/Avro data rather than to the Pub/Sub subscription's metrics.

My current approach is to implement an *Unbounded Pub/Sub Source* that wraps a *Bounded GCS/Avro Source*, and to build the checkpoint marks from metadata of both the Avro file and the Pub/Sub message. However, I'm struggling with implementing the checkpointing mechanism, and I'm unsure of the best way to structure the project to get started. Could anyone provide guidance, suggestions, or examples? I'd really appreciate any help you can offer. Thank you very much!

Best regards,
Francesco
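P.S. To make the question more concrete, here is a rough sketch (in Java, against Beam's `UnboundedSource.CheckpointMark` interface) of the kind of checkpoint mark I have in mind. The class name, field names, and the acknowledgement strategy in `finalizeCheckpoint()` are just my own placeholders, not anything from the Beam or Dataflow APIs:

```java
import java.io.IOException;
import java.io.Serializable;
import java.util.List;
import org.apache.beam.sdk.io.UnboundedSource;

/**
 * Sketch of a checkpoint mark for an unbounded source that reads Avro files
 * announced over Pub/Sub. All names here are my own placeholders.
 *
 * Because it is Serializable, the enclosing UnboundedSource could return
 * SerializableCoder.of(GcsAvroCheckpointMark.class) from
 * getCheckpointMarkCoder().
 */
public class GcsAvroCheckpointMark
    implements UnboundedSource.CheckpointMark, Serializable {

  // Ack IDs of Pub/Sub messages whose referenced Avro files have been fully
  // read; they should only be acknowledged once the checkpoint is committed.
  private final List<String> ackIdsForCompletedFiles;

  // GCS path of the Avro file currently being read (null if idle).
  private final String currentFile;

  // Records already emitted from the current file, so a reader restored from
  // this checkpoint can skip them instead of producing duplicates.
  private final long recordsEmittedFromCurrentFile;

  public GcsAvroCheckpointMark(
      List<String> ackIdsForCompletedFiles,
      String currentFile,
      long recordsEmittedFromCurrentFile) {
    this.ackIdsForCompletedFiles = ackIdsForCompletedFiles;
    this.currentFile = currentFile;
    this.recordsEmittedFromCurrentFile = recordsEmittedFromCurrentFile;
  }

  @Override
  public void finalizeCheckpoint() throws IOException {
    // Invoked by the runner after the checkpoint has been durably committed.
    // This is where I planned to acknowledge the Pub/Sub messages in
    // ackIdsForCompletedFiles so they are not redelivered.
  }
}
```

The idea is that the Pub/Sub side is only acknowledged in `finalizeCheckpoint()`, while the Avro-side fields would let a restored reader resume mid-file; it's exactly this resume logic in the reader that I'm unsure how to implement.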