Hello Beam team!

We’re currently onboarding a customer’s infrastructure to the Google Cloud 
Platform. The decision was made that one of the technologies they will use is 
Dataflow. Let me briefly describe the use case:
They have a Kafka cluster where data from a CDC source is stored. The data in 
the topics is stored in Avro format. Their other requirement is that they want 
a streaming solution reading from these Kafka topics and writing to Google 
Cloud Storage, again in Avro. What’s more, the component should be written in 
Python, since their Data Engineers heavily prefer Python over Java.

We’ve been struggling with the design of the solution for a couple of weeks 
now, and we’re in a rather unfortunate situation: we haven’t found any 
solution that fits these requirements.

So the question is: is there any existing Dataflow template/solution with the 
following specifications (a rough sketch of the pipeline shape we have in mind 
follows the list):

  *   Streaming connector
  *   Written in Python
  *   Consumes from Kafka topics
  *   Reads Avro with Schema Registry
  *   Writes Avro to GCS

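To make the requirement concrete, this is roughly the shape of the pipeline 
we would like to end up with. It is an untested sketch only: the schema, 
topic and broker names are made up, we assume ReadFromKafka’s default byte 
deserializers, and the final Avro-to-GCS step is exactly the piece we are 
missing:

    import io

    import apache_beam as beam
    import fastavro
    from apache_beam.io.kafka import ReadFromKafka
    from apache_beam.options.pipeline_options import PipelineOptions

    # Example reader schema -- in reality this would be fetched from the
    # Schema Registry using the schema id embedded in each message.
    VALUE_SCHEMA = fastavro.parse_schema({
        "type": "record", "name": "CdcEvent",
        "fields": [{"name": "id", "type": "long"},
                   {"name": "payload", "type": "string"}],
    })

    def decode_confluent_avro(kv):
        # Confluent wire format: 1 magic byte + 4-byte schema id + Avro body.
        # With the default byte deserializers, ReadFromKafka hands us
        # (key, value) tuples of bytes.
        _key, value = kv
        return fastavro.schemaless_reader(io.BytesIO(value[5:]), VALUE_SCHEMA)

    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        records = (
            p
            | "ReadKafka" >> ReadFromKafka(
                consumer_config={"bootstrap.servers": "broker:9092"},
                topics=["cdc-topic"])
            | "DecodeAvro" >> beam.Map(decode_confluent_avro)
        )
        # ...and here we would need a streaming Avro sink for GCS, which is
        # exactly the part we cannot find for Python.
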
We found out that the current Avro and Parquet sinks with GCS as the 
destination are not supported for Python at the moment, which is basically 
the main blocker now.

Any recommendations or suggestions would be highly appreciated!

Maybe such a solution really does not exist and we need to create our own 
custom connector for it. The question in that case would be whether that is 
even theoretically possible, since we would really like to avoid another 
dead end.
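
To judge whether a custom connector is even feasible, this is roughly what we 
imagine it could look like: a custom FileSink for fileio.WriteToFiles that 
writes Avro container files with fastavro. Again just an untested sketch 
under the assumption that this combination behaves well in a streaming 
pipeline, not something we have running:

    import apache_beam as beam
    import fastavro
    from apache_beam.io import fileio

    class AvroFileSink(fileio.FileSink):
        """Writes one Avro container file per output file of WriteToFiles."""

        def __init__(self, schema):
            # 'schema' is the plain Avro schema dict for the records.
            self._schema = fastavro.parse_schema(schema)

        def open(self, fh):
            self._fh = fh
            self._records = []

        def write(self, record):
            # Buffer the bundle in memory for simplicity; a real sink would
            # append records to the Avro file incrementally.
            self._records.append(record)

        def flush(self):
            fastavro.writer(self._fh, self._schema, self._records)

    # Usage (windowed, so that files get finalized in streaming mode):
    #   records
    #   | beam.WindowInto(beam.window.FixedWindows(60))
    #   | fileio.WriteToFiles(path="gs://bucket/cdc/",
    #                         sink=lambda dest: AvroFileSink(value_schema))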

Thanks a lot for any help!

Kind regards,
Ondrej
