Hi there,

My company is in the process of rebuilding some of our batch Spark-based ETL pipelines in Flink. We use protobuf to define our schemas. One major challenge is that Flink's protobuf deserialization differs semantically from the ScalaPB encoders we use in our Spark systems. This is a serious barrier to adoption, because moving any given dataset from Spark to Flink can break all of its downstream consumers.

To make the differences concrete, here's a small example (the schema is hypothetical, and the "today" mapping is my reading of current flink-protobuf behavior):
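    syntax = "proto3";

    import "google/protobuf/wrappers.proto";
    import "google/protobuf/timestamp.proto";

    message UserEvent {
      google.protobuf.StringValue user_id = 1;
      google.protobuf.Timestamp event_time = 2;
      int64 session_length_ms = 3;  // our services treat 0 as unset
    }

Today flink-protobuf treats the well-known types as ordinary messages, so the decoded row type is roughly:

    ROW<user_id ROW<value STRING>,
        event_time ROW<seconds BIGINT, nanos INT>,
        session_length_ms BIGINT>

while the shape our downstream consumers depend on (and that feature requests 1 and 2 below ask for) is roughly:

    ROW<user_id STRING,           -- nullable; NULL when the wrapper is unset
        event_time TIMESTAMP(9),  -- a real timestamp type
        session_length_ms BIGINT>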
I have a long list of feature requests in this area:

1. Support for mapping the protobuf optional wrapper types (StringValue, Int32Value, and friends) to nullable primitive types rather than RowTypes.

2. Support for mapping the protobuf Timestamp type to a real timestamp type rather than a RowType.

3. A way to define custom mappings from specific proto types to custom Flink types. (The previous two feature requests could be implemented on top of this lower-level feature.)

4. Nullability semantics for message types. In the status quo, an unset message is treated as equivalent to a message with default values for all of its fields, which is a confusing user experience. (See the P.S. below for a concrete illustration.)

5. Nullability semantics for primitive types. In many of our services, the default value of a primitive field is treated as equivalent to unset/null, so it would be useful if the data warehouse could honor that convention.

Would Flink accept patches for any or all of these feature requests? We're contemplating forking flink-protobuf internally, but it would be better if we could just upstream the relevant changes. (To my mind, 1, 2, and 4 are broadly applicable features that are definitely worthy of upstream support; 3 and 5 may be more specific to our use case.)

Thanks,
Adam Richardson
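P.S. A minimal sketch of item 4, using the hypothetical UserEvent schema above. The protobuf-java API calls are standard; only the schema and class names are made up:

    import com.google.protobuf.StringValue;

    public class WrapperSemanticsDemo {
      public static void main(String[] args) {
        // UserEvent is the class protoc generates from the schema above.
        UserEvent unset = UserEvent.newBuilder().build();
        UserEvent explicitDefault = UserEvent.newBuilder()
            .setUserId(StringValue.getDefaultInstance())
            .build();

        // The two messages differ on the wire and via the generated hasser...
        System.out.println(unset.hasUserId());            // false
        System.out.println(explicitDefault.hasUserId());  // true

        // ...but today flink-protobuf decodes user_id to the same non-null
        // ROW('') in both cases. We'd like the unset case to decode to NULL.
      }
    }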