We have the following use case: we are reading a stream of events that we want to write to different Parquet files based on data within each element <IN>. The end goal is to register these Parquet files in Hive so they can be queried. I was exploring StreamingFileSink for this use case but found a few things I could not customize.
It looks like StreamingFileSink takes a single schema, i.e. it provides a ParquetWriter for one specific schema. Since the elements need different Avro schemas depending on their data, I could not use the sink as-is (AvroParquetWriters needs the same Schema for the parquetBuilder). Looking a bit deeper, I found that there is a WriterFactory here: https://tinyurl.com/y68drj35 . This could be extended to create a BulkPartWriter based on the BucketID, something like BulkWriter.Factory<IN, BucketID> writerFactory, so that a unique ParquetWriter is created for each bucket (a rough sketch of what I mean is below). Is there any other option to do this? I also considered using a common schema covering all the different schemas, but I have not fully explored that option.
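To make the idea concrete, here is a minimal sketch of the kind of bucket-aware factory I have in mind. This is not an existing Flink interface: BucketAwareBulkWriterFactory, PerBucketParquetWriterFactory and the schema map are made-up names, and the internal BulkPartWriter/Buckets classes would have to be changed to call it.

import java.io.IOException;
import java.io.Serializable;
import java.util.Map;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.serialization.BulkWriter;
import org.apache.flink.core.fs.FSDataOutputStream;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;

// Hypothetical variant of BulkWriter.Factory that also receives the bucket id,
// so a per-bucket Avro schema can be picked whenever a new part file is opened.
interface BucketAwareBulkWriterFactory<IN, BucketID> extends Serializable {
    BulkWriter<IN> create(BucketID bucketId, FSDataOutputStream out) throws IOException;
}

// Example implementation: look up the Avro schema for the bucket and delegate
// to the existing Parquet writer factory for that schema.
class PerBucketParquetWriterFactory
        implements BucketAwareBulkWriterFactory<GenericRecord, String> {

    // bucket id -> Avro schema JSON (kept as JSON because Schema is not Serializable)
    private final Map<String, String> schemaJsonPerBucket;

    PerBucketParquetWriterFactory(Map<String, String> schemaJsonPerBucket) {
        this.schemaJsonPerBucket = schemaJsonPerBucket;
    }

    @Override
    public BulkWriter<GenericRecord> create(String bucketId, FSDataOutputStream out)
            throws IOException {
        Schema schema = new Schema.Parser().parse(schemaJsonPerBucket.get(bucketId));
        return ParquetAvroWriters.forGenericRecord(schema).create(out);
    }
}

(In the code I used ParquetAvroWriters, which is the factory name in my Flink version; the same idea should apply with AvroParquetWriters in newer releases.)

Thanks
Kailash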