We have the following use case: we are reading a stream of events which we
want to write to different Parquet files based on data within the element
<IN>. The end goal is to register these Parquet files in Hive for querying.
I was exploring the option of using StreamingFileSink for this use case but
found a few things which I could not customize.

It looks like StreamingFileSink takes a single schema / provides a
ParquetWriter for a specific schema. Since the elements need different Avro
schemas based on the data they contain, I could not use the sink as-is
(AvroParquetWriters needs the same Schema for the parquetBuilder). Looking a
bit deeper I found that there is a WriterFactory here:
https://tinyurl.com/y68drj35 . This could be extended to create a
BulkPartWriter based on the BucketID, with something like
BulkWriter.Factory<IN, BucketID> writerFactory. That way a unique
ParquetWriter can be created for each bucket (rough sketch below). Is there
any other option to do this?

I also considered using a common schema that covers all the different event
schemas, but I have not fully explored that option; a sketch of what I mean
is below.
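
If I went down that path, I imagine something like a superset schema where
the per-event fields are optional (again only a sketch; the field names are
made up for illustration):

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;

    public class CommonSchema {
        // Superset schema: a required discriminator plus optional fields
        // drawn from the individual event schemas, so a single ParquetWriter
        // (one schema) can cover all events.
        public static final Schema COMMON = SchemaBuilder.record("CommonEvent")
                .fields()
                .requiredString("eventType")
                .optionalString("fieldFromSchemaA")
                .optionalLong("fieldFromSchemaB")
                .endRecord();
    }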

Thanks
Kailash
