Hey all,

I have a job that writes similarly shaped data to a location in S3.
Currently it writes a map of data with each row. I'd like to customize the
job to "explode" the map into column names and values, which are consistent
within a single bucket. Is there any way to do this, i.e. to provide a
custom Parquet schema per bucket within a single dynamic sink?
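
For context, here's roughly the shape of the sink today. This is a minimal
sketch against the unified FileSink and the Avro-backed Parquet writers from
flink-parquet; the path and class name are placeholders, and on older Flink
versions AvroParquetWriters is ParquetAvroWriters:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.AvroParquetWriters;

public class SingleSchemaSink {
    // The writer factory is fixed once at build time, and
    // BulkWriter.Factory#create(FSDataOutputStream) never sees the bucket
    // ID, so every bucket ends up sharing this one Parquet schema.
    public static FileSink<GenericRecord> build(Schema recordSchema) {
        return FileSink
            .forBulkFormat(
                new Path("s3://example-bucket/output"), // placeholder path
                AvroParquetWriters.forGenericRecord(recordSchema))
            .build();
    }
}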

I've started looking at what changes in the main codebase would make this
feasible. It seems straightforward to pass the bucketId to the
writerFactory, and the bucketId could be a type that carries the relevant
schema information. However, the BulkFormatBuilder has several spots where
the BucketId is required to be a String: specifically, both the
BucketAssigner and the CheckpointRollingPolicy appear to require a bucketId
of type String.
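
To make that concrete, this is the kind of assigner you can write today.
The interface itself is generic in BucketID, but the builder pins it to
String; the Row element type and its bucket field below are hypothetical
stand-ins for my job's types:

import org.apache.flink.core.io.SimpleVersionedSerializer;
import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;

/** Hypothetical element type: the target bucket plus a map of column -> value. */
record Row(String bucket, java.util.Map<String, String> columns) {}

public class MapRowBucketAssigner implements BucketAssigner<Row, String> {
    @Override
    public String getBucketId(Row element, Context context) {
        // Only a String can come out of here today; the idea above is a
        // richer BucketID type that could also carry the bucket's column
        // names / Parquet schema through to the writer factory.
        return element.bucket();
    }

    @Override
    public SimpleVersionedSerializer<String> getSerializer() {
        return SimpleVersionedStringSerializer.INSTANCE;
    }
}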

I'm curious whether this is a change the community would be open to, and/or
whether there is another way to accomplish what I'm looking for that I've
missed.
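
For what it's worth, the only workaround I've come up with so far is to
skip the dynamic sink and attach one FileSink per bucket, each with its own
schema. A sketch, assuming the set of buckets and their schemas is known up
front (Row is the same hypothetical type as above):

import java.util.Map;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.avro.typeutils.GenericRecordAvroTypeInfo;
import org.apache.flink.formats.parquet.avro.AvroParquetWriters;
import org.apache.flink.streaming.api.datastream.DataStream;

public class PerBucketSinks {
    /** Hypothetical input row: the target bucket plus a map of column -> value. */
    public record Row(String bucket, Map<String, String> columns) {}

    public static void attach(DataStream<Row> rows, Map<String, Schema> schemaByBucket) {
        for (Map.Entry<String, Schema> entry : schemaByBucket.entrySet()) {
            String bucket = entry.getKey();
            Schema schema = entry.getValue();
            // Pass the schema into the lambda as JSON so the closure stays
            // serializable; a RichMapFunction could parse it once in open()
            // instead of once per record.
            String schemaJson = schema.toString();
            rows.filter(r -> bucket.equals(r.bucket()))
                .map(r -> {
                    // "Explode" the map: one Avro field per map key.
                    GenericRecord rec =
                        new GenericData.Record(new Schema.Parser().parse(schemaJson));
                    r.columns().forEach(rec::put);
                    return rec;
                }, new GenericRecordAvroTypeInfo(schema))
                .sinkTo(FileSink
                    .forBulkFormat(
                        new Path("s3://example-bucket/" + bucket), // placeholder path
                        AvroParquetWriters.forGenericRecord(schema))
                    .build());
        }
    }
}

The obvious downsides are one sink (and its writers) per bucket and needing
the bucket set at job-build time, which is why generalizing the BucketID
still seems worth discussing.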

Thanks,
Zack
