Hello, I'm working on a project where I stream data in from Kafka, massage it a bit, and then write it into HDFS using the RollingSink. This works just fine using the provided examples, but I would like the data to be stored on HDFS as ORC files rather than sequence files.
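For reference, what I have working today looks roughly like this (a minimal sketch; the base path, bucket format and element type are placeholders, and the fromElements source just stands in for my Kafka source and the "massaging" step):

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.fs.DateTimeBucketer;
    import org.apache.flink.streaming.connectors.fs.RollingSink;
    import org.apache.flink.streaming.connectors.fs.SequenceFileWriter;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Stand-in for the Kafka source plus the "massage it a bit" step
    DataStream<Tuple2<LongWritable, Text>> massaged =
            env.fromElements(Tuple2.of(new LongWritable(1L), new Text("example")));

    // Rolling sink writing Hadoop SequenceFiles into time-bucketed directories
    RollingSink<Tuple2<LongWritable, Text>> sink =
            new RollingSink<>("hdfs:///data/events");
    sink.setBucketer(new DateTimeBucketer("yyyy-MM-dd--HH"));
    sink.setWriter(new SequenceFileWriter<LongWritable, Text>());

    massaged.addSink(sink);
    env.execute("kafka-to-hdfs");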
I am, however, unsure how to do this. I'm trying to create a new Writer class that can be set on the sink via setWriter, but the open() method of the Writer interface has the signature:

    open(org.apache.hadoop.fs.FSDataOutputStream outStream)

This is problematic, because the ORC writer API has a constructor with the signature:

    WriterImpl(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path path, OrcFile.WriterOptions opts)

Since it's not possible to extract the FileSystem/Path from an FSDataOutputStream, or alternatively to inject the FSDataOutputStream into the ORC writer (it appears to require the path for some memory management), I looked at Parquet, which turns out to have the following constructor signature for its writer:

    ParquetFileWriter(Configuration configuration, MessageType schema, Path file)

Again, the file path is required, so at this point I'm a bit lost as to how to proceed.

Looking over the RollingSink code, it appears that the Writer interface could be changed to accept the FileSystem and Path, and the current Writer functionality for managing the FSDataOutputStream could be moved into a class implementing the new interface (a rough sketch of what I have in mind is in the P.S. below). That way the RollingSink could easily interface with Parquet and ORC from the Hadoop ecosystem, and it seems like this would not affect the failure semantics of the sink.

I might of course be missing something obvious - if so, any hints would be greatly appreciated, as this is my first venture into big data, and especially into Flink, which I'm enjoying very much! :)

Best regards!
Lasse
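P.S. For concreteness, here is a rough sketch of the kind of Writer interface change I am imagining. This is not actual Flink code - the method set is only my guess at what the current Writer needs, with open() changed to take the FileSystem and Path instead of an already-opened stream - and the ORC skeleton below it hand-waves the schema/ObjectInspector setup and the row conversion; it is only meant to show that the FileSystem and Path are all the ORC side really needs:

    import java.io.IOException;
    import java.io.Serializable;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.io.orc.OrcFile;

    // Hypothetical replacement for the current Writer interface: the sink hands
    // over the FileSystem and target Path instead of an FSDataOutputStream.
    public interface Writer<T> extends Serializable {
        void open(FileSystem fs, Path path) throws IOException;
        void write(T element) throws IOException;
        void flush() throws IOException;
        void close() throws IOException;
    }

    // Skeleton of an ORC writer on top of that interface (schema/ObjectInspector
    // configuration and the conversion of T into ORC rows are left out).
    class OrcSinkWriter<T> implements Writer<T> {
        private transient org.apache.hadoop.hive.ql.io.orc.Writer orc;

        @Override
        public void open(FileSystem fs, Path path) throws IOException {
            // The WriterOptions would also need an ObjectInspector describing the schema
            orc = OrcFile.createWriter(path,
                    OrcFile.writerOptions(fs.getConf()).fileSystem(fs));
        }

        @Override
        public void write(T element) throws IOException {
            orc.addRow(element); // assumes element matches the configured inspector
        }

        @Override
        public void flush() throws IOException {
            // ORC buffers internally and writes stripes on its own / on close
        }

        @Override
        public void close() throws IOException {
            if (orc != null) {
                orc.close();
            }
        }
    }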