You would probably use the Hadoop parquet-mr WriteSupport, which has less to do with MapReduce and more to do with all the encodings that go into writing a Parquet file. Avro as an intermediate serialization works great, but I think most of your work would be in managing rolling from one file to the next. There is a post-processing step at the end of every Parquet file write where metadata is extracted. Also, all row groups are kept in memory during the write, so their sizing should be sane. Overall I think you should be able to do it; I've done something similar in the past.
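
For what it's worth, here is a rough sketch of what that writer side could look like with parquet-avro's AvroParquetWriter, including the file rolling and the row-group sizing mentioned above. The class name, output path layout, and record-count rolling threshold are illustrative assumptions, not anything from this thread.

// Rough sketch only: roll Avro GenericRecords into Parquet files on HDFS
// using parquet-avro's AvroParquetWriter. Names and thresholds are illustrative.
import java.io.Closeable;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class RollingParquetWriter implements Closeable {

  private final Schema schema;
  private final Path baseDir;
  private final long maxRecordsPerFile;  // roll to a new file after this many records

  private ParquetWriter<GenericRecord> writer;
  private long recordsInCurrentFile;
  private int fileIndex;

  public RollingParquetWriter(Schema schema, Path baseDir, long maxRecordsPerFile) {
    this.schema = schema;
    this.baseDir = baseDir;
    this.maxRecordsPerFile = maxRecordsPerFile;
  }

  public void write(GenericRecord record) throws IOException {
    if (writer == null || recordsInCurrentFile >= maxRecordsPerFile) {
      roll();
    }
    writer.write(record);
    recordsInCurrentFile++;
  }

  // Closing the current writer is when the Parquet footer metadata gets
  // written out; then open the next file in the sequence.
  private void roll() throws IOException {
    close();
    Path file = new Path(baseDir, "part-" + fileIndex++ + ".parquet");
    writer = AvroParquetWriter.<GenericRecord>builder(file)
        .withSchema(schema)
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .withRowGroupSize(64 * 1024 * 1024)  // row groups are buffered in memory during write
        .build();
    recordsInCurrentFile = 0;
  }

  @Override
  public void close() throws IOException {
    if (writer != null) {
      writer.close();
      writer = null;
    }
  }
}
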
On 7/25/16, 10:20 AM, "Dustin Cote" <dus...@confluent.io> wrote:

I believe what you are looking for is a ParquetSerializer; I'm not aware of
any existing ones. In that case, you'd have to write your own, and your
AvroSerializer is probably a good template to work from. Then you would just
use the HDFS Sink Connector again and change the serialization format to use
your newly written Parquet serializer.

On Mon, Jul 25, 2016 at 12:35 AM, Kidong Lee <mykid...@gmail.com> wrote:

> Hi,
>
> I have read the Confluent Kafka Connect HDFS documentation
> <http://docs.confluent.io/2.0.0/connect/connect-hdfs/docs/index.html>, but
> I don't want to use the Schema Registry from Confluent.
>
> I have produced Avro-encoded bytes to Kafka; at that time I wrote my own
> Avro serializer rather than using KafkaAvroSerializer
> <https://github.com/confluentinc/schema-registry/blob/master/avro-serializer/src/main/java/io/confluent/kafka/serializers/KafkaAvroSerializer.java>,
> which seems to be closely tied to the Schema Registry concept from
> Confluent.
>
> Now I want to save my Avro-encoded data from Kafka to Parquet on HDFS,
> using an Avro schema which is located on the classpath, for instance
> /META-INF/avro/xxx.avsc.
>
> Any idea how to write such a Parquet sink?
>
> - Kidong Lee.

--
Dustin Cote
confluent.io
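
For reference, a rough sketch of the consumer-side piece Kidong describes: loading the Avro schema from the classpath and decoding the raw Avro bytes from Kafka into GenericRecords. This assumes the records are plain Avro binary with no Schema Registry wire format; the class name is illustrative, and the resource path just mirrors the one in the question.

// Rough sketch only: load an Avro schema from a classpath resource and
// decode plain Avro-binary bytes (no Schema Registry framing) into
// GenericRecords.
import java.io.IOException;
import java.io.InputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

public class ClasspathAvroDeserializer {

  private final Schema schema;
  private final GenericDatumReader<GenericRecord> reader;

  public ClasspathAvroDeserializer(String schemaResource) throws IOException {
    try (InputStream in = getClass().getResourceAsStream(schemaResource)) {
      this.schema = new Schema.Parser().parse(in);
    }
    this.reader = new GenericDatumReader<>(schema);
  }

  public Schema schema() {
    return schema;
  }

  public GenericRecord deserialize(byte[] avroBytes) throws IOException {
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(avroBytes, null);
    return reader.read(null, decoder);
  }
}

// e.g. new ClasspathAvroDeserializer("/META-INF/avro/xxx.avsc").deserialize(record.value())

The GenericRecords it returns could then be fed straight into a Parquet writer like the one sketched earlier in the thread.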