Kidong,

Yes, if you are using a different format for serializing data in Kafka, the Converter interface is what you'd need to implement. We isolated serialization + conversion from connectors precisely so connectors don't need to worry about the exact format of data in Kafka, instead only having to work with a generic runtime data API. If you write that converter (or at least the half for converting from byte[] -> Connect data API), the existing functionality in the HDFS connector should work for you. You don't even necessarily need a complete implementation supporting all the data types in Connect if you only use a subset of them in practice.
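To give a feel for the shape of such a converter, here's a rough, untested sketch that loads an Avro schema from the classpath (no Schema Registry involved) and implements org.apache.kafka.connect.storage.Converter. The package name, schema path, and the "id"/"name" fields are only placeholders for whatever your own schema actually contains:

// Untested sketch of a Converter that decodes Avro bytes using a schema bundled
// on the classpath. Class name, schema path, and field names are placeholders.
package example;

import java.io.InputStream;
import java.util.Map;

import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DecoderFactory;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaAndValue;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.errors.DataException;
import org.apache.kafka.connect.storage.Converter;

public class ClasspathAvroConverter implements Converter {

    private org.apache.avro.Schema avroSchema;
    private Schema connectSchema;

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        // Load the Avro schema shipped with the converter instead of fetching
        // it from a Schema Registry.
        try (InputStream in =
                 getClass().getResourceAsStream("/META-INF/avro/xxx.avsc")) {
            avroSchema = new org.apache.avro.Schema.Parser().parse(in);
        } catch (Exception e) {
            throw new DataException("Failed to load Avro schema from classpath", e);
        }
        // Build a Connect schema only for the fields you actually use.
        connectSchema = SchemaBuilder.struct().name(avroSchema.getFullName())
            .field("id", Schema.INT64_SCHEMA)       // placeholder field
            .field("name", Schema.STRING_SCHEMA)    // placeholder field
            .build();
    }

    @Override
    public byte[] fromConnectData(String topic, Schema schema, Object value) {
        // Only needed if you also want to write Connect data back into Kafka;
        // for a sink-only pipeline this half can stay unimplemented.
        throw new UnsupportedOperationException("Not implemented");
    }

    @Override
    public SchemaAndValue toConnectData(String topic, byte[] value) {
        // byte[] from Kafka -> Avro GenericRecord -> Connect Struct.
        // (Null/tombstone handling omitted for brevity.)
        try {
            DatumReader<GenericRecord> reader = new GenericDatumReader<>(avroSchema);
            GenericRecord record =
                reader.read(null, DecoderFactory.get().binaryDecoder(value, null));
            Struct struct = new Struct(connectSchema)
                .put("id", record.get("id"))
                .put("name", record.get("name").toString());
            return new SchemaAndValue(connectSchema, struct);
        } catch (Exception e) {
            throw new DataException("Failed to deserialize Avro record", e);
        }
    }
}

With something like that on the worker's classpath and configured as the value converter, the HDFS connector's ParquetFormat should be able to write your records without any other changes.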
-Ewen

On Mon, Jul 25, 2016 at 7:55 PM, Kidong Lee <mykid...@gmail.com> wrote:
> Hi Ewen,
>
> do you mean, I should implement avro converter like AvroConverter
> <https://github.com/confluentinc/schema-registry/blob/master/avro-converter/src/main/java/io/confluent/connect/avro/AvroConverter.java>
> of confluent?
> I think, I should also understand connect internal data structure which is
> a bit complicated.
>
> - Kidong.
>
>
> 2016-07-26 2:54 GMT+09:00 Ewen Cheslack-Postava <e...@confluent.io>:
>
> > If I'm understanding your setup properly, you need a way to convert your
> > data from your own Avro format to Connect format. From there, the existing
> > Parquet support in the HDFS connector should work for you. So what you need
> > is your own implementation of an AvroConverter, which is what loads the
> > data from Kafka and turns it from byte[] to Connect's data API. Then you'd
> > configure your HDFS connector with
> > format.class=io.confluent.connect.hdfs.parquet.ParquetFormat.
> >
> > -Ewen
> >
> > On Mon, Jul 25, 2016 at 7:32 AM, Clifford Resnick <cresn...@mediamath.com>
> > wrote:
> >
> > > You would probably use the Hadoop parquet-mr WriteSupport, which has less
> > > to do with mapreduce, more to do with all the encodings that go into
> > > writing a Parquet file. Avro as an intermediate serialization works great,
> > > but I think most of your work would be in managing rolling from one file
> > > to the next. There is a post process for every parquet file write where
> > > metadata is extracted. Also, all Row Groups are kept in memory during
> > > write so their sizing should be sane. Overall I think you should be able
> > > to do it. I’ve done similar in the past.
> > >
> > > On 7/25/16, 10:20 AM, "Dustin Cote" <dus...@confluent.io> wrote:
> > >
> > > I believe what you are looking for is a ParquetSerializer which I'm not
> > > aware of any existing ones. In that case, you'd have to write your own,
> > > and your AvroSerializer is probably a good thing to template from. Then
> > > you would just use the HDFSSink Connector again and change the
> > > serialization format to use your newly written Parquet Serializer.
> > >
> > > On Mon, Jul 25, 2016 at 12:35 AM, Kidong Lee <mykid...@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > I have read confluent kafka connect hdfs
> > > > <http://docs.confluent.io/2.0.0/connect/connect-hdfs/docs/index.html>
> > > > but I don't want to use schema registry from confluent.
> > > >
> > > > I have produced avro encoded bytes to kafka, at that time, I have
> > > > written my own avro serializer, not used KafkaAvroSerializer
> > > > <https://github.com/confluentinc/schema-registry/blob/master/avro-serializer/src/main/java/io/confluent/kafka/serializers/KafkaAvroSerializer.java>
> > > > which seems to be related closely to Schema registry concept from
> > > > confluent.
> > > >
> > > > Now, I want to save my avro encoded from kafka to parquet on hdfs using
> > > > Avro schema which is located in the classpath, for instance,
> > > > /META-INF/avro/xxx.avsc.
> > > >
> > > > Any idea to write parquet sink?
> > > >
> > > > - Kidong Lee.
> > >
> > > --
> > > Dustin Cote
> > > confluent.io
> >
> > --
> > Thanks,
> > Ewen

--
Thanks,
Ewen
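For reference, pulling the suggestions in this thread together, a standalone setup might look roughly like the sketch below. Only format.class comes from the thread; the connector name, topic, HDFS URL, flush size, and the converter class (the hypothetical ClasspathAvroConverter above) are placeholders to adapt:

# Worker properties (e.g. connect-standalone.properties); the value converter
# points at the hypothetical converter sketched earlier in this thread.
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=example.ClasspathAvroConverter

# HDFS sink connector properties; everything except format.class is a placeholder.
name=hdfs-parquet-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
topics=your-topic
hdfs.url=hdfs://namenode:8020
flush.size=1000
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat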