Hi Ewen, do you mean I should implement an Avro converter like Confluent's AvroConverter <https://github.com/confluentinc/schema-registry/blob/master/avro-converter/src/main/java/io/confluent/connect/avro/AvroConverter.java>? I think I would also need to understand Connect's internal data structures, which are a bit complicated.
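
Something like the following is what I have in mind for that converter (just a rough sketch, assuming I can reuse Confluent's AvroData helper to map Avro records onto Connect's data API and keep loading the schema from /META-INF/avro/xxx.avsc; the class and package names are placeholders):

package com.example.connect.avro;  // placeholder package

import java.io.IOException;
import java.util.Map;

import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaAndValue;
import org.apache.kafka.connect.errors.DataException;
import org.apache.kafka.connect.storage.Converter;

import io.confluent.connect.avro.AvroData;

/**
 * Sketch of a converter that reads plain Avro-encoded bytes (no Schema
 * Registry framing) using a schema bundled on the classpath, then uses
 * Confluent's AvroData helper to translate the record into Connect's data API.
 */
public class ClasspathAvroConverter implements Converter {

  // Schema location is an assumption; adjust to your own .avsc file.
  private static final String SCHEMA_PATH = "/META-INF/avro/xxx.avsc";

  private org.apache.avro.Schema avroSchema;
  private GenericDatumReader<GenericRecord> reader;
  private AvroData avroData;

  @Override
  public void configure(Map<String, ?> configs, boolean isKey) {
    try {
      avroSchema = new org.apache.avro.Schema.Parser()
          .parse(getClass().getResourceAsStream(SCHEMA_PATH));
    } catch (IOException e) {
      throw new DataException("Failed to load Avro schema from " + SCHEMA_PATH, e);
    }
    reader = new GenericDatumReader<>(avroSchema);
    avroData = new AvroData(100);  // schema cache size
  }

  @Override
  public SchemaAndValue toConnectData(String topic, byte[] value) {
    if (value == null) {
      return SchemaAndValue.NULL;
    }
    try {
      // Deserialize the raw Avro bytes with the classpath schema...
      GenericRecord record =
          reader.read(null, DecoderFactory.get().binaryDecoder(value, null));
      // ...and let AvroData map Avro types onto Connect's Schema/Struct types.
      return avroData.toConnectData(avroSchema, record);
    } catch (IOException e) {
      throw new DataException("Failed to deserialize Avro record from topic " + topic, e);
    }
  }

  @Override
  public byte[] fromConnectData(String topic, Schema schema, Object value) {
    // Not needed for a sink-only pipeline; left unimplemented in this sketch.
    throw new UnsupportedOperationException("fromConnectData is not implemented");
  }
}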
- Kidong.

2016-07-26 2:54 GMT+09:00 Ewen Cheslack-Postava <e...@confluent.io>:

> If I'm understanding your setup properly, you need a way to convert your
> data from your own Avro format to Connect format. From there, the existing
> Parquet support in the HDFS connector should work for you. So what you need
> is your own implementation of an AvroConverter, which is what loads the
> data from Kafka and turns it from byte[] into Connect's data API. Then you'd
> configure your HDFS connector with
> format.class=io.confluent.connect.hdfs.parquet.ParquetFormat.
>
> -Ewen
>
> On Mon, Jul 25, 2016 at 7:32 AM, Clifford Resnick <cresn...@mediamath.com>
> wrote:
>
> > You would probably use the Hadoop parquet-mr WriteSupport, which has less
> > to do with MapReduce and more to do with all the encodings that go into
> > writing a Parquet file. Avro as an intermediate serialization works great,
> > but I think most of your work would be in managing rolling from one file
> > to the next. There is a post-process for every Parquet file write where
> > metadata is extracted. Also, all row groups are kept in memory during
> > write, so their sizing should be sane. Overall I think you should be able
> > to do it. I've done similar in the past.
> >
> > On 7/25/16, 10:20 AM, "Dustin Cote" <dus...@confluent.io> wrote:
> >
> > I believe what you are looking for is a ParquetSerializer, and I'm not
> > aware of any existing ones. In that case, you'd have to write your own,
> > and your AvroSerializer is probably a good thing to use as a template.
> > Then you would just use the HDFS Sink Connector again and change the
> > serialization format to use your newly written Parquet serializer.
> >
> > On Mon, Jul 25, 2016 at 12:35 AM, Kidong Lee <mykid...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I have read the Confluent Kafka Connect HDFS docs
> > > <http://docs.confluent.io/2.0.0/connect/connect-hdfs/docs/index.html>,
> > > but I don't want to use Confluent's Schema Registry.
> > >
> > > I have produced Avro-encoded bytes to Kafka. For that I wrote my own
> > > Avro serializer and did not use KafkaAvroSerializer
> > > <https://github.com/confluentinc/schema-registry/blob/master/avro-serializer/src/main/java/io/confluent/kafka/serializers/KafkaAvroSerializer.java>,
> > > which seems to be closely tied to Confluent's Schema Registry concept.
> > >
> > > Now I want to save my Avro-encoded data from Kafka to Parquet on HDFS,
> > > using an Avro schema located on the classpath, for instance
> > > /META-INF/avro/xxx.avsc.
> > >
> > > Any idea how to write a Parquet sink?
> > >
> > >
> > > - Kidong Lee.
> >
> >
> > --
> > Dustin Cote
> > confluent.io
>
>
> --
> Thanks,
> Ewen
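
P.S. If I understand the wiring correctly, the configuration would then look roughly like this (only a sketch; the converter class, topic, HDFS URL, and flush size are placeholders, and the converters go into the Connect worker properties rather than the connector config):

# Connect worker properties, e.g. connect-standalone.properties (sketch)
value.converter=com.example.connect.avro.ClasspathAvroConverter
# key converter is just an example; use whatever matches your key format
key.converter=org.apache.kafka.connect.storage.StringConverter

# HDFS sink connector properties (sketch; values are placeholders)
name=hdfs-parquet-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=my-avro-topic
hdfs.url=hdfs://namenode:8020
flush.size=1000
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat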