Kidong,

Yes, if you are using a different format for serializing data in Kafka, the Converter interface is what you'd need to implement. We isolated serialization + conversion from connectors precisely so connectors don't need to worry about the exact format of data in Kafka, instead only having to work with a generic runtime data API. If you write that converter (or at least the half for converting from byte[] -> Connect data API), the existing functionality in the HDFS connector should work for you. You don't even necessarily need a complete implementation supporting all the data types in Connect if you only use a subset of them in practice.
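To give a feel for the shape of such a converter, here's a rough, untested sketch that loads an Avro schema from the classpath (no Schema Registry involved) and implements org.apache.kafka.connect.storage.Converter. The package name, schema path, and the "id"/"name" fields are only placeholders for whatever your own schema actually contains:

// Untested sketch of a Converter that decodes Avro bytes using a schema bundled
// on the classpath. Class name, schema path, and field names are placeholders.
package example;

import java.io.InputStream;
import java.util.Map;

import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DecoderFactory;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaAndValue;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.errors.DataException;
import org.apache.kafka.connect.storage.Converter;

public class ClasspathAvroConverter implements Converter {

    private org.apache.avro.Schema avroSchema;
    private Schema connectSchema;

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        // Load the Avro schema shipped with the converter instead of fetching
        // it from a Schema Registry.
        try (InputStream in =
                 getClass().getResourceAsStream("/META-INF/avro/xxx.avsc")) {
            avroSchema = new org.apache.avro.Schema.Parser().parse(in);
        } catch (Exception e) {
            throw new DataException("Failed to load Avro schema from classpath", e);
        }
        // Build a Connect schema only for the fields you actually use.
        connectSchema = SchemaBuilder.struct().name(avroSchema.getFullName())
            .field("id", Schema.INT64_SCHEMA)       // placeholder field
            .field("name", Schema.STRING_SCHEMA)    // placeholder field
            .build();
    }

    @Override
    public byte[] fromConnectData(String topic, Schema schema, Object value) {
        // Only needed if you also want to write Connect data back into Kafka;
        // for a sink-only pipeline this half can stay unimplemented.
        throw new UnsupportedOperationException("Not implemented");
    }

    @Override
    public SchemaAndValue toConnectData(String topic, byte[] value) {
        // byte[] from Kafka -> Avro GenericRecord -> Connect Struct.
        // (Null/tombstone handling omitted for brevity.)
        try {
            DatumReader<GenericRecord> reader = new GenericDatumReader<>(avroSchema);
            GenericRecord record =
                reader.read(null, DecoderFactory.get().binaryDecoder(value, null));
            Struct struct = new Struct(connectSchema)
                .put("id", record.get("id"))
                .put("name", record.get("name").toString());
            return new SchemaAndValue(connectSchema, struct);
        } catch (Exception e) {
            throw new DataException("Failed to deserialize Avro record", e);
        }
    }
}

With something like that on the worker's classpath and configured as the value converter, the HDFS connector's ParquetFormat should be able to write your records without any other changes.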
-Ewen

On Mon, Jul 25, 2016 at 7:55 PM, Kidong Lee <mykid...@gmail.com> wrote:
> Hi Ewen,
>
> do you mean, I should implement avro converter like AvroConverter
> <https://github.com/confluentinc/schema-registry/blob/master/avro-converter/src/main/java/io/confluent/connect/avro/AvroConverter.java>
> of confluent?
> I think, I should also understand connect internal data structure which is
> a bit complicated.
>
> - Kidong.
>
>
> 2016-07-26 2:54 GMT+09:00 Ewen Cheslack-Postava <e...@confluent.io>:
>
> > If I'm understanding your setup properly, you need a way to convert your
> > data from your own Avro format to Connect format. From there, the existing
> > Parquet support in the HDFS connector should work for you. So what you need
> > is your own implementation of an AvroConverter, which is what loads the
> > data from Kafka and turns it from byte[] to Connect's data API. Then you'd
> > configure your HDFS connector with
> > format.class=io.confluent.connect.hdfs.parquet.ParquetFormat.
> >
> > -Ewen
> >
> > On Mon, Jul 25, 2016 at 7:32 AM, Clifford Resnick <cresn...@mediamath.com>
> > wrote:
> >
> > > You would probably use the Hadoop parquet-mr WriteSupport, which has less
> > > to do with mapreduce, more to do with all the encodings that go into
> > > writing a Parquet file. Avro as an intermediate serialization works great,
> > > but I think most of your work would be in managing rolling from one file
> > > to the next. There is a post process for every parquet file write where
> > > metadata is extracted. Also, all Row Groups are kept in memory during
> > > write so their sizing should be sane. Overall I think you should be able
> > > to do it. I’ve done similar in the past.
> > >
> > > On 7/25/16, 10:20 AM, "Dustin Cote" <dus...@confluent.io> wrote:
> > >
> > > I believe what you are looking for is a ParquetSerializer which I'm not
> > > aware of any existing ones. In that case, you'd have to write your own,
> > > and your AvroSerializer is probably a good thing to template from. Then
> > > you would just use the HDFSSink Connector again and change the
> > > serialization format to use your newly written Parquet Serializer.
> > >
> > > On Mon, Jul 25, 2016 at 12:35 AM, Kidong Lee <mykid...@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > I have read confluent kafka connect hdfs
> > > > <http://docs.confluent.io/2.0.0/connect/connect-hdfs/docs/index.html>
> > > > but I don't want to use schema registry from confluent.
> > > >
> > > > I have produced avro encoded bytes to kafka, at that time, I have
> > > > written my own avro serializer, not used KafkaAvroSerializer
> > > > <https://github.com/confluentinc/schema-registry/blob/master/avro-serializer/src/main/java/io/confluent/kafka/serializers/KafkaAvroSerializer.java>
> > > > which seems to be related closely to Schema registry concept from
> > > > confluent.
> > > >
> > > > Now, I want to save my avro encoded from kafka to parquet on hdfs using
> > > > Avro schema which is located in the classpath, for instance,
> > > > /META-INF/avro/xxx.avsc.
> > > >
> > > > Any idea to write parquet sink?
> > > >
> > > > - Kidong Lee.
> > >
> > > --
> > > Dustin Cote
> > > confluent.io
> >
> > --
> > Thanks,
> > Ewen

--
Thanks,
Ewen
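For reference, pulling the suggestions in this thread together, a standalone setup might look roughly like the sketch below. Only format.class comes from the thread; the connector name, topic, HDFS URL, flush size, and the converter class (the hypothetical ClasspathAvroConverter above) are placeholders to adapt:

# Worker properties (e.g. connect-standalone.properties); the value converter
# points at the hypothetical converter sketched earlier in this thread.
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=example.ClasspathAvroConverter

# HDFS sink connector properties; everything except format.class is a placeholder.
name=hdfs-parquet-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
topics=your-topic
hdfs.url=hdfs://namenode:8020
flush.size=1000
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat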