Hi Ewen, do you mean I should implement an Avro converter like Confluent's AvroConverter <https://github.com/confluentinc/schema-registry/blob/master/avro-converter/src/main/java/io/confluent/connect/avro/AvroConverter.java>? I think I would also need to understand Connect's internal data structures, which are a bit complicated.
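
Something like the following is what I have in mind for that converter (just a rough sketch, assuming I can reuse Confluent's AvroData helper to map Avro records onto Connect's data API and keep loading the schema from /META-INF/avro/xxx.avsc; the class and package names are placeholders):

package com.example.connect.avro;  // placeholder package

import java.io.IOException;
import java.util.Map;

import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaAndValue;
import org.apache.kafka.connect.errors.DataException;
import org.apache.kafka.connect.storage.Converter;

import io.confluent.connect.avro.AvroData;

/**
 * Sketch of a converter that reads plain Avro-encoded bytes (no Schema
 * Registry framing) using a schema bundled on the classpath, then uses
 * Confluent's AvroData helper to translate the record into Connect's data API.
 */
public class ClasspathAvroConverter implements Converter {

  // Schema location is an assumption; adjust to your own .avsc file.
  private static final String SCHEMA_PATH = "/META-INF/avro/xxx.avsc";

  private org.apache.avro.Schema avroSchema;
  private GenericDatumReader<GenericRecord> reader;
  private AvroData avroData;

  @Override
  public void configure(Map<String, ?> configs, boolean isKey) {
    try {
      avroSchema = new org.apache.avro.Schema.Parser()
          .parse(getClass().getResourceAsStream(SCHEMA_PATH));
    } catch (IOException e) {
      throw new DataException("Failed to load Avro schema from " + SCHEMA_PATH, e);
    }
    reader = new GenericDatumReader<>(avroSchema);
    avroData = new AvroData(100);  // schema cache size
  }

  @Override
  public SchemaAndValue toConnectData(String topic, byte[] value) {
    if (value == null) {
      return SchemaAndValue.NULL;
    }
    try {
      // Deserialize the raw Avro bytes with the classpath schema...
      GenericRecord record =
          reader.read(null, DecoderFactory.get().binaryDecoder(value, null));
      // ...and let AvroData map Avro types onto Connect's Schema/Struct types.
      return avroData.toConnectData(avroSchema, record);
    } catch (IOException e) {
      throw new DataException("Failed to deserialize Avro record from topic " + topic, e);
    }
  }

  @Override
  public byte[] fromConnectData(String topic, Schema schema, Object value) {
    // Not needed for a sink-only pipeline; left unimplemented in this sketch.
    throw new UnsupportedOperationException("fromConnectData is not implemented");
  }
}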
- Kidong.

2016-07-26 2:54 GMT+09:00 Ewen Cheslack-Postava <e...@confluent.io>:

> If I'm understanding your setup properly, you need a way to convert your
> data from your own Avro format to Connect format. From there, the existing
> Parquet support in the HDFS connector should work for you. So what you need
> is your own implementation of an AvroConverter, which is what loads the
> data from Kafka and turns it from byte[] into Connect's data API. Then you'd
> configure your HDFS connector with
> format.class=io.confluent.connect.hdfs.parquet.ParquetFormat.
>
> -Ewen
>
> On Mon, Jul 25, 2016 at 7:32 AM, Clifford Resnick <cresn...@mediamath.com>
> wrote:
>
> > You would probably use the Hadoop parquet-mr WriteSupport, which has less
> > to do with MapReduce and more to do with all the encodings that go into
> > writing a Parquet file. Avro as an intermediate serialization works great,
> > but I think most of your work would be in managing rolling from one file
> > to the next. There is a post-process for every Parquet file write where
> > metadata is extracted. Also, all row groups are kept in memory during
> > write, so their sizing should be sane. Overall I think you should be able
> > to do it. I've done similar in the past.
> >
> > On 7/25/16, 10:20 AM, "Dustin Cote" <dus...@confluent.io> wrote:
> >
> > I believe what you are looking for is a ParquetSerializer, and I'm not
> > aware of any existing ones. In that case, you'd have to write your own,
> > and your AvroSerializer is probably a good thing to use as a template.
> > Then you would just use the HDFS Sink Connector again and change the
> > serialization format to use your newly written Parquet serializer.
> >
> > On Mon, Jul 25, 2016 at 12:35 AM, Kidong Lee <mykid...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I have read the Confluent Kafka Connect HDFS docs
> > > <http://docs.confluent.io/2.0.0/connect/connect-hdfs/docs/index.html>,
> > > but I don't want to use Confluent's Schema Registry.
> > >
> > > I have produced Avro-encoded bytes to Kafka. For that I wrote my own
> > > Avro serializer and did not use KafkaAvroSerializer
> > > <https://github.com/confluentinc/schema-registry/blob/master/avro-serializer/src/main/java/io/confluent/kafka/serializers/KafkaAvroSerializer.java>,
> > > which seems to be closely tied to Confluent's Schema Registry concept.
> > >
> > > Now I want to save my Avro-encoded data from Kafka to Parquet on HDFS,
> > > using an Avro schema located on the classpath, for instance
> > > /META-INF/avro/xxx.avsc.
> > >
> > > Any idea how to write a Parquet sink?
> > >
> > >
> > > - Kidong Lee.
> >
> >
> > --
> > Dustin Cote
> > confluent.io
>
>
> --
> Thanks,
> Ewen
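
P.S. If I understand the wiring correctly, the configuration would then look roughly like this (only a sketch; the converter class, topic, HDFS URL, and flush size are placeholders, and the converters go into the Connect worker properties rather than the connector config):

# Connect worker properties, e.g. connect-standalone.properties (sketch)
value.converter=com.example.connect.avro.ClasspathAvroConverter
# key converter is just an example; use whatever matches your key format
key.converter=org.apache.kafka.connect.storage.StringConverter

# HDFS sink connector properties (sketch; values are placeholders)
name=hdfs-parquet-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=my-avro-topic
hdfs.url=hdfs://namenode:8020
flush.size=1000
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat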