If I'm understanding your setup properly, you need a way to convert your
data from your own Avro format to Connect's data format. From there, the
existing Parquet support in the HDFS connector should work for you. So what
you need is your own Converter implementation (analogous to Confluent's
AvroConverter), which takes the byte[] read from Kafka and deserializes it
into Connect's data API. Then you'd configure your HDFS connector with
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat.
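
To make that concrete, here's a rough, untested sketch of such a Converter.
It assumes the Avro schema ships on the classpath (your /META-INF/avro/xxx.avsc)
and hard-codes two made-up fields, "id" (long) and "name" (string); a real
implementation would map the Avro schema to the Connect schema generically:

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Map;

    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.kafka.connect.data.Schema;
    import org.apache.kafka.connect.data.SchemaAndValue;
    import org.apache.kafka.connect.data.SchemaBuilder;
    import org.apache.kafka.connect.data.Struct;
    import org.apache.kafka.connect.errors.DataException;
    import org.apache.kafka.connect.storage.Converter;

    public class MyAvroConverter implements Converter {

        // Connect schema mirroring the (made-up) Avro record fields
        private static final Schema CONNECT_SCHEMA = SchemaBuilder.struct()
                .name("com.example.MyRecord")
                .field("id", Schema.INT64_SCHEMA)
                .field("name", Schema.STRING_SCHEMA)
                .build();

        private GenericDatumReader<GenericRecord> reader;

        @Override
        public void configure(Map<String, ?> configs, boolean isKey) {
            try (InputStream in =
                         getClass().getResourceAsStream("/META-INF/avro/xxx.avsc")) {
                org.apache.avro.Schema avroSchema =
                        new org.apache.avro.Schema.Parser().parse(in);
                reader = new GenericDatumReader<>(avroSchema);
            } catch (IOException e) {
                throw new DataException("Failed to load Avro schema from classpath", e);
            }
        }

        @Override
        public byte[] fromConnectData(String topic, Schema schema, Object value) {
            // Only the sink direction is needed for the HDFS connector
            throw new UnsupportedOperationException("Not used by the HDFS sink");
        }

        @Override
        public SchemaAndValue toConnectData(String topic, byte[] value) {
            if (value == null) {
                return new SchemaAndValue(null, null);
            }
            try {
                GenericRecord record =
                        reader.read(null, DecoderFactory.get().binaryDecoder(value, null));
                Struct struct = new Struct(CONNECT_SCHEMA)
                        .put("id", record.get("id"))
                        .put("name", record.get("name").toString());
                return new SchemaAndValue(CONNECT_SCHEMA, struct);
            } catch (IOException e) {
                throw new DataException("Failed to deserialize Avro record", e);
            }
        }
    }

You'd then point the worker at it via value.converter set to your class name,
with the connector's format.class as above.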

-Ewen

On Mon, Jul 25, 2016 at 7:32 AM, Clifford Resnick <cresn...@mediamath.com>
wrote:

> You would probably use the Hadoop parquet-mr WriteSupport, which, despite
> the name, has less to do with MapReduce and more to do with all the
> encodings that go into writing a Parquet file. Avro as an intermediate
> serialization works great, but I think most of your work would be in
> managing rolling from one file to the next. There is a post-processing
> step after every Parquet file write where metadata is extracted. Also,
> all row groups are kept in memory during the write, so their sizing
> should be sane. Overall I think you should be able to do it. I've done
> something similar in the past.
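>
> As a rough, untested sketch (not the connector's code), writing Avro
> GenericRecords out with parquet-mr's AvroParquetWriter could look like
> the following; the output path, schema location, and sizes are made up:
>
> import org.apache.avro.Schema;
> import org.apache.avro.generic.GenericRecord;
> import org.apache.hadoop.fs.Path;
> import org.apache.parquet.avro.AvroParquetWriter;
> import org.apache.parquet.hadoop.ParquetWriter;
> import org.apache.parquet.hadoop.metadata.CompressionCodecName;
>
> public class ParquetWriteSketch {
>     public static void write(Iterable<GenericRecord> records) throws Exception {
>         // Made-up schema location on the classpath
>         Schema schema = new Schema.Parser().parse(
>                 ParquetWriteSketch.class.getResourceAsStream("/META-INF/avro/xxx.avsc"));
>
>         // Row groups are buffered in memory until flushed, so keep them sane
>         try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
>                 .<GenericRecord>builder(new Path("hdfs:///tmp/mytopic-0.parquet"))
>                 .withSchema(schema)
>                 .withCompressionCodec(CompressionCodecName.SNAPPY)
>                 .withRowGroupSize(64 * 1024 * 1024)  // 64 MB row groups
>                 .withPageSize(1024 * 1024)
>                 .build()) {
>             for (GenericRecord record : records) {
>                 writer.write(record);
>             }
>         }  // close() writes the footer metadata for the file
>     }
> }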
>
> On 7/25/16, 10:20 AM, "Dustin Cote" <dus...@confluent.io> wrote:
>
>     I believe what you are looking for is a ParquetSerializer, but I'm not
>     aware of any existing ones.  In that case, you'd have to write your own,
>     and your AvroSerializer is probably a good template to start from. Then
>     you would just use the HDFS Sink Connector again and change the
>     serialization format to use your newly written Parquet serializer.
>
>     On Mon, Jul 25, 2016 at 12:35 AM, Kidong Lee <mykid...@gmail.com>
> wrote:
>
>     > Hi,
>     >
>     > I have read the Confluent Kafka Connect HDFS docs
>     > <http://docs.confluent.io/2.0.0/connect/connect-hdfs/docs/index.html>,
>     > but I don't want to use Confluent's Schema Registry.
>     >
>     > I have produced Avro-encoded bytes to Kafka; for that I wrote my own
>     > Avro serializer rather than using KafkaAvroSerializer
>     > <https://github.com/confluentinc/schema-registry/blob/master/avro-serializer/src/main/java/io/confluent/kafka/serializers/KafkaAvroSerializer.java>,
>     > which seems to be closely tied to Confluent's Schema Registry concept.
>     >
>     > Now I want to save my Avro-encoded data from Kafka as Parquet on HDFS,
>     > using an Avro schema located on the classpath, for instance
>     > /META-INF/avro/xxx.avsc.
>     >
>     > Any ideas on how to write a Parquet sink?
>     >
>     >
>     > - Kidong Lee.
>     >
>
>
>
>     --
>     Dustin Cote
>     confluent.io
>
>
>


-- 
Thanks,
Ewen
