You would probably use the Hadoop parquet-mr WriteSupport, which has less to do 
with MapReduce and more to do with all the encodings that go into writing a 
Parquet file. Avro as an intermediate serialization works great, but I think 
most of your work would be in managing rolling from one file to the next. There 
is a post-process step after every Parquet file write where metadata is 
extracted. Also, all row groups are kept in memory during the write, so their 
sizing should be sane. Overall I think you should be able to do it; I've done 
something similar in the past.
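
For illustration, here is a minimal sketch of that approach using the
parquet-avro module's AvroParquetWriter (which builds on the Avro WriteSupport);
the HDFS path and row group size below are placeholders, and file rolling is
left out:

    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;

    public class AvroToParquetSketch {

        public static void main(String[] args) throws IOException {
            // Load the Avro schema from the classpath (placeholder location).
            Schema schema;
            try (InputStream in = AvroToParquetSketch.class
                    .getResourceAsStream("/META-INF/avro/xxx.avsc")) {
                schema = new Schema.Parser().parse(in);
            }

            // Row groups are buffered in memory until flushed, so keep this sane.
            int rowGroupSize = 64 * 1024 * 1024;

            ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path("hdfs:///tmp/events/part-00000.parquet"))
                .withSchema(schema)
                .withConf(new Configuration())
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .withRowGroupSize(rowGroupSize)
                .build();

            // writer.write(record) would be called once per decoded GenericRecord
            // from Kafka; closing the file triggers the footer/metadata write
            // mentioned above.
            writer.close();
        }
    }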

On 7/25/16, 10:20 AM, "Dustin Cote" <dus...@confluent.io> wrote:

    I believe what you are looking for is a ParquetSerializer, and I'm not
    aware of any existing ones.  In that case, you'd have to write your own,
    and your AvroSerializer is probably a good thing to template from.  Then
    you would just use the HDFS Sink Connector again and change the
    serialization format to use your newly written ParquetSerializer.
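
    Purely as an illustration of the decoding side such a serializer pair
    implies, here is a minimal sketch of a Kafka Deserializer that reads the
    Avro schema from the classpath instead of a Schema Registry (the class
    name is a placeholder, and plain Avro binary encoding is assumed):

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Map;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.kafka.common.serialization.Deserializer;

    // Hypothetical counterpart to a custom Avro serializer: it loads the schema
    // from the classpath rather than contacting a Schema Registry.
    public class ClasspathAvroDeserializer implements Deserializer<GenericRecord> {

        private GenericDatumReader<GenericRecord> reader;

        @Override
        public void configure(Map<String, ?> configs, boolean isKey) {
            try (InputStream in =
                     getClass().getResourceAsStream("/META-INF/avro/xxx.avsc")) {
                Schema schema = new Schema.Parser().parse(in);
                reader = new GenericDatumReader<>(schema);
            } catch (IOException e) {
                throw new RuntimeException("Failed to load Avro schema", e);
            }
        }

        @Override
        public GenericRecord deserialize(String topic, byte[] data) {
            if (data == null) {
                return null;
            }
            try {
                BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(data, null);
                return reader.read(null, decoder);
            } catch (IOException e) {
                throw new RuntimeException("Failed to decode Avro record", e);
            }
        }

        @Override
        public void close() {
            // Nothing to release.
        }
    }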
    
    On Mon, Jul 25, 2016 at 12:35 AM, Kidong Lee <mykid...@gmail.com> wrote:
    
    > Hi,
    >
    > I have read the Confluent Kafka Connect HDFS docs
    > <http://docs.confluent.io/2.0.0/connect/connect-hdfs/docs/index.html>,
    > but I don't want to use the Schema Registry from Confluent.
    >
    > I have produced Avro-encoded bytes to Kafka. For that, I wrote my own
    > Avro serializer instead of using KafkaAvroSerializer
    > <https://github.com/confluentinc/schema-registry/blob/master/avro-serializer/src/main/java/io/confluent/kafka/serializers/KafkaAvroSerializer.java>,
    > which seems to be closely tied to Confluent's Schema Registry concept.
    >
    > Now, I want to save my Avro-encoded data from Kafka to Parquet on HDFS
    > using an Avro schema located on the classpath, for instance
    > /META-INF/avro/xxx.avsc.
    >
    > Any ideas on how to write a Parquet sink?
    >
    >
    > - Kidong Lee.
    >
    
    
    
    -- 
    Dustin Cote
    confluent.io
    
