Look into these two props:

  rotate.schedule.interval.ms
  flush.size
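For example, something along these lines in quickstart-hdfs.properties (a
sketch only; the values are illustrative and should be tuned to your
throughput, and depending on the connector version scheduled rotation may
also require timezone to be set):

  # commit an HDFS file once this many records have accumulated ...
  flush.size=100000
  # ... or once 10 minutes of wall-clock time have passed, whichever
  # comes first
  rotate.schedule.interval.ms=600000
  # some connector versions require a timezone for scheduled rotation
  timezone=UTC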
On Tue, Jul 18, 2017 at 2:46 PM, Abdoulaye Diallo <abdoulaye...@gmail.com> wrote:

> Hi Debasish,
>
> flush.size=3
> This means every batch of 3 messages in that topic ends up in its own
> HDFS file, which is probably why you end up with so many files that ls
> hurts. You should flush a bigger batch, or flush after a high enough
> interval.
>
> tasks.max=1
> Unless you have a single-partition topic, you need to raise this number
> for better parallelism.
>
> HTH,
> Abdoulaye
>
> On Tue, Jul 18, 2017 at 11:12 AM, Debasish Ghosh <ghosh.debas...@gmail.com> wrote:
>
>> Hi -
>>
>> I have a Kafka Streams application that generates Avro records in a
>> topic, which is being read by a Kafka Connect process that uses the HDFS
>> sink connector. The topic has around 1.6 million messages, and the Kafka
>> Connect invocation is as follows ..
>>
>>   bin/connect-standalone \
>>     etc/schema-registry/connect-avro-standalone.properties \
>>     etc/kafka-connect-hdfs/quickstart-hdfs.properties
>>
>> where quickstart-hdfs.properties contains the following ..
>>
>>   name=hdfs-sink
>>   connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
>>   tasks.max=1
>>   topics=avro-topic
>>   hdfs.url=hdfs://0.0.0.0:9000
>>   flush.size=3
>>
>> The problem is that the Kafka Connect process appears to be running in
>> an infinite loop, with messages like the following ..
>>
>>   [2017-07-18 20:02:04,487] INFO Starting commit and rotation for topic
>>   partition avro-topic-0 with start offsets {partition=0=1143033} and end
>>   offsets {partition=0=1143035}
>>   (io.confluent.connect.hdfs.TopicPartitionWriter:297)
>>   [2017-07-18 20:02:04,491] INFO Committed hdfs://0.0.0.0:9000/topics/avro-topic/partition=0/avro-topic+0+0001143033+0001143035.avro
>>   for avro-topic-0 (io.confluent.connect.hdfs.TopicPartitionWriter:625)
>>
>> The result is that so many Avro files get created that I cannot do an
>> ls on the folder.
>>
>>   $ hdfs dfs -ls /topics/avro-topic
>>   Found 1 items
>>   drwxr-xr-x   - debasishghosh supergroup          0 2017-07-18 20:02
>>   /topics/avro-topic/partition=0
>>
>> Trying to list one level deeper in the HDFS folder results in an
>> OutOfMemoryError ..
>>
>>   $ hdfs dfs -ls /topics/avro-topic/partition=0
>>   17/07/18 20:02:19 WARN util.NativeCodeLoader: Unable to load
>>   native-hadoop library for your platform... using builtin-java classes
>>   where applicable
>>   Exception in thread "main" java.lang.OutOfMemoryError: GC overhead
>>   limit exceeded
>>           at java.util.Arrays.copyOfRange(Arrays.java:3664)
>>           at java.lang.String.<init>(String.java:207)
>>           at java.lang.String.substring(String.java:1969)
>>           at java.net.URI$Parser.substring(URI.java:2869)
>>           at java.net.URI$Parser.parseHierarchical(URI.java:3106)
>>           ...
>>
>> Why is the Kafka Connect program going in an infinite loop? How can I
>> prevent it?
>>
>> I am using Confluent 3.2.2 for the schema registry and Avro
>> serialization, and Apache Kafka 0.10.2.1 for the Kafka Streams client
>> and the broker.
>>
>> Help?
>>
>> regards.
>>
>> --
>> Debasish Ghosh
>> http://manning.com/ghosh2
>> http://manning.com/ghosh
>>
>> Twttr: @debasishg
>> Blog: http://debasishg.blogspot.com
>> Code: http://github.com/debasishg
>
> --
> Abdoulaye Diallo

--
Abdoulaye Diallo
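P.S. While the directory still holds that many files, a plain ls may keep
hitting the GC overhead limit. Two standard Hadoop CLI workarounds (a
sketch, assuming default client settings; the paths are the ones from the
mail above):

  # count directories/files/bytes; the summary is computed on the
  # NameNode, so the client never has to hold the full listing
  hdfs dfs -count /topics/avro-topic/partition=0

  # or give the client JVM a bigger heap just for the listing
  HADOOP_CLIENT_OPTS="-Xmx4g" hdfs dfs -ls /topics/avro-topic/partition=0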