Look into these two props:

  rotate.schedule.interval.ms
  flush.size
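For example, something along these lines in quickstart-hdfs.properties (a
sketch only; the values are illustrative and should be tuned to your
throughput, and depending on the connector version scheduled rotation may
also require timezone to be set):

  # commit an HDFS file once this many records have accumulated ...
  flush.size=100000
  # ... or once 10 minutes of wall-clock time have passed, whichever
  # comes first
  rotate.schedule.interval.ms=600000
  # some connector versions require a timezone for scheduled rotation
  timezone=UTC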
On Tue, Jul 18, 2017 at 2:46 PM, Abdoulaye Diallo <abdoulaye...@gmail.com> wrote:

> Hi Debasish,
>
> flush.size=3
> This means every batch of 3 messages in that topic ends up in its own
> HDFS file, which is probably why you end up with so many files that ls
> hurts. You should flush a bigger batch, or flush after a high enough
> interval.
>
> tasks.max=1
> Unless you have a single-partition topic, you need to raise this number
> for better parallelism.
>
> HTH,
> Abdoulaye
>
> On Tue, Jul 18, 2017 at 11:12 AM, Debasish Ghosh <ghosh.debas...@gmail.com> wrote:
>
>> Hi -
>>
>> I have a Kafka Streams application that generates Avro records in a
>> topic, which is being read by a Kafka Connect process that uses the HDFS
>> sink connector. The topic has around 1.6 million messages, and the Kafka
>> Connect invocation is as follows ..
>>
>>   bin/connect-standalone \
>>     etc/schema-registry/connect-avro-standalone.properties \
>>     etc/kafka-connect-hdfs/quickstart-hdfs.properties
>>
>> where quickstart-hdfs.properties contains the following ..
>>
>>   name=hdfs-sink
>>   connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
>>   tasks.max=1
>>   topics=avro-topic
>>   hdfs.url=hdfs://0.0.0.0:9000
>>   flush.size=3
>>
>> The problem is that the Kafka Connect process appears to be running in
>> an infinite loop, with messages like the following ..
>>
>>   [2017-07-18 20:02:04,487] INFO Starting commit and rotation for topic
>>   partition avro-topic-0 with start offsets {partition=0=1143033} and end
>>   offsets {partition=0=1143035}
>>   (io.confluent.connect.hdfs.TopicPartitionWriter:297)
>>   [2017-07-18 20:02:04,491] INFO Committed hdfs://0.0.0.0:9000/topics/avro-topic/partition=0/avro-topic+0+0001143033+0001143035.avro
>>   for avro-topic-0 (io.confluent.connect.hdfs.TopicPartitionWriter:625)
>>
>> The result is that so many Avro files get created that I cannot do an
>> ls on the folder.
>>
>>   $ hdfs dfs -ls /topics/avro-topic
>>   Found 1 items
>>   drwxr-xr-x   - debasishghosh supergroup          0 2017-07-18 20:02
>>   /topics/avro-topic/partition=0
>>
>> Trying to list one level deeper in the HDFS folder results in an
>> OutOfMemoryError ..
>>
>>   $ hdfs dfs -ls /topics/avro-topic/partition=0
>>   17/07/18 20:02:19 WARN util.NativeCodeLoader: Unable to load
>>   native-hadoop library for your platform... using builtin-java classes
>>   where applicable
>>   Exception in thread "main" java.lang.OutOfMemoryError: GC overhead
>>   limit exceeded
>>           at java.util.Arrays.copyOfRange(Arrays.java:3664)
>>           at java.lang.String.<init>(String.java:207)
>>           at java.lang.String.substring(String.java:1969)
>>           at java.net.URI$Parser.substring(URI.java:2869)
>>           at java.net.URI$Parser.parseHierarchical(URI.java:3106)
>>           ...
>>
>> Why is the Kafka Connect program going in an infinite loop? How can I
>> prevent it?
>>
>> I am using Confluent 3.2.2 for the schema registry and Avro
>> serialization, and Apache Kafka 0.10.2.1 for the Kafka Streams client
>> and the broker.
>>
>> Help?
>>
>> regards.
>>
>> --
>> Debasish Ghosh
>> http://manning.com/ghosh2
>> http://manning.com/ghosh
>>
>> Twttr: @debasishg
>> Blog: http://debasishg.blogspot.com
>> Code: http://github.com/debasishg
>
> --
> Abdoulaye Diallo

--
Abdoulaye Diallo
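P.S. While the directory still holds that many files, a plain ls may keep
hitting the GC overhead limit. Two standard Hadoop CLI workarounds (a
sketch, assuming default client settings; the paths are the ones from the
mail above):

  # count directories/files/bytes; the summary is computed on the
  # NameNode, so the client never has to hold the full listing
  hdfs dfs -count /topics/avro-topic/partition=0

  # or give the client JVM a bigger heap just for the listing
  HADOOP_CLIENT_OPTS="-Xmx4g" hdfs dfs -ls /topics/avro-topic/partition=0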