We have been experimenting with Samza, which is also worth a look. It's
basically a topic-to-topic processing node on YARN.
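For anyone curious what that looks like in practice, here is a rough sketch
of a Samza task that just copies messages from its input stream to an output
topic. The class name and output topic below are made up, and the input
stream is wired up in the job's config file, not in code:

    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskCoordinator;

    // Samza calls process() once per message on the input stream;
    // this task simply forwards each message to a Kafka output topic.
    public class PassthroughTask implements StreamTask {
      private static final SystemStream OUTPUT =
          new SystemStream("kafka", "output-topic");  // placeholder topic

      public void process(IncomingMessageEnvelope envelope,
                          MessageCollector collector,
                          TaskCoordinator coordinator) {
        collector.send(new OutgoingMessageEnvelope(OUTPUT, envelope.getMessage()));
      }
    }

Samza runs one instance of the task per input partition on YARN, which is
where the "topic-to-topic node" framing comes from.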
On Jun 17, 2014, at 10:44 AM, hsy...@gmail.com wrote:

> Hi Shaikh,
>
> I have heard about throughput bottlenecks with Storm; it cannot really
> scale up with Kafka.
> I recommend you try the DataTorrent platform (https://www.datatorrent.com/).
>
> The platform itself is not open source, but it has an open-source library
> (https://github.com/DataTorrent/Malhar) which contains Kafka ingestion
> functions.
> The library is pretty cool: it can scale up dynamically with Kafka
> partitions and is fully HA.
>
> And in your case you might be able to use the platform for free. (It's
> free if your application doesn't require a large amount of memory.)
>
> With the DataTorrent platform and the open-source library I can scale my
> application up to 300k msg/s (10 nodes, 3 replicas, 1 KB messages, 0.8.0
> client).
> I heard the performance of the Kafka client has been improved in the
> 0.8.1 release :)
>
> Best,
> Siyuan
>
>
> On Sat, Jun 14, 2014 at 8:14 AM, Shaikh Ahmed <rnsr.sha...@gmail.com> wrote:
>
>> Hi,
>>
>> We download 28 million messages daily, and monthly that grows to 800+
>> million.
>>
>> We want to push this amount of data through our Kafka and Storm clusters
>> and store the results in an HBase cluster.
>>
>> We are aiming to process one month of data in one day. Is that possible?
>>
>> We set up our cluster expecting to process a million messages per second,
>> as mentioned on the web. Unfortunately, we have ended up processing only
>> 1,200-1,700 messages per second. If we continue at this speed, it will
>> take at least 10 days to process 30 days of data, which is not a viable
>> solution in our case.
>>
>> I suspect we have to change some configuration to achieve this goal, and
>> I am looking for help from experts to support me in this task.
>>
>> *Kafka Cluster:*
>> Kafka is running on two dedicated machines with 48 GB of RAM and 2 TB of
>> storage each. We have an 11-broker Kafka cluster spread across these two
>> servers.
>>
>> *Kafka Configuration:*
>> producer.type=async
>> compression.codec=none
>> request.required.acks=-1
>> serializer.class=kafka.serializer.StringEncoder
>> queue.buffering.max.ms=100000
>> batch.num.messages=10000
>> queue.buffering.max.messages=100000
>> default.replication.factor=3
>> controlled.shutdown.enable=true
>> auto.leader.rebalance.enable=true
>> num.network.threads=2
>> num.io.threads=8
>> num.partitions=4
>> log.retention.hours=12
>> log.segment.bytes=536870912
>> log.retention.check.interval.ms=60000
>> log.cleaner.enable=false
>>
>> *Storm Cluster:*
>> Storm is running with 5 supervisors and 1 Nimbus on IBM servers with
>> 48 GB of RAM and 8 TB of storage. These servers are shared with the
>> HBase cluster.
>>
>> *Kafka spout configuration:*
>> kafkaConfig.bufferSizeBytes = 1024*1024*8;
>> kafkaConfig.fetchSizeBytes = 1024*1024*4;
>> kafkaConfig.forceFromStart = true;
>>
>> *Topology: StormTopology*
>> Spout - partitions: 4
>> First Bolt - parallelism hint: 6, num tasks: 5
>> Second Bolt - parallelism hint: 5
>> Third Bolt - parallelism hint: 3
>> Fourth Bolt - parallelism hint: 3, num tasks: 4
>> Fifth Bolt - parallelism hint: 3
>> Sixth Bolt - parallelism hint: 3
>>
>> *Supervisor configuration:*
>>
>> storm.local.dir: "/app/storm"
>> storm.zookeeper.port: 2181
>> storm.cluster.mode: "distributed"
>> storm.local.mode.zmq: false
>> supervisor.slots.ports:
>>     - 6700
>>     - 6701
>>     - 6702
>>     - 6703
>> supervisor.worker.start.timeout.secs: 180
>> supervisor.worker.timeout.secs: 30
>> supervisor.monitor.frequency.secs: 3
>> supervisor.heartbeat.frequency.secs: 5
>> supervisor.enable: true
>>
>> storm.messaging.netty.server_worker_threads: 2
>> storm.messaging.netty.client_worker_threads: 2
>> storm.messaging.netty.buffer_size: 52428800 # 50 MB buffer
>> storm.messaging.netty.max_retries: 25
>> storm.messaging.netty.max_wait_ms: 1000
>> storm.messaging.netty.min_wait_ms: 100
>>
>> supervisor.childopts: "-Xmx1024m -Djava.net.preferIPv4Stack=true"
>> worker.childopts: "-Xmx2048m -Djava.net.preferIPv4Stack=true"
>>
>> Please let me know if more information is needed.
>>
>> Thanks in advance.
>>
>> Regards,
>> Riyaz
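A few observations on the numbers quoted above. The stated goal of 800+
million messages in a day works out to roughly 9,000-10,000 msg/s sustained
(800,000,000 / 86,400 ≈ 9,260), so the gap from 1,200-1,700 msg/s is about
6x. For reference, here is a minimal sketch of what the quoted producer
settings look like against the 0.8 producer API (the broker list and topic
are placeholders):

    import java.util.Properties;
    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class AsyncProducerExample {
      public static void main(String[] args) {
        Properties props = new Properties();
        props.put("metadata.broker.list", "broker1:9092,broker2:9092"); // placeholder hosts
        props.put("producer.type", "async");
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        props.put("request.required.acks", "-1");  // wait for all in-sync replicas
        props.put("compression.codec", "none");
        props.put("batch.num.messages", "10000");
        props.put("queue.buffering.max.ms", "100000");
        props.put("queue.buffering.max.messages", "100000");

        Producer<String, String> producer =
            new Producer<String, String>(new ProducerConfig(props));
        producer.send(new KeyedMessage<String, String>("test-topic", "hello"));
        producer.close();
      }
    }

Note that request.required.acks=-1 makes every request wait for all in-sync
replicas, and with default.replication.factor=3 that is often the first
throughput ceiling people hit; acks=1 (leader only) is a common middle
ground if you can tolerate a small window of risk on leader failover.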
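On the Storm side, the quoted spout settings are fields on storm-kafka's
SpoutConfig. A minimal sketch of how they are typically wired up (the
ZooKeeper connect string, topic, zkRoot, and consumer id are placeholders):

    import backtype.storm.spout.SchemeAsMultiScheme;
    import storm.kafka.KafkaSpout;
    import storm.kafka.SpoutConfig;
    import storm.kafka.StringScheme;
    import storm.kafka.ZkHosts;

    ZkHosts hosts = new ZkHosts("zk1:2181,zk2:2181");  // placeholder ZK quorum
    SpoutConfig kafkaConfig =
        new SpoutConfig(hosts, "test-topic", "/kafka-spout", "spout-id");
    kafkaConfig.bufferSizeBytes = 1024 * 1024 * 8;   // 8 MB consumer buffer
    kafkaConfig.fetchSizeBytes  = 1024 * 1024 * 4;   // 4 MB per fetch request
    kafkaConfig.forceFromStart  = true;              // always rewind to earliest offset
    kafkaConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
    KafkaSpout spout = new KafkaSpout(kafkaConfig);

One thing to double-check: forceFromStart = true makes the spout ignore its
committed offsets and re-read the topic from the beginning every time the
topology is resubmitted, so you end up reprocessing old data on every
redeploy.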
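And, continuing from the spout above, a sketch of how the quoted parallelism
numbers translate into a topology definition (FirstBolt and SecondBolt stand
in for whatever your six bolts actually are):

    import backtype.storm.topology.TopologyBuilder;

    TopologyBuilder builder = new TopologyBuilder();
    // Spout parallelism matches the 4 Kafka partitions; extra spout
    // executors beyond the partition count would just sit idle.
    builder.setSpout("kafka-spout", spout, 4);
    builder.setBolt("first-bolt", new FirstBolt(), 6).setNumTasks(5)
           .shuffleGrouping("kafka-spout");
    builder.setBolt("second-bolt", new SecondBolt(), 5)
           .shuffleGrouping("first-bolt");
    // ...and so on for the remaining bolts.

Two things stand out: a parallelism hint of 6 with only 5 tasks still gives
you at most 5 executors, since Storm never runs more executors than tasks;
and with the spout capped at 4 partitions, adding Kafka partitions (and
spout executors to match) is usually the first lever for raising ingest
throughput.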