To be clear, we have this load spread across 3 EC2 instances running Flume, so each one is individually being asked to handle about 3.3k (5k) messages/second. With 16GB of data in the channel, I would have expected the replay to be faster.
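For what it's worth, here is roughly what the channel definition looks like with the replay-related settings spelled out (the capacity and checkpoint path are the ones from below; the dataDirs/backup paths are illustrative, and use-dual-checkpoints/backupCheckpointDir may only be available in releases newer than the 1.4.0 build we are on):

agent-1.channels.trdbuy-bid-req-ch1.type = file
agent-1.channels.trdbuy-bid-req-ch1.capacity = 100000000
agent-1.channels.trdbuy-bid-req-ch1.checkpointDir = /opt/flume/brq/ch1/checkpoint
# illustrative path for the data files that get replayed on restart
agent-1.channels.trdbuy-bid-req-ch1.dataDirs = /opt/flume/brq/ch1/data
# 30000 ms is the default; only events written after the last good checkpoint need replaying
agent-1.channels.trdbuy-bid-req-ch1.checkpointInterval = 30000
# keep a backup copy of the checkpoint so a corrupt one does not force a full replay
# (these two may require a newer Flume release than 1.4.0)
agent-1.channels.trdbuy-bid-req-ch1.use-dual-checkpoints = true
agent-1.channels.trdbuy-bid-req-ch1.backupCheckpointDir = /opt/flume/brq/ch1/backup-checkpoint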
On Wed, Aug 20, 2014 at 12:12 AM, Gary Malouf <malouf.g...@gmail.com> wrote:

> Our capacity setting is:
>
> agent-1.channels.trdbuy-bid-req-ch1.capacity = 100000000
>
> Our current channel size cannot be accessed because it is still in this
> odd 'replay' mode. There are no logs, but the CPU is cranking on the flume
> node and the avro source ports have not yet opened. The pattern we see is
> that after anywhere from 15-30 minutes, the ports magically open and we can
> continue.
>
> This is because we are logging around 10k messages/second and did not want
> to lose any data during brief interruptions.
>
> On Wed, Aug 20, 2014 at 12:02 AM, Hari Shreedharan <hshreedha...@cloudera.com> wrote:
>
>> How large is your channel (and how long does it take to replay?)
>>
>> Gary Malouf wrote:
>>
>> For the record, we are using Flume 1.4.0 packaged with CDH5.0.2
>>
>> On Tue, Aug 19, 2014 at 11:55 PM, Gary Malouf <malouf.g...@gmail.com <mailto:malouf.g...@gmail.com>> wrote:
>>
>> We are repeatedly running into cases where the replays of a file
>> channel going to HDFS take an eternity.
>>
>> I've read this thread
>> <http://mail-archives.apache.org/mod_mbox/flume-dev/201306.mbox/%3ccahbpyvbmed6pkzkdadmyaw_gc_p7cqdefpsycwknky72tfi...@mail.gmail.com%3E>,
>> but I just am not convinced that our checkpoints are constantly
>> being corrupted.
>>
>> We are seeing messages such as:
>>
>> 20 Aug 2014 03:52:26,849 INFO [lifecycleSupervisor-1-2]
>> (org.apache.flume.channel.file.EventQueueBackingStoreFileV3.<init>:57)
>> - Reading checkpoint metadata from
>> /opt/flume/brq/ch1/checkpoint/checkpoint.meta
>>
>> How can it be that this takes so long?
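As an aside, since the channel size cannot be inspected while the replay is running, it may help to start the agents with the built-in JSON metrics reporter so the channel counters are at least visible once the agent is up; a rough sketch (the port is arbitrary and the agent/config names are just what we use above):

# start the agent with the HTTP/JSON metrics reporter enabled
flume-ng agent -n agent-1 -c conf -f flume.conf \
  -Dflume.monitoring.type=http -Dflume.monitoring.port=41414

# then poll the per-channel counters (ChannelSize, ChannelFillPercentage, etc.)
curl http://localhost:41414/metrics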