We run on SSD ephemeral storage. You can see from the timestamps below how long the replay ended up taking:
20 Aug 2014 04:05:20,520 INFO [lifecycleSupervisor-1-1] (org.apache.flume.channel.file.Log.replay:348) - Replay started
20 Aug 2014 04:05:20,532 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.Log.replay:348) - Replay started
20 Aug 2014 04:05:20,534 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.Log.replay:360) - Found NextFileID 0, from []
20 Aug 2014 04:05:20,544 INFO [lifecycleSupervisor-1-1] (org.apache.flume.channel.file.Log.replay:360) - Found NextFileID 619, from [/opt/flume/trdbuy-bid-req-ch1/data/log-613, /opt/flume/trdbuy-bid-req-ch1/data/log-615, /opt/flume/trdbuy-bid-req-ch1/data/log-618, /opt/flume/trdbuy-bid-req-ch1/data/log-616, /opt/flume/trdbuy-bid-req-ch1/data/log-617, /opt/flume/trdbuy-bid-req-ch1/data/log-612, /opt/flume/trdbuy-bid-req-ch1/data/log-619, /opt/flume/trdbuy-bid-req-ch1/data/log-614]
20 Aug 2014 04:05:20,552 INFO [lifecycleSupervisor-1-1] (org.apache.flume.channel.file.EventQueueBackingStoreFileV3.<init>:53) - Starting up with /opt/flume/trdbuy-bid-req-ch1/checkpoint/checkpoint and /opt/flume/trdbuy-bid-req-ch1/checkpoint/checkpoint.meta
20 Aug 2014 04:05:20,553 INFO [lifecycleSupervisor-1-1] (org.apache.flume.channel.file.EventQueueBackingStoreFileV3.<init>:57) - Reading checkpoint metadata from /opt/flume/trdbuy-bid-req-ch1/checkpoint/checkpoint.meta
20 Aug 2014 04:36:36,024 INFO [lifecycleSupervisor-1-1] (org.apache.flume.channel.file.FlumeEventQueue.<init>:114) - QueueSet population inserting 22678800 took 1874849

---------- Forwarded message ----------
From: Hari Shreedharan <hshreedha...@cloudera.com>
Date: Wed, Aug 20, 2014 at 12:42 AM
Subject: Re: FileChannel Replays consistently take a long time
To: "user@flume.apache.org" <user@flume.apache.org>

Are you running on EBS or ephemeral storage? I have seen IO being slow on AWS when EBS with provisioned IO is not used. This might be what you are seeing. Also, what do you see as the checkpoint size when the channel starts up?

On Tue, Aug 19, 2014 at 9:18 PM, Gary Malouf <malouf.g...@gmail.com> wrote:

> To be clear, we have this load handled across 3 EC2 instances running
> Flume, so each individually we are asking to handle 3.3k (5k). With 16GB of
> data in the channel, I would have expected the replay to be faster.
>
>
> On Wed, Aug 20, 2014 at 12:12 AM, Gary Malouf <malouf.g...@gmail.com> wrote:
>
>> Our capacity setting is:
>>
>> agent-1.channels.trdbuy-bid-req-ch1.capacity = 100000000
>>
>> Our current channel size cannot be accessed because it is still in this
>> odd 'replay' mode. There are no logs, but the CPU is cranking on the flume
>> node and the avro source ports have not yet opened. The pattern we see is
>> that after anywhere from 15-30 minutes, the ports magically open and we
>> can continue.
>>
>> This is because we are logging around 10k messages/second and did not
>> want to lose any data during brief interruptions.
>>
>> On Wed, Aug 20, 2014 at 12:02 AM, Hari Shreedharan
>> <hshreedha...@cloudera.com> wrote:
>>
>>> How large is your channel (and how long does it take to replay?)
>>>
>>> Gary Malouf wrote:
>>>
>>> For the record, we are using Flume 1.4.0 packaged with CDH5.0.2
>>>
>>> On Tue, Aug 19, 2014 at 11:55 PM, Gary Malouf <malouf.g...@gmail.com>
>>> wrote:
>>>
>>> We are repeatedly running into cases where the replays from a
>>> file channel going to HDFS take an eternity.
>>>
>>> I've read this thread
>>> <http://mail-archives.apache.org/mod_mbox/flume-dev/201306.mbox/%3ccahbpyvbmed6pkzkdadmyaw_gc_p7cqdefpsycwknky72tfi...@mail.gmail.com%3E>,
>>> but I just am not convinced that our checkpoints are constantly
>>> being corrupted.
>>>
>>> We are seeing messages such as:
>>>
>>> 20 Aug 2014 03:52:26,849 INFO [lifecycleSupervisor-1-2]
>>> (org.apache.flume.channel.file.EventQueueBackingStoreFileV3.<init>:57)
>>> - Reading checkpoint metadata from
>>> /opt/flume/brq/ch1/checkpoint/checkpoint.meta
>>>
>>> How can it be that this takes so long?
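For reference, the QueueSet population line in the startup log above works out to 22,678,800 queue entries loaded in 1,874,849 ms, i.e. about 31 minutes at roughly 12,000 entries/second, which matches the gap between the 04:05:20 and 04:36:36 timestamps. A sketch of what the channel definition being discussed might look like is below; the property names come from the Flume 1.4.0 file channel documentation, the channel name, paths, and capacity are taken from the log and config line quoted in the thread, and the dual-checkpoint settings are an assumption on my part (the options exist in Flume 1.4.0, but verify against the user guide for your CDH build), included because keeping a backup checkpoint is the usual way to avoid a full data-log replay when the primary checkpoint is missing or corrupt:

agent-1.channels = trdbuy-bid-req-ch1
agent-1.channels.trdbuy-bid-req-ch1.type = file
agent-1.channels.trdbuy-bid-req-ch1.checkpointDir = /opt/flume/trdbuy-bid-req-ch1/checkpoint
agent-1.channels.trdbuy-bid-req-ch1.dataDirs = /opt/flume/trdbuy-bid-req-ch1/data
agent-1.channels.trdbuy-bid-req-ch1.capacity = 100000000
# Assumed mitigation, not taken from the thread: maintain a second checkpoint so a
# bad or missing primary checkpoint does not force a full data-log replay (Flume 1.4.0+).
agent-1.channels.trdbuy-bid-req-ch1.useDualCheckpoints = true
agent-1.channels.trdbuy-bid-req-ch1.backupCheckpointDir = /opt/flume/trdbuy-bid-req-ch1/checkpoint-backup

On Hari's checkpoint-size question: as I understand it, the checkpoint file is preallocated in proportion to the configured capacity (on the order of 8 bytes per slot, so a capacity of 100,000,000 means a checkpoint file in the neighborhood of 800 MB), while the QueueSet population step that dominates the log above scales with the number of events actually sitting in the channel, 22.6 million here. Both the very large capacity and the deep backlog would therefore contribute to slow startups.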