Hi Hari,

I wanted to provide more details on the issues we are seeing, as there may be better actions we can take.
As mentioned earlier, we are on the Flume 1.4 release that is packaged with the CDH 5.0.2 distribution. We've been in production with Flume for well over a year, but just in the last month or two we have seen more significant issues as our traffic has ramped up.

We have 2 sets of Flume 'clusters/configurations' - each instance has an agent with 2 or more source-channel-sink combinations. We use file channels for all of them as they allow us to buffer when downstream issues occur (usually when replacing HDFS nodes - more on this at the bottom). All of the channels use capacities of 100000000 items (again, using Flume as a buffer during downtime). For reference, we have a 12 node HDFS cluster that these are writing to, with a replication factor of 3.

An example channel configuration:

agent-1.channels.trdbuy-bid-req-ch1.type = file
agent-1.channels.trdbuy-bid-req-ch1.checkpointDir = /opt/flume/trdbuy-req-ch1/checkpoint
agent-1.channels.trdbuy-bid-req-ch1.dataDirs = /opt/flume/trdbuy-req-ch1/data
agent-1.channels.trdbuy-bid-req-ch1.capacity = 100000000

We use HDFS sinks for everything; an example configuration:

agent-1.sinks.hdfs-trdbuy-bid-req-sink1.channel = trdbuy-bid-req-ch1
agent-1.sinks.hdfs-trdbuy-bid-req-sink1.type = hdfs
agent-1.sinks.hdfs-trdbuy-bid-req-sink1.hdfs.path = hdfs://nn-01:8020/flume/trading/buying/bidrequests/yr=%Y/mo=%m/d=%d/
agent-1.sinks.hdfs-trdbuy-bid-req-sink1.hdfs.filePrefix = %{host}s1
agent-1.sinks.hdfs-trdbuy-bid-req-sink1.hdfs.batchSize = 10000
agent-1.sinks.hdfs-trdbuy-bid-req-sink1.hdfs.rollInterval = 3600
agent-1.sinks.hdfs-trdbuy-bid-req-sink1.hdfs.rollCount = 0
agent-1.sinks.hdfs-trdbuy-bid-req-sink1.hdfs.rollSize = 0
agent-1.sinks.hdfs-trdbuy-bid-req-sink1.hdfs.idleTimeout = 1800

We are running into a few distinct problems:

1) Losing a single datanode causes the Flume instances that are under higher volume (approximately 6k messages/second) to be unable to write (the others seem to continue to hum along). They do not seem to recover until we restart the instance. I think this exception is related:

20 Aug 2014 01:16:23,942 ERROR [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.hdfs.AbstractHDFSWriter.isUnderReplicated:82) - Unexpected error while checking replication factor
java.lang.reflect.InvocationTargetException
        at sun.reflect.GeneratedMethodAccessor51.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:601)
        at org.apache.flume.sink.hdfs.AbstractHDFSWriter.getNumCurrentReplicas(AbstractHDFSWriter.java:147)
        at org.apache.flume.sink.hdfs.AbstractHDFSWriter.isUnderReplicated(AbstractHDFSWriter.java:68)
        at org.apache.flume.sink.hdfs.BucketWriter.shouldRotate(BucketWriter.java:505)
        at org.apache.flume.sink.hdfs.BucketWriter.append(BucketWriter.java:440)
        at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:401)
        at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
        at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
        at java.lang.Thread.run(Thread.java:722)
Caused by: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[10.9.178.151:50010, 10.144.197.136:50010], original=[10.9.178.151:50010, 10.144.197.136:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:960)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1026)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1174)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)
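For what it's worth, we have not tried changing the replacement policy the exception calls out. Our understanding (an assumption on our part, not something we have verified) is that it is a client-side HDFS setting, so it would live in the hdfs-site.xml that the Flume agents pick up on their classpath, roughly like this:

<!-- hdfs-site.xml on the Flume hosts - sketch only, untested on our side.
     Property names are taken from the exception text; values are just examples. -->
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
  <value>true</value>
</property>
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <!-- DEFAULT is what we appear to be running now; NEVER and ALWAYS are the other documented values -->
  <value>DEFAULT</value>
</property>

Is tuning this policy a direction you would recommend here, or is the more fundamental problem that the sink never recovers once the pipeline fails?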
2) Stopping a namenode (as expected) causes the Flume instances to stop writing as well. When the namenode comes back up, Flume does not recover either; it again needs a restart.

3) After restarting Flume in the case of either (1) or (2), it seems to take an eternity (anywhere from 5-30 minutes, depending on the amount of data in the channel) for the source ports to open for listening. During this time a single CPU on the machine is cranking but no data gets copied to HDFS. Once the ports finally open, the data built up in the channel starts to copy and the catch-up process begins. This last point has caused us to consider other logging solutions, as our system is stuck backed up while waiting for Flume to sort itself out. The embedded agents (on the client/writer side) have protected us to date, but they too can be subject to the same backup issues with file channels.
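On (3) specifically: if the "QueueSet population inserting 22678800 took 1874849" line in the replay log below is in milliseconds (an assumption on our part), that is a bit over 31 minutes just to rebuild the queue, which lines up with the waits we see. One thing we have been reading about, but have not tried (and are not even sure applies to the 1.4 build we are on), is the file channel's backup/dual checkpoint support in newer releases. Our assumption is the configuration would look roughly like the following - property names are our best reading of the Flume user guide, and the backup directory path is just a placeholder - so please correct us if this is the wrong direction:

agent-1.channels.trdbuy-bid-req-ch1.useDualCheckpoints = true
agent-1.channels.trdbuy-bid-req-ch1.backupCheckpointDir = /opt/flume/trdbuy-req-ch1/checkpoint-backup

As we understand it, the idea is that a usable backup checkpoint lets the channel avoid a full replay of the data files on restart. Would that actually help in the failure cases above, or is something else forcing the full replay?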
Thank you for any help with this; if nothing else, it documents for other potential users the problems we have been seeing.

Gary

On Wed, Aug 20, 2014 at 12:50 AM, Gary Malouf <malouf.g...@gmail.com> wrote:
> We run on ssd ephemeral storage.
>
> You can see from the timestamps below how long the replay ended up taking.
>
> 20 Aug 2014 04:05:20,520 INFO [lifecycleSupervisor-1-1] (org.apache.flume.channel.file.Log.replay:348) - Replay started
>
> 20 Aug 2014 04:05:20,532 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.Log.replay:348) - Replay started
>
> 20 Aug 2014 04:05:20,534 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.Log.replay:360) - Found NextFileID 0, from []
>
> 20 Aug 2014 04:05:20,544 INFO [lifecycleSupervisor-1-1] (org.apache.flume.channel.file.Log.replay:360) - Found NextFileID 619, from [/opt/flume/trdbuy-bid-req-ch1/data/log-613, /opt/flume/trdbuy-bid-req-ch1/data/log-615, /opt/flume/trdbuy-bid-req-ch1/data/log-618, /opt/flume/trdbuy-bid-req-ch1/data/log-616, /opt/flume/trdbuy-bid-req-ch1/data/log-617, /opt/flume/trdbuy-bid-req-ch1/data/log-612, /opt/flume/trdbuy-bid-req-ch1/data/log-619, /opt/flume/trdbuy-bid-req-ch1/data/log-614]
>
> 20 Aug 2014 04:05:20,552 INFO [lifecycleSupervisor-1-1] (org.apache.flume.channel.file.EventQueueBackingStoreFileV3.<init>:53) - Starting up with /opt/flume/trdbuy-bid-req-ch1/checkpoint/checkpoint and /opt/flume/trdbuy-bid-req-ch1/checkpoint/checkpoint.meta
>
> 20 Aug 2014 04:05:20,553 INFO [lifecycleSupervisor-1-1] (org.apache.flume.channel.file.EventQueueBackingStoreFileV3.<init>:57) - Reading checkpoint metadata from /opt/flume/trdbuy-bid-req-ch1/checkpoint/checkpoint.meta
>
> 20 Aug 2014 04:36:36,024 INFO [lifecycleSupervisor-1-1] (org.apache.flume.channel.file.FlumeEventQueue.<init>:114) - QueueSet population inserting 22678800 took 1874849
>
>
> ---------- Forwarded message ----------
> From: Hari Shreedharan <hshreedha...@cloudera.com>
> Date: Wed, Aug 20, 2014 at 12:42 AM
> Subject: Re: FileChannel Replays consistently take a long time
> To: "user@flume.apache.org" <user@flume.apache.org>
>
>
> Are you running on EBS or ephemeral storage? I have seen IO being slow on AWS when EBS with provisioned IO is not used. This might be what you are seeing.
>
> Also, what do you see as the checkpoint size when the channel starts up?
>
>
> On Tue, Aug 19, 2014 at 9:18 PM, Gary Malouf <malouf.g...@gmail.com> wrote:
>
>> To be clear, we have this load handled across 3 EC2 instances running Flume, so each one individually is being asked to handle 3.3k (5k). With 16GB of data in the channel, I would have expected the replay to be faster.
>>
>>
>> On Wed, Aug 20, 2014 at 12:12 AM, Gary Malouf <malouf.g...@gmail.com> wrote:
>>
>>> Our capacity setting is:
>>>
>>> agent-1.channels.trdbuy-bid-req-ch1.capacity = 100000000
>>>
>>> Our current channel size cannot be accessed because it is still in this odd 'replay' mode. There are no logs, but the CPU is cranking on the Flume node and the avro source ports have not yet opened. The pattern we see is that after anywhere from 15-30 minutes, the ports magically open and we can continue.
>>>
>>> This is because we are logging around 10k messages/second and did not want to lose any data during brief interruptions.
>>>
>>> On Wed, Aug 20, 2014 at 12:02 AM, Hari Shreedharan <hshreedha...@cloudera.com> wrote:
>>>
>>>> How large is your channel (and how long does it take to replay?)
>>>>
>>>> Gary Malouf wrote:
>>>>
>>>> For the record, we are using Flume 1.4.0 packaged with CDH5.0.2
>>>>
>>>> On Tue, Aug 19, 2014 at 11:55 PM, Gary Malouf <malouf.g...@gmail.com> wrote:
>>>>
>>>> We are repeatedly running into cases where the replays from a file channel going to HDFS take an eternity.
>>>>
>>>> I've read this thread
>>>> <http://mail-archives.apache.org/mod_mbox/flume-dev/201306.mbox/%3ccahbpyvbmed6pkzkdadmyaw_gc_p7cqdefpsycwknky72tfi...@mail.gmail.com%3E>,
>>>> but I just am not convinced that our checkpoints are constantly being corrupted.
>>>>
>>>> We are seeing messages such as:
>>>>
>>>> 20 Aug 2014 03:52:26,849 INFO [lifecycleSupervisor-1-2] (org.apache.flume.channel.file.EventQueueBackingStoreFileV3.<init>:57) - Reading checkpoint metadata from /opt/flume/brq/ch1/checkpoint/checkpoint.meta
>>>>
>>>> How can it be that this takes so long?