Hello all,

My setup:
- Flume 1.4
- CDH 4.2.2 (2.0.0-cdh4.2.2)
I am testing a simple Flume setup with a Sequence Generator Source, a File Channel, and an HDFS Sink (see my flume.conf below). This configuration works as expected until I reboot the cluster's NameNode or restart the HDFS service on the cluster. At that point the Flume agent apparently cannot reconnect to HDFS and must be restarted manually. Since NameNode restarts are not uncommon in our production cluster, it is important that Flume can reconnect gracefully without any manual intervention. So, how do we fix this HDFS reconnection issue?

Here is our flume.conf:

appserver.sources = rawtext
appserver.channels = testchannel
appserver.sinks = test_sink

appserver.sources.rawtext.type = seq
appserver.sources.rawtext.channels = testchannel

appserver.channels.testchannel.type = file
appserver.channels.testchannel.capacity = 10000000
appserver.channels.testchannel.minimumRequiredSpace = 214748364800
appserver.channels.testchannel.checkpointDir = /Users/aoneill/Desktop/testchannel/checkpoint
appserver.channels.testchannel.dataDirs = /Users/aoneill/Desktop/testchannel/data
appserver.channels.testchannel.maxFileSize = 20000000

appserver.sinks.test_sink.type = hdfs
appserver.sinks.test_sink.channel = testchannel
appserver.sinks.test_sink.hdfs.path = hdfs://cluster01:8020/user/aoneill/flumetest
appserver.sinks.test_sink.hdfs.closeTries = 3
appserver.sinks.test_sink.hdfs.filePrefix = events-
appserver.sinks.test_sink.hdfs.fileSuffix = .avro
appserver.sinks.test_sink.hdfs.fileType = DataStream
appserver.sinks.test_sink.hdfs.writeFormat = Text
appserver.sinks.test_sink.hdfs.inUsePrefix = inuse-
appserver.sinks.test_sink.hdfs.inUseSuffix = .avro
appserver.sinks.test_sink.hdfs.rollCount = 100000
appserver.sinks.test_sink.hdfs.rollInterval = 30
appserver.sinks.test_sink.hdfs.rollSize = 10485760
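For context, this is roughly how I start the agent and reproduce the failure (the conf directory, logger option, and service command below are from my environment, so treat this as a sketch rather than an exact recipe):

# Start the agent; --name must match the "appserver" prefix in flume.conf
flume-ng agent --conf /etc/flume-ng/conf --conf-file flume.conf --name appserver -Dflume.root.logger=INFO,console

# While events are flowing, restart HDFS on the cluster, e.g. via Cloudera
# Manager or with the packaged init script:
sudo service hadoop-hdfs-namenode restart

Once HDFS is back up, the agent never resumes writing to hdfs://cluster01:8020/user/aoneill/flumetest; it just keeps logging the errors below until I restart it by hand.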
These are the two error messages that the Flume agent outputs constantly after the restart:

2014-08-26 10:47:24,572 (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR - org.apache.flume.sink.hdfs.AbstractHDFSWriter.isUnderReplicated(AbstractHDFSWriter.java:96)] Unexpected error while checking replication factor
java.lang.reflect.InvocationTargetException
    at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.flume.sink.hdfs.AbstractHDFSWriter.getNumCurrentReplicas(AbstractHDFSWriter.java:162)
    at org.apache.flume.sink.hdfs.AbstractHDFSWriter.isUnderReplicated(AbstractHDFSWriter.java:82)
    at org.apache.flume.sink.hdfs.BucketWriter.shouldRotate(BucketWriter.java:452)
    at org.apache.flume.sink.hdfs.BucketWriter.append(BucketWriter.java:387)
    at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:392)
    at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
    at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
    at java.lang.Thread.run(Thread.java:744)
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:735)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:207)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:525)
    at org.apache.hadoop.hdfs.DFSOutputStream.createSocketForPipeline(DFSOutputStream.java:1253)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:891)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:881)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:982)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:779)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:448)

and

2014-08-26 10:47:29,592 (SinkRunner-PollingRunner-DefaultSinkProcessor) [WARN - org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:418)] HDFS IO error
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:735)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:207)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:525)
    at org.apache.hadoop.hdfs.DFSOutputStream.createSocketForPipeline(DFSOutputStream.java:1253)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:891)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:881)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:982)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:779)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:448)

I can provide additional information if needed. Thank you very much for any insight you are able to provide into this problem.

Best,
Andrew