Alright, I went through a couple of combinations, and none worked flawlessly. It baffled me that there seems to be no way to get Flume working with HDFS unless both come from the Cloudera distribution. So this afternoon I launched a fresh Ubuntu Precise (12.04) instance and started over with Cloudera. Here is the combination that seems to be working in pseudo-distributed mode (uses CDH4):
1. Hadoop 2.0.0-cdh4.1.1: follow the instructions, without skipping anything, from here -- https://ccp.cloudera.com/display/CDH4DOC/Installing+CDH4+on+a+Single+Linux+Node+in+Pseudo-distributed+Mode#InstallingCDH4onaSingleLinuxNodeinPseudo-distributedMode-InstallingCDH4withYARNonaSingleLinuxNodeinPseudodistributedmode

2. Flume 1.2.0-cdh4.1.1: in step #1 you have already added the Cloudera apt repo, so start from here -- https://ccp.cloudera.com/display/CDH4DOC/Flume+Installation#FlumeInstallation-InstallingtheFlumeRPMorDebianPackages

Config files go under /etc/hadoop/conf and /etc/flume-ng/conf.

This combination works as expected. So, expect:

1. When you set

   hdfs.rollSize = 0
   hdfs.rollInterval = 0
   hdfs.rollCount = 0

   you get a zero-byte .tmp file in the HDFS UI until you kill Flume. So you CANNOT aggregate logs from all the app servers into one file and tail/watch it in the UI. It would have been great if Flume understood that when all the roll* settings are zero, the user does not want to roll the file at all -- that is, do not create a .tmp file, and keep flushing data into the final file based on the hdfs.batchSize setting.

2. The good news is that if I kill Flume, the .tmp extension is removed and the UI shows the populated file.

So, next up is Pig. Let's see how that goes.

Thanks for the responses.

Nishant
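For anyone following along, here is a minimal sketch of the HDFS-sink portion of a flume.conf with the roll settings described above. The agent, sink, and channel names (agent1, hdfsSink, memChannel) and the HDFS path are placeholders I made up; only the hdfs.* property names come from the Flume NG HDFS sink:

```properties
# Hypothetical names: agent1 / hdfsSink / memChannel are placeholders.
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.channel = memChannel
agent1.sinks.hdfsSink.hdfs.path = hdfs://localhost:8020/flume/events

# All roll* settings set to zero -- rolling is disabled, but (as noted
# above) Flume still writes into a .tmp file until the agent is stopped.
agent1.sinks.hdfsSink.hdfs.rollSize = 0
agent1.sinks.hdfsSink.hdfs.rollInterval = 0
agent1.sinks.hdfsSink.hdfs.rollCount = 0

# Events are flushed to HDFS in batches of this size.
agent1.sinks.hdfsSink.hdfs.batchSize = 100
```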
