Hi, a good place to start with Flume: http://mapredit.blogspot.com/2011/10/centralized-logfile-management-across.html
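That post covers a tail-to-HDFS pipeline. As a very rough sketch, a Flume NG (1.x) agent along these lines tails a log file and writes gzip-compressed files into hourly HDFS directories; note the post may describe the older 0.9.x generation, and all names and paths below are placeholders, not taken from the post:

    # start with: flume-ng agent --conf conf --conf-file logs.conf --name a1
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # tail the application log (exec source: simple, but not restart-safe)
    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /var/log/app/app.log
    a1.sources.r1.channels = c1

    # buffer events in memory between source and sink
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 10000

    # write gzip-compressed files into hourly HDFS directories
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path = hdfs://namenode/flume/logs/%Y%m%d/%H
    a1.sinks.k1.hdfs.fileType = CompressedStream
    a1.sinks.k1.hdfs.codeC = gzip
    a1.sinks.k1.hdfs.rollInterval = 300
    a1.sinks.k1.hdfs.useLocalTimeStamp = true

Because the agent delivers continuously, the data is already in HDFS when the hour closes, which avoids the top-of-the-hour copy burst described in the mail below.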
Facebook's Scribe could also work for you. Two rough sketches for the hourly-load approach follow below the quoted mail.

- Alex

--
Alexander Lorenz
http://mapredit.blogspot.com

On Feb 7, 2012, at 11:03 AM, Xiaobin She wrote:

> Hi all,
>
> Sorry if it is not appropriate to send one thread to two mailing lists.
>
> I'm trying to use Hadoop and Hive to do some log analysis jobs.
>
> Our system generates lots of logs every day. For example, it produced about
> 370GB of logs (spread across many log files) yesterday, and the volume
> increases every day.
>
> We want to use Hadoop and Hive to replace our old log analysis system.
>
> We distinguish our logs by logid. We have a log collector which collects
> logs from clients and then generates log files.
>
> For every logid there is one log file every hour, and for some logids this
> hourly log file can be 1-2GB.
>
> I have set up a test cluster with Hadoop and Hive, and I have run some
> tests, which look good for us.
>
> For reference, we will create one table in Hive for every logid, each
> partitioned by hour.
>
> Now I have a question: what is the best practice for loading log files
> into HDFS or the Hive warehouse dir?
>
> My first thought is, at the beginning of every hour, to compress the log
> file of the last hour for every logid and then use the hive cmd tool to
> load these compressed log files into HDFS,
>
> using commands like "LOAD DATA LOCAL INPATH '$logname' OVERWRITE INTO
> TABLE $tablename PARTITION (dt='$h')".
>
> I think this can work, and I have run some tests on our 3-node test
> cluster.
>
> But the problem is that there are lots of logids, which means there are
> lots of log files, so every hour we will have to load lots of files into
> HDFS.
>
> And there is another problem: we will run hourly analysis jobs on these
> hourly collected log files, which introduces a problem. Because there are
> lots of log files, if we load them all at the same time at the beginning
> of every hour, I think there will be a burst of network traffic and a
> data delivery latency problem.
>
> By data delivery latency I mean that it will take some time for the log
> files to be copied into HDFS, and this will cause our hourly log analysis
> jobs to start later.
>
> So I wanted to figure out whether we can write or append logs to a
> compressed file which is already located in HDFS. I posted a thread to
> the mailing list, and from what I have learned, this is not possible.
>
> So, what is the best practice for loading logs into HDFS while using Hive
> to do log analysis?
>
> Or what are the common methods to handle the problem I have described
> above?
>
> Can anyone give me some advice?
>
> Thank you very much for your help!
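For the per-logid, hourly-partitioned tables described in the quoted mail, a minimal table definition might look like this. The table name, columns, and delimiter are hypothetical; only the PARTITIONED BY (dt STRING) part mirrors the LOAD statement above:

    -- one table per logid, partitioned by an hour string such as 2012020711
    CREATE TABLE IF NOT EXISTS log_100 (
      ts     STRING,
      client STRING,
      msg    STRING
    )
    PARTITIONED BY (dt STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

Gzip-compressed text files can be loaded straight into such a table; Hadoop decompresses .gz input transparently at query time.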
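The hourly load itself could be driven by a cron script along these lines. Everything here (the log directory, the logid list file, the log_$logid table naming) is a placeholder for whatever your collector actually produces; it is only a sketch of the approach from the quoted mail:

    #!/bin/bash
    # hypothetical hourly loader: compress the previous hour's file for each
    # logid and load it into the matching Hive table/partition
    LOGDIR=/data/logs                      # where the collector writes hourly files
    H=$(date -d '1 hour ago' +%Y%m%d%H)    # partition key for the previous hour

    for logid in $(cat /etc/loader/logids.txt); do
      f="$LOGDIR/$logid/$H.log"
      [ -f "$f" ] || continue
      gzip "$f"
      hive -e "LOAD DATA LOCAL INPATH '$f.gz' OVERWRITE INTO TABLE log_$logid PARTITION (dt='$H')"
    done

Two caveats: running every load at the top of the hour recreates the burst you describe, so staggering the loop (or running several loaders at offset minutes) helps; and gzip is not splittable, so each 1-2GB file will be read by a single map task in the hourly job. Continuous delivery with Flume or Scribe sidesteps the hourly spike entirely.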