You're pretty much going to overwrite the partition every time you want
to add data to it. I wish there was an append but there isn't. We're
basically doing it the same way you are (looking at your insert statement).
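The closest thing to an append that I know of is faking it: rebuild the partition as old rows plus new rows in a single insert overwrite. Something like the sketch below, where host/logtime/request are just stand-ins for your real columns (I haven't tried this at your scale):

-- columns (host, logtime, request) are placeholders for your real schema
from (
  select host, logtime, request
  from real_weblog
  where logdate = '2011-05-09'
  union all
  select host, logtime, request
  from staging_weblog
  where regexp_extract(logtime, '([^:]*):.*', 1) = '09/May/2011'
) merged
insert overwrite table real_weblog
  partition (logdate = '2011-05-09')
  select merged.host, merged.logtime, merged.request;

I haven't stress-tested reading and overwriting the same partition in one statement, so it may be safer to write the merged rows to a temp table first and overwrite the partition from that.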
The challenge I'm running into (doing the same thing you are) is that
when I bulk load historical logs into Hadoop/Hive, I see heap
out-of-memory errors. When I load smaller batches, those errors mostly
go away. I'm curious whether you (or anyone) has a better way of
loading historicals.
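For what it's worth, the batching that keeps us out of trouble is just running one multi-insert per small group of dates instead of one statement over the whole history. Roughly like this, assuming staging/real tables shaped like yours (the dates are examples only):

-- example dates; one insert branch per day, a handful of days per run
from staging_weblog
insert overwrite table real_weblog
  partition (logdate = '2011-05-08')
  select * where regexp_extract(logtime, '([^:]*):.*', 1) = '08/May/2011'
insert overwrite table real_weblog
  partition (logdate = '2011-05-09')
  select * where regexp_extract(logtime, '([^:]*):.*', 1) = '09/May/2011';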
On 05/10/2011 11:56 AM, bichonfrise74 wrote:
Hi,
My end goal is to load the daily Apache logs. I wish to partition them
by date, and the group has given me some advice, but it seems that I am
still stuck.
My "daily Apache logs" can contain dates for 2 days ago, yesterday,
and today. So, what I did was I created a staging_weblog table and
used hive (using SerDe) to load the logs without any partition. Then I
ran a select distinct to get all the unique dates. And I converted
this date into YYYY-MM-DD format.
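For reference, my staging setup looks roughly like this; I've trimmed the regex and columns down for the email, so the real table has all the combined-log fields:

-- simplified to host, timestamp, and request; the contrib RegexSerDe
-- requires all columns to be strings
create table staging_weblog (
  host string,
  logtime string,
  request string
)
row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
with serdeproperties (
  "input.regex" = "(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"([^\"]*)\".*"
);

-- unique dates from the logs, converted to YYYY-MM-DD
select distinct
  from_unixtime(unix_timestamp(regexp_extract(logtime, '([^:]*):.*', 1),
                               'dd/MMM/yyyy'), 'yyyy-MM-dd')
from staging_weblog;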
After this I created a multi-insert statement like this:
from staging_weblog
insert overwrite table real_weblog
  partition (logdate = '<yyyy-mm-dd>')
  select * where regexp_extract(logtime, '([^:]*):.*', 1) = 'dd/MMM/yyyy'
The above works fine if there is no existing partition. But if there
is an existing partition, then it 'overwrites' and replaces the old
partition with the new one. The 'overwrite' keyword is mandatory based
on the documentation, but I wish to just append the data to the
existing partition.
Has anyone solved this problem before?
Or let me ask a more general question: how do you load your daily
Apache logs into Hadoop so that you can use Hive to process the data?