Put the logile into a location on HDFS, and create an external table pointing to that location. The External table should just have one column, a string, CREATE EXTERNAL TABLE logfile_etl (message STRING) LOCATION '/etl/logfile'
I think that should work. Then Create another table CREATE TABLE logfile (ts STRING, ADD STRING, files STRING) PARTITIONED BY (DAY STRING) STORED AS ORC; copy files into /etc/logfile run this hive file: SET hive.exec.dynamic.partition=true; SET hive.exec.dynamic.partition.mode=nonstrict; SET hive.exec.max.dynamic.partitions=10000; SET hive.exec.max.dynamic.partitions.pernode=1000; INSERT INTO logfile select substring(message_line, 0, 17) as ts regexp_extract(message_line, '\\[([^\\]+)\\]') as ADD, regexp_extract(message_line,'\\] \\[([^\\]]+)\\]') as files, concat('20', substring(messageline, 0, 8)) as day from logfile_etl delete the the files /etl/logfile (or move them to an archival) That will get you a day partitioned (I added the 20 in front of your date so that string sorts well, although it probably would without it, it' early, and I have not had coffee yet) ORC file table (with compression and ORC good ness. The regexs are a little messy, by based on your one line of data, should work. Also: If you have data from pre 2000 obviously, the concat('20' thing needs to be updated. Note, I didn't use a regex on the date... why? It appears to be properly padded data, therefore a substring is fast. This type of stuff has so many ways to skin a cat, so your way may be totally different from my way, but this is how I'd approach it long term. (if it's a one time thing, I may not create the managed the table, but if so, having partitions and ORC files will make things faster). If there are syntax errors I apologize, see earlier disclaimer about lack of proper bean sourced stimulants. On Sat, Nov 23, 2013 at 7:36 AM, Baahu <bahub...@gmail.com> wrote: > Hi, > I have a messy log file which I want to use to create a table, I am only > interested to retrieve 3 columns (time,ADD,files),which are in bold. > Sample entry from log file > *: 13-11-23 06:23:45 [ADD] [file1.zip|file2.zip] * junkjunk|2013-11-23 > 06:23:44:592 EST|file3.zip xyz|2013-11-23 06:23:44:592 EST|file3.zip > > Can you please let me know how I should go about, regex seems to be way > out,but I am struggling with that as well !! > > Thanks, > Baahu > >