Thank you very much, Dudu. This really helps.

On Tue, Jun 21, 2016 at 7:48 PM, Markovitz, Dudu <dmarkov...@paypal.com> wrote:
> Hi
>
> Here is the code (without the log data).
>
> I've created some of the views using different text-processing techniques.
> The rest of the views can be created in similar ways.
>
> Dudu
>
>
> bash
> ----------------------------------------------------------------------------------------------------
>
> hdfs dfs -mkdir -p /tmp/log/20160621
> hdfs dfs -put logfile.txt /tmp/log/20160621
>
>
> hive
> ----------------------------------------------------------------------------------------------------
>
> /*
>     External table log
>
>     Defines all common columns + optional column 'tid' which appears in most log records + the rest of the log ('txt')
> */
>
> drop table if exists log;
>
> create external table log
> (
>      c1             string
>     ,ts             string
>     ,ts_frac        string
>     ,log_rec_level  string
>     ,c4             string
>     ,c5             string
>     ,tid            string
>     ,txt            string
> )
> partitioned by (dt date)
> row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
> with serdeproperties ('input.regex'='(\\S+)\\s+([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}),([0-9]{3})\\s+(\\S+)\\s+\\[([^]]+)\\]\\s+(\\S+)\\s+:\\s+(TID:\\s\\d+)?\\s*(.*)')
> stored as textfile
> location '/tmp/log'
> ;
>
> alter table log add partition (dt=date '2016-06-21') location '/tmp/log/20160621';
>
> select * from log;
>
> ----------------------------------------------------------------------------------------------------
>
> /*
>     View log_v
>
>     Base view for all other views
> */
>
> drop view if exists log_v;
>
> create view log_v
> as
> select  c1
>        ,cast (concat_ws ('.',ts,ts_frac) as timestamp) as ts
>        ,log_rec_level
>        ,c4
>        ,c5
>        ,cast (ltrim(substr (tid,5)) as bigint) as tid
>        ,txt
>
> from    log
> ;
>
> select * from log_v;
>
> ----------------------------------------------------------------------------------------------------
>
> drop view if exists log_v_reaping_path;
>
> create view log_v_reaping_path
> as
> select  c1
>        ,ts
>        ,log_rec_level
>        ,c4
>        ,c5
>        ,substr (txt,15) as reaping_path
>
> from    log_v
>
> where   txt like 'Reaping path: %'
> ;
>
> select * from log_v_reaping_path;
>
> ----------------------------------------------------------------------------------------------------
>
> drop view if exists log_v_published_to_kafka;
>
> create view log_v_published_to_kafka
> as
> select  c1
>        ,ts
>        ,log_rec_level
>        ,c4
>        ,c5
>        ,tid
>
>        ,      ltrim (kv [' Key'])                   as key
>        ,cast (ltrim (kv [' size'])      as bigint)  as size
>        ,      ltrim (kv [' topic'])                 as topic
>        ,cast (ltrim (kv [' partition']) as int)     as partition
>        ,cast (ltrim (kv [' offset'])    as bigint)  as offset
>
> from   (select  c1
>                ,ts
>                ,log_rec_level
>                ,c4
>                ,c5
>                ,tid
>                ,str_to_map (substr (txt,locate ('.',txt)+1),',',':') as kv
>
>         from    log_v
>
>         where   txt like 'Published to Kafka. %'
>        ) as t
> ;
>
> select * from log_v_published_to_kafka;
>
> ----------------------------------------------------------------------------------------------------
>
> drop view if exists log_v_get_request;
>
> create view log_v_get_request
> as
> select  c1
>        ,ts
>        ,log_rec_level
>        ,c4
>        ,c5
>        ,tid
>        ,substr (txt,31) as path
>
> from    log_v
>
> where   txt like 'GET request received for path %'
> ;
>
> select * from log_v_get_request;
>
> ----------------------------------------------------------------------------------------------------
>
> drop view if exists log_v_unlock_request;
>
> create view log_v_unlock_request
> as
> select  c1
>        ,ts
>        ,log_rec_level
>        ,c4
>        ,c5
>        ,tid
>        ,regexp_extract (txt,'rowkey (\\S+)',1) as rowkey
>        ,regexp_extract (txt,'lock id (\\S+)',1) as lock_id
>
> from    log_v
>
> where   txt like 'Unlock request for schema DU %'
> ;
>
>
> From: Markovitz, Dudu [mailto:dmarkov...@paypal.com]
> Sent: Tuesday, June 21, 2016 2:26 PM
> To: user@hive.apache.org
> Subject: RE: RegexSerDe with Filters
>
> Hi
>
> I would suggest creating a single external table with daily partitions and
> multiple views, each with the appropriate filtering.
> If you send me a log sample (~100 rows), I'll send you an example.
>
> Dudu
>
> From: Arun Patel [mailto:arunp.bigd...@gmail.com]
> Sent: Tuesday, June 21, 2016 1:51 AM
> To: user@hive.apache.org
> Subject: RegexSerDe with Filters
>
> Hello Hive Experts,
>
> I use Flume to ingest application-specific logs from syslog to HDFS.
> Currently, I grep the HDFS directory for specific patterns (for multiple
> types of requests) and then create reports. However, generating weekly and
> monthly reports this way is not scalable.
>
> I would like to create multiple external tables on the daily HDFS directory,
> partitioned by date and using RegexSerDe, and then create separate Parquet
> tables for every kind of request.
>
> The question is: how do I create multiple (about 20) RegexSerDe tables on the
> same data, applying filters? This would be just like the 20 grep commands I am
> running today.
>
> Example: hadoop fs -cat /user/ffffprod/2016-06-20/* | grep 'STORE Request Received for APPXXXX' | awk '{print $4, $13, $14, $17, $20}'
>          hadoop fs -cat /user/ffffprod/2016-06-20/* | grep 'SCAN Request Received for APPYYYY' | awk '{print $4, $14, $19, $21, $22}'
>          hadoop fs -cat /user/ffffprod/2016-06-20/* | grep 'TOTAL TIME' | awk '{print $4, $24}'
>
> I would like to create tables which do this kind of job and then write the
> output to Parquet tables.
>
> Please let me know how this can be done. Thank you!
>
> Regards,
> Arun
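
A minimal sketch of the Parquet step Arun asks about, building on the views above: each filtered view can be materialized with a CREATE TABLE ... STORED AS PARQUET AS SELECT (CTAS). The target table name log_get_request_pq is only an example and is not part of the thread.

-- Sketch only: materialize one of the filtered views as a Parquet table.
-- The table name log_get_request_pq is illustrative, not from the thread.
drop table if exists log_get_request_pq;

create table log_get_request_pq
stored as parquet
as
select  c1
       ,ts
       ,log_rec_level
       ,c4
       ,c5
       ,tid
       ,path
from    log_v_get_request
;

The same pattern repeats once per request type (one view plus one CTAS each), which replaces the ~20 grep/awk commands with views over the single RegexSerDe table. For date-partitioned Parquet output, the target table would instead be pre-created with a dt partition and loaded with insert overwrite, with dt carried through log_v and the per-request views.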