Dudu, Thanks for the clarification. Looks like I have an issue with my Hive installation. I tried in a different cluster and it works.
Thanks again.

On Fri, Jun 24, 2016 at 4:59 PM, Markovitz, Dudu <dmarkov...@paypal.com> wrote:

This is tested, working code.

If you're using https://regex101.com, first replace the backslash pairs (\\) with a single backslash (\) and also use the 'g' modifier in order to find all of the matches.

The regular expression is:

(\S+)\s+([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}),([0-9]{3})\s+(\S+)\s+\[([^]]+)\]\s+(\S+)\s+:\s+(TID:\s\d+)?\s*(.*)

I'll send you a screenshot in private, since you don't want to expose the data.

Dudu

From: Arun Patel [mailto:arunp.bigd...@gmail.com]
Sent: Friday, June 24, 2016 9:33 PM
To: user@hive.apache.org
Subject: Re: RegexSerDe with Filters

Looks like the regex pattern is not working. I tested the pattern on https://regex101.com/ and it does not find any match.

Any suggestions?

On Thu, Jun 23, 2016 at 3:01 PM, Markovitz, Dudu <dmarkov...@paypal.com> wrote:

My pleasure.
Please feel free to reach me if needed.

Dudu

From: Arun Patel [mailto:arunp.bigd...@gmail.com]
Sent: Wednesday, June 22, 2016 2:57 AM
To: user@hive.apache.org
Subject: Re: RegexSerDe with Filters

Thank you very much, Dudu. This really helps.

On Tue, Jun 21, 2016 at 7:48 PM, Markovitz, Dudu <dmarkov...@paypal.com> wrote:

Hi

Here is the code (without the log data).

I've created some of the views using different text-processing techniques.
The rest of the views could be created in similar ways.

Dudu

bash
----------------------------------------------------------------------------------------------------

hdfs dfs -mkdir -p /tmp/log/20160621
hdfs dfs -put logfile.txt /tmp/log/20160621

hive
----------------------------------------------------------------------------------------------------

/*
    External table log

    Defines all common columns + optional column 'tid' which appears in most log records + the rest of the log ('txt')
*/

drop table if exists log;

create external table log
(
     c1             string
    ,ts             string
    ,ts_frac        string
    ,log_rec_level  string
    ,c4             string
    ,c5             string
    ,tid            string
    ,txt            string
)
partitioned by (dt date)
row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
with serdeproperties ('input.regex'='(\\S+)\\s+([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}),([0-9]{3})\\s+(\\S+)\\s+\\[([^]]+)\\]\\s+(\\S+)\\s+:\\s+(TID:\\s\\d+)?\\s*(.*)')
stored as textfile
location '/tmp/log'
;

alter table log add partition (dt=date '2016-06-21') location '/tmp/log/20160621';

select * from log;

----------------------------------------------------------------------------------------------------

/*
    View log_v

    Base view for all other views
*/

drop view if exists log_v;

create view log_v
as
select  c1
       ,cast (concat_ws ('.',ts,ts_frac) as timestamp)  as ts
       ,log_rec_level
       ,c4
       ,c5
       ,cast (ltrim(substr (tid,5)) as bigint)          as tid
       ,txt

from    log
;

select * from log_v;

----------------------------------------------------------------------------------------------------

drop view if exists log_v_reaping_path;

create view log_v_reaping_path
as
select  c1
       ,ts
       ,log_rec_level
       ,c4
       ,c5
       ,substr (txt,15) as reaping_path

from    log_v

where   txt like 'Reaping path: %'
;

select * from log_v_reaping_path;

----------------------------------------------------------------------------------------------------

drop view if exists log_v_published_to_kafka;

create view log_v_published_to_kafka
as
select  c1
       ,ts
       ,log_rec_level
       ,c4
       ,c5
       ,tid

       ,      ltrim (kv [' Key'])                   as key
       ,cast (ltrim (kv [' size'])      as bigint)  as size
       ,      ltrim (kv [' topic'])                 as topic
       ,cast (ltrim (kv [' partition']) as int)     as partition
       ,cast (ltrim (kv [' offset'])    as bigint)  as offset

from   (select  c1
               ,ts
               ,log_rec_level
               ,c4
               ,c5
               ,tid
               ,str_to_map (substr (txt ,locate ('.',txt)+1),',',':') as kv

        from    log_v

        where   txt like 'Published to Kafka. %'
       ) as t
;

select * from log_v_published_to_kafka;
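The published_to_kafka view relies on str_to_map: the inner query cuts off everything up to and including the first '.' in txt, then splits the remainder on ',' (between pairs) and on ':' (between key and value). Because the split keeps the space that follows each comma and colon, the map keys carry a leading space (' Key', ' size', ...), which is why the outer query looks them up that way and wraps the values in ltrim. A minimal illustration, using a made-up message text since the real log lines were not posted to the list:

-- Illustrative only; ' Key: k1, size: 100, topic: t1, partition: 0, offset: 42'
-- stands in for whatever follows 'Published to Kafka.' in the real log records.
select str_to_map(' Key: k1, size: 100, topic: t1, partition: 0, offset: 42', ',', ':');
-- returns something like {" Key":" k1", " size":" 100", " topic":" t1", " partition":" 0", " offset":" 42"}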
----------------------------------------------------------------------------------------------------

drop view if exists log_v_get_request;

create view log_v_get_request
as
select  c1
       ,ts
       ,log_rec_level
       ,c4
       ,c5
       ,tid
       ,substr (txt,31) as path

from    log_v

where   txt like 'GET request received for path %'
;

select * from log_v_get_request;

----------------------------------------------------------------------------------------------------

drop view if exists log_v_unlock_request;

create view log_v_unlock_request
as
select  c1
       ,ts
       ,log_rec_level
       ,c4
       ,c5
       ,tid
       ,regexp_extract (txt,'rowkey (\\S+)',1)  as rowkey
       ,regexp_extract (txt,'lock id (\\S+)',1) as lock_id

from    log_v

where   txt like 'Unlock request for schema DU %'
;
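Getting from one of these views to a Parquet table, as the original question (quoted further down) asks for, should be possible with a plain CTAS. A sketch; 'get_request_parquet' is only a placeholder name, not something from the thread:

-- Materialize one of the views as a Parquet table (placeholder table name).
drop table if exists get_request_parquet;

create table get_request_parquet
stored as parquet
as
select * from log_v_get_request;

For a recurring daily load, inserting into a table partitioned by dt would probably be the more natural shape, which would also mean exposing dt in the views.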
From: Markovitz, Dudu [mailto:dmarkov...@paypal.com]
Sent: Tuesday, June 21, 2016 2:26 PM
To: user@hive.apache.org
Subject: RE: RegexSerDe with Filters

Hi

I would suggest creating a single external table with daily partitions and multiple views, each with the appropriate filtering.
If you'll send me a log sample (~100 rows), I'll send you an example.

Dudu

From: Arun Patel [mailto:arunp.bigd...@gmail.com]
Sent: Tuesday, June 21, 2016 1:51 AM
To: user@hive.apache.org
Subject: RegexSerDe with Filters

Hello Hive Experts,

I use Flume to ingest application-specific logs from syslog to HDFS. Currently, I grep the HDFS directory for specific patterns (for multiple types of requests) and then create reports. However, generating weekly and monthly reports this way is not scalable.

I would like to create multiple external tables on the daily HDFS directory, partitioned by date, with RegexSerDe, and then create separate Parquet tables for every kind of request.

The question is: how do I create multiple (about 20) RegexSerDe tables on the same data while applying filters? This will be just like the 20 grep commands I am running today.

Example:

hadoop fs -cat /user/ffffprod/2016-06-20/* | grep 'STORE Request Received for APPXXXX' | awk '{print $4, $13, $14, $17, $20}'
hadoop fs -cat /user/ffffprod/2016-06-20/* | grep 'SCAN Request Received for APPYYYY' | awk '{print $4, $14, $19, $21, $22}'
hadoop fs -cat /user/ffffprod/2016-06-20/* | grep 'TOTAL TIME' | awk '{print $4, $24}'

I would like to create tables which do this kind of job and then write the output to Parquet tables.

Please let me know how this can be done. Thank you!

Regards,
Arun
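Under the single-table-plus-views approach suggested above, each of these grep/awk pipelines maps to one more filtered view over log_v, in the same style as the views earlier in the thread. A sketch for the 'TOTAL TIME' case; the LIKE pattern and the selected columns are only guesses, since mapping awk's $4 and $24 onto the parsed columns depends on the actual log layout:

-- Rough, hypothetical equivalent of:
--   hadoop fs -cat /user/ffffprod/2016-06-20/* | grep 'TOTAL TIME' | awk '{print $4, $24}'
-- The column list is illustrative only.
drop view if exists log_v_total_time;

create view log_v_total_time
as
select  c1
       ,ts
       ,log_rec_level
       ,c4
       ,c5
       ,tid
       ,txt

from    log_v

where   txt like '%TOTAL TIME%'
;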