Thanks, Dudu. A couple of additional questions. 1) I need to have the partition column in my view, and I am able to add it. But will it add any performance overhead? 2) I have six 2 GB files under the date partition, and currently I have only one partition. I always run into 'OutOfMemoryError: Java heap space' when I run the above query, which uses group by tid. I considered increasing hive.tez.container.size and hive.tez.java.opts. Is this due to the 'group by'? How do I identify the required memory setting? Or do you have any other suggestions?
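For reference, this is the kind of change I was considering (the values below are illustrative, not tested recommendations):

```sql
-- Illustrative values only; size these to the cluster.
-- Common rule of thumb: -Xmx at roughly 80% of the container size.
set hive.tez.container.size=4096;    -- MB per Tez task container
set hive.tez.java.opts=-Xmx3276m;    -- JVM heap inside that container
-- Map-side aggregation keeps the group-by hash table in memory;
-- lowering its share (or disabling it) can relieve heap pressure.
set hive.map.aggr.hash.percentmemory=0.25;
-- set hive.map.aggr=false;
```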
On Sun, Jul 3, 2016 at 12:14 AM, Markovitz, Dudu <dmarkov...@paypal.com> wrote: > > > select ... > > ,sort_array (collect_list(case when att like > '%ProcessingHandler Message%' then val_num end)) as timestamps_with_dup > > ,sort_array (collect_set (case when att like > '%ProcessingHandler Message%' then val_num end)) as timestamps_no_dup > > > > ... > > from v > > group by tid > > ; > > > > *From:* Arun Patel [mailto:arunp.bigd...@gmail.com] > *Sent:* Sunday, July 03, 2016 12:39 AM > > *To:* user@hive.apache.org > *Subject:* Re: RegexSerDe with Filters > > > > Dudu, > > > > That's an excellent suggestion. Yes, it works for me. > > > > But if there are two lines with "ProcessingHandler Message" and they > have the same TID, I want to create an array of two timestamps. > > > > Is it possible without a join? > > > > > > > > > > On Sat, Jul 2, 2016 at 3:56 PM, Markovitz, Dudu <dmarkov...@paypal.com> > wrote: > > Hi Venkat > > > > You don’t necessarily need the three views if your goal is to join them. > > You can achieve the same result using a single view and an aggregated > query. > > Please test the following code and see whether it works for you or you would > like a different solution. 
> > > > Dudu > > > > > > create external table t > > ( > > c1 string > > ,ts string > > ,c3 string > > ,log_rec_level string > > ,tid string > > ,att string > > ,val string > > ,val_num string > > ) > > row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' > > with serdeproperties ('input.regex'='(\\S+)\\s+(.{23})\\s+(\\S+)\\s+\\[([^]]+)\\]\\s+TID:\\s*(\\d+)\\s+([^:]+):?\\s*((\\d+)?.*)') > > stored as textfile > > location '/tmp/t' > > ; > > > > create view v > > as > > select c1 > > ,cast (ts as timestamp) as ts > > ,c3 as c3 > > ,log_rec_level as log_rec_level > > ,cast (tid as bigint) as tid > > ,att as att > > ,val as val > > ,cast (val_num as bigint) as val_num > > > > from t > > ; > > > > > > select tid > > > > ,min (case when att like '%ProcessingHandler Message%' then > ts end) as ts_ProcessingHandler_Message > > ,min (case when att = 'Request received in writer' then > ts end) as ts_Request_received_in_writer > > ,min (case when att = 'Total time' then > ts end) as ts_Total_time > > > > ,min (case when att like '%ProcessingHandler Message%' then > val_num end) as timestamp > > ,min (case when att = 'Total time' then > val_num end) as Total_time > > > > from v > > > > group by tid > > ; > > > > > > > > > > *From:* Arun Patel [mailto:arunp.bigd...@gmail.com] > *Sent:* Friday, July 01, 2016 9:20 PM > > > *To:* user@hive.apache.org > *Subject:* Re: RegexSerDe with Filters > > > > Dudu, > > > > Thanks for your continued support. I need one more quick bit of help. I have > one more log file, as shown below. 
> > > > STD-SERV 2016-06-29 12:10:39.142 c.f.c.s.F.ProcessingHandler [INFO] > TID:101114719017567668 cluster1 ProcessingHandler Message timestamp: > 1467216639090 > > STD-SERV 2016-06-29 12:10:39.143 c.f.c.s.F.ProcessingHandler [INFO] TID: > 101114719017567668 cluster1: Processed request > > STD-SERV 2016-06-29 12:10:39.163 c.f.c.s.F.WritingHandler [INFO] TID: > 101114719017567668 Request received in writer > > STD-SERV 2016-06-29 12:10:39.273 c.f.c.s.F.WritingHandler [INFO] TID: > 101114719017567668 Processed request > > STD-SERV 2016-06-29 12:10:39.273 c.f.c.s.F.WritingHandler [INFO] TID: > 101114719017567668 Total time: 10 ms > > > > I need to create 3 views for 3 requirements. > > 1) Create a view to get timestamp, TID number and cluster1 for the lines > "ProcessingHandler Message timestamp". But for this line there is no > space between TID: and the TID number. > > > > 2) Create a view to get timestamp, TID for the lines "Request received in > writer". There is a space between TID: and the TID number. > > > > 3) Create a view to get timestamp, TID for the lines "Total time:". There > is a space between TID: and the TID number. > > > > How do I create the base table and views? I am planning to join these 3 views > based on TID. Are there any special considerations? > > > > Regards, > > Venkat > > > > > > > > > > On Fri, Jun 24, 2016 at 5:17 PM, Arun Patel <arunp.bigd...@gmail.com> > wrote: > > Dudu, Thanks for the clarification. Looks like I have an issue with my > Hive installation. I tried in a different cluster and it works. > > > > Thanks again. > > > > > > On Fri, Jun 24, 2016 at 4:59 PM, Markovitz, Dudu <dmarkov...@paypal.com> > wrote: > > This is tested, working code. > > If you’re using https://regex101.com, first replace backslash pairs (\\ ) > with a single backslash (\) and also use the ‘g’ modifier in order to find > all of the matches. 
> > > > The regular expression is - > > (\S+)\s+([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}),([0-9]{3})\s+(\S+)\s+\[([^]]+)\]\s+(\S+)\s+:\s+(TID:\s\d+)?\s*(.*) > > > > I’ll send you a screenshot in private, since you don’t want to expose the > data. > > > > Dudu > > > > > > *From:* Arun Patel [mailto:arunp.bigd...@gmail.com] > *Sent:* Friday, June 24, 2016 9:33 PM > > > *To:* user@hive.apache.org > *Subject:* Re: RegexSerDe with Filters > > > > Looks like the regex pattern is not working. I tested the pattern on > https://regex101.com/ and it does not find any match. > > > > Any suggestions? > > > > On Thu, Jun 23, 2016 at 3:01 PM, Markovitz, Dudu <dmarkov...@paypal.com> > wrote: > > My pleasure. > > Please feel free to reach me if needed. > > > > Dudu > > > > *From:* Arun Patel [mailto:arunp.bigd...@gmail.com] > *Sent:* Wednesday, June 22, 2016 2:57 AM > *To:* user@hive.apache.org > *Subject:* Re: RegexSerDe with Filters > > > > Thank you very much, Dudu. This really helps. > > > > On Tue, Jun 21, 2016 at 7:48 PM, Markovitz, Dudu <dmarkov...@paypal.com> > wrote: > > Hi > > > > Here is the code (without the log data). > > > > I’ve created some of the views using different text processing techniques. > > The rest of the views could be created in similar ways. 
> > > > > > Dudu > > > > > > > > bash > > > ---------------------------------------------------------------------------------------------------- > > > > hdfs dfs -mkdir -p /tmp/log/20160621 > > hdfs dfs -put logfile.txt /tmp/log/20160621 > > > > > > hive > > > ---------------------------------------------------------------------------------------------------- > > > > /* > > External table log > > > > Defines all common columns + optional column 'tid' which appears in > most log records + the rest of the log ('txt') > > > > */ > > > > drop table if exists log; > > > > create external table log > > ( > > c1 string > > ,ts string > > ,ts_frac string > > ,log_rec_level string > > ,c4 string > > ,c5 string > > ,tid string > > ,txt string > > ) > > partitioned by (dt date) > > row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' > > with serdeproperties ('input.regex'='(\\S+)\\s+([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}),([0-9]{3})\\s+(\\S+)\\s+\\[([^]]+)\\]\\s+(\\S+)\\s+:\\s+(TID:\\s\\d+)?\\s*(.*)') > > stored as textfile > > location '/tmp/log' > > ; > > > > alter table log add partition (dt=date '2016-06-21') location > '/tmp/log/20160621'; > > > > select * from log; > > > ---------------------------------------------------------------------------------------------------- > > > > /* > > View log_v > > > > Base view for all other views > > > > */ > > > > drop view if exists log_v; > > > > create view log_v > > as > > select c1 > > ,cast (concat_ws ('.',ts,ts_frac) as timestamp) as ts > > ,log_rec_level > > ,c4 > > ,c5 > > ,cast (ltrim(substr (tid,5)) as bigint) as tid > > ,txt > > > > from log > > ; > > > > select * from log_v; > > > > > ---------------------------------------------------------------------------------------------------- > > > > drop view if exists log_v_reaping_path; > > > > create view log_v_reaping_path > > as > > select c1 > > ,ts > > ,log_rec_level > > ,c4 > > ,c5 > > ,substr (txt,15) as reaping_path > > > > from 
log_V > > > > where txt like 'Reaping path: %' > > ; > > > > select * from log_v_reaping_path; > > > > > ---------------------------------------------------------------------------------------------------- > > > > drop view if exists log_v_published_to_kafka; > > > > create view log_v_published_to_kafka > > as > > select c1 > > ,ts > > ,log_rec_level > > ,c4 > > ,c5 > > ,tid > > > > , ltrim (kv [' Key'] ) as key > > ,cast (ltrim (kv [' size'] ) as bigint ) as size > > , ltrim (kv [' topic'] ) as topic > > ,cast (ltrim (kv [' partition']) as int ) as partition > > ,cast (ltrim (kv [' offset'] ) as bigint ) as offset > > > > from (select c1 > > ,ts > > ,log_rec_level > > ,c4 > > ,c5 > > ,tid > > ,str_to_map (substr (txt ,locate > ('.',txt)+1),',',':') as kv > > > > from log_V > > > > where txt like 'Published to Kafka. %' > > ) > > as t > > ; > > > > select * from log_v_published_to_kafka; > > > > > ---------------------------------------------------------------------------------------------------- > > > > drop view if exists log_v_get_request; > > > > create view log_v_get_request > > as > > select c1 > > ,ts > > ,log_rec_level > > ,c4 > > ,c5 > > ,tid > > ,substr (txt,31) as path > > > > from log_V > > > > where txt like 'GET request received for path %' > > ; > > > > select * from log_v_get_request; > > > > > ---------------------------------------------------------------------------------------------------- > > > > drop view if exists log_v_unlock_request; > > > > create view log_v_unlock_request > > as > > select c1 > > ,ts > > ,log_rec_level > > ,c4 > > ,c5 > > ,tid > > ,regexp_extract (txt,'rowkey (\\S+)',1) as rowkey > > ,regexp_extract (txt,'lock id (\\S+)',1) as lock_id > > > > from log_V > > > > where txt like 'Unlock request for schema DU %' > > ; > > > > > > *From:* Markovitz, Dudu [mailto:dmarkov...@paypal.com] > *Sent:* Tuesday, June 21, 2016 2:26 PM > *To:* user@hive.apache.org > *Subject:* RE: RegexSerDe with Filters > > > > Hi > > > > I would 
suggest creating a single external table with daily partitions and > multiple views, each with the appropriate filtering. > > If you’ll send me a log sample (~100 rows) I’ll send you an example. > > > > Dudu > > > > *From:* Arun Patel [mailto:arunp.bigd...@gmail.com > <arunp.bigd...@gmail.com>] > *Sent:* Tuesday, June 21, 2016 1:51 AM > *To:* user@hive.apache.org > *Subject:* RegexSerDe with Filters > > > > Hello Hive Experts, > > > > I use flume to ingest application-specific logs from Syslog to HDFS. > Currently, I grep the HDFS directory for specific patterns (for multiple > types of requests) and then create reports. However, generating weekly > and monthly reports is not scalable. > > > > I would like to create multiple external tables on the daily HDFS directory, > partitioned by date, with RegexSerde, and then create separate Parquet tables > for every kind of request. > > > > The question is: how do I create multiple (about 20) RegexSerde tables on the > same data, applying filters? This will be just like the 20 grep commands I am > running today. > > > > Example: hadoop fs -cat /user/ffffprod/2016-06-20/* | grep 'STORE Request > Received for APPXXXX' | awk '{print $4, $13, $14, $17, $20}' > > hadoop fs -cat /user/ffffprod/2016-06-20/* | grep 'SCAN > Request Received for APPYYYY' | awk '{print $4, $14, $19, $21, $22}' > > hadoop fs -cat /user/ffffprod/2016-06-20/* | grep 'TOTAL > TIME' | awk '{print $4, $24}' > > > > I would like to create tables that do this kind of job and then > write the output to Parquet tables. > > > > Please let me know how this can be done. Thank you! > > > > Regards, > > Arun
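As a sanity check outside Hive, the serde pattern agreed on earlier in the thread can be run against one of the sample log lines. A minimal sketch in Python (Hive's doubled backslashes collapsed to single ones; the groups map to the table's columns c1, ts, c3, log_rec_level, tid, att, val, val_num):

```python
import re

# The input.regex from the thread, rewritten with single backslashes.
pattern = re.compile(
    r'(\S+)\s+(.{23})\s+(\S+)\s+\[([^]]+)\]\s+'
    r'TID:\s*(\d+)\s+([^:]+):?\s*((\d+)?.*)'
)

# One of the sample log lines from the thread.
line = ('STD-SERV 2016-06-29 12:10:39.142 c.f.c.s.F.ProcessingHandler '
        '[INFO] TID:101114719017567668 cluster1 ProcessingHandler '
        'Message timestamp: 1467216639090')

m = pattern.match(line)
print(m.group(2))   # ts:      2016-06-29 12:10:39.142
print(m.group(5))   # tid:     101114719017567668
print(m.group(8))   # val_num: 1467216639090
```

For the "Total time: 10 ms" record, the same pattern yields att = 'Total time' and val_num = '10', which is what the case expressions in the group-by query rely on.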