Thanks, Dudu.  A couple of additional questions:

1) I need to have the partition column in my view, and I am able to add it.
But will it add any performance overhead?
2) I have six 2GB files under the date partition, and currently I have only
one partition.  I always run into 'OutOfMemoryError: Java heap space' when I
run the above query, which uses GROUP BY tid.  I considered
increasing hive.tez.container.size and hive.tez.java.opts.  Is this due to
the GROUP BY?  How do I identify the required memory setting?  Or any other
suggestions?



On Sun, Jul 3, 2016 at 12:14 AM, Markovitz, Dudu <dmarkov...@paypal.com>
wrote:

>
>
> select      ...
>
>            ,sort_array (collect_list (case when att like '%ProcessingHandler Message%' then val_num end))    as timestamps_with_dup
>
>            ,sort_array (collect_set  (case when att like '%ProcessingHandler Message%' then val_num end))    as timestamps_no_dup
>
>             ...
>
> from        v
>
> group by    tid
>
> ;
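The difference between the two calls — collect_list keeps duplicate val_num values within a tid group while collect_set drops them, and sort_array orders the result — can be sketched as a rough Python analogue (the timestamp values below are made up for illustration):

```python
# Rough analogue of Hive's aggregates over a single tid group:
#   sort_array(collect_list(x)) -> sorted list, duplicates kept
#   sort_array(collect_set(x))  -> sorted list, duplicates removed
vals = [1467216639090, 1467216639090, 1467216639200]  # hypothetical val_num values for one tid

timestamps_with_dup = sorted(vals)       # like sort_array(collect_list(...))
timestamps_no_dup = sorted(set(vals))    # like sort_array(collect_set(...))

print(timestamps_with_dup)  # [1467216639090, 1467216639090, 1467216639200]
print(timestamps_no_dup)    # [1467216639090, 1467216639200]
```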
>
>
>
> *From:* Arun Patel [mailto:arunp.bigd...@gmail.com]
> *Sent:* Sunday, July 03, 2016 12:39 AM
>
> *To:* user@hive.apache.org
> *Subject:* Re: RegexSerDe with Filters
>
>
>
> Dudu,
>
>
>
> That's an excellent suggestion.  Yes, it works for me.
>
>
>
> But if there are two lines with "ProcessingHandler Message" that have the
> same TID, I want to create an array of the two timestamps.
>
>
>
> Is it possible without a join?
>
> On Sat, Jul 2, 2016 at 3:56 PM, Markovitz, Dudu <dmarkov...@paypal.com>
> wrote:
>
> Hi Venkat
>
>
>
> You don’t necessarily need the three views if your goal is to join them.
>
> You can achieve the same result using a single view and an aggregated
> query.
>
> Please test the following code and see whether it works for you or you
> would like a different solution.
>
>
>
> Dudu
>
>
>
>
>
> create external table t
> (
>     c1              string
>    ,ts              string
>    ,c3              string
>    ,log_rec_level   string
>    ,tid             string
>    ,att             string
>    ,val             string
>    ,val_num         string
> )
>     row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
>     with serdeproperties ('input.regex'='(\\S+)\\s+(.{23})\\s+(\\S+)\\s+\\[([^]]+)\\]\\s+TID:\\s*(\\d+)\\s+([^:]+):?\\s*((\\d+)?.*)')
>     stored as textfile
>     location '/tmp/t'
> ;
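A quick way to sanity-check this pattern outside Hive is a short Python sketch; the Java-style doubled backslashes become single ones, and the input lines are the sample records quoted further down the thread (one with no space after "TID:", one with a space):

```python
import re

# The table's input.regex with doubled backslashes reduced to single ones.
pattern = re.compile(
    r'(\S+)\s+(.{23})\s+(\S+)\s+\[([^]]+)\]\s+TID:\s*(\d+)\s+([^:]+):?\s*((\d+)?.*)')

# Sample lines from the thread.
line1 = ('STD-SERV 2016-06-29 12:10:39.142 c.f.c.s.F.ProcessingHandler [INFO] '
         'TID:101114719017567668 cluster1 ProcessingHandler Message timestamp: 1467216639090')
line2 = ('STD-SERV 2016-06-29 12:10:39.273 c.f.c.s.F.WritingHandler [INFO] '
         'TID: 101114719017567668 Total time: 10 ms')

m1 = pattern.match(line1)
m2 = pattern.match(line2)

# Group 2 is the 23-char timestamp, group 5 the tid, group 6 the att, group 8 the numeric val.
print(m1.group(2), m1.group(5), m1.group(8))
print(m2.group(6), m2.group(8))
```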
>
>
>
> create view v
> as
> select      c1
>            ,cast (ts as timestamp)      as ts
>            ,c3                          as c3
>            ,log_rec_level               as log_rec_level
>            ,cast (tid as bigint)        as tid
>            ,att                         as att
>            ,val                         as val
>            ,cast (val_num as bigint)    as val_num
> from        t
> ;
>
>
>
>
>
> select      tid
>
>            ,min (case when att like '%ProcessingHandler Message%'   then ts         end)  as ts_ProcessingHandler_Message
>            ,min (case when att = 'Request received in writer'       then ts         end)  as ts_Request_received_in_writer
>            ,min (case when att = 'Total time'                       then ts         end)  as ts_Total_time
>
>            ,min (case when att like '%ProcessingHandler Message%'   then val_num    end)  as timestamp
>            ,min (case when att = 'Total time'                       then val_num    end)  as Total_time
>
> from        v
> group by    tid
> ;
>
> *From:* Arun Patel [mailto:arunp.bigd...@gmail.com]
> *Sent:* Friday, July 01, 2016 9:20 PM
>
>
> *To:* user@hive.apache.org
> *Subject:* Re: RegexSerDe with Filters
>
>
>
> Dudu,
>
>
>
> Thanks for your continued support.  I need one more quick bit of help.  I
> have one more log file, shown below.
>
>
>
> STD-SERV 2016-06-29 12:10:39.142 c.f.c.s.F.ProcessingHandler [INFO] TID:101114719017567668 cluster1 ProcessingHandler Message timestamp: 1467216639090
> STD-SERV 2016-06-29 12:10:39.143 c.f.c.s.F.ProcessingHandler [INFO] TID: 101114719017567668 cluster1: Processed request
> STD-SERV 2016-06-29 12:10:39.163 c.f.c.s.F.WritingHandler [INFO] TID: 101114719017567668 Request received in writer
> STD-SERV 2016-06-29 12:10:39.273 c.f.c.s.F.WritingHandler [INFO] TID: 101114719017567668 Processed request
> STD-SERV 2016-06-29 12:10:39.273 c.f.c.s.F.WritingHandler [INFO] TID: 101114719017567668 Total time: 10 ms
>
>
>
> I need to create 3 views for 3 requirements:
>
> 1) Create a view to get the timestamp, TID number and cluster1 for the lines
> containing "ProcessingHandler Message timestamp".  Note that for these lines
> there is no space between "TID:" and the TID number.
>
> 2) Create a view to get the timestamp and TID for the lines containing
> "Request received in writer".  There is a space between "TID:" and the TID number.
>
> 3) Create a view to get the timestamp and TID for the lines containing
> "Total time:".  There is a space between "TID:" and the TID number.
>
> How do I create the base table and views?  I am planning to join these 3
> views on TID.  Are there any special considerations?
>
>
>
> Regards,
>
> Venkat
>
> On Fri, Jun 24, 2016 at 5:17 PM, Arun Patel <arunp.bigd...@gmail.com>
> wrote:
>
> Dudu, thanks for the clarification.  Looks like I have an issue with my
> Hive installation.  I tried it in a different cluster and it works.
>
>
>
> Thanks again.
>
>
>
>
>
> On Fri, Jun 24, 2016 at 4:59 PM, Markovitz, Dudu <dmarkov...@paypal.com>
> wrote:
>
> This is tested, working code.
>
> If you’re using https://regex101.com, first replace each backslash pair (\\)
> with a single backslash (\), and use the ‘g’ modifier in order to find
> all of the matches.
>
>
>
> The regular expression is -
>
> (\S+)\s+([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}),([0-9]{3})\s+(\S+)\s+\[([^]]+)\]\s+(\S+)\s+:\s+(TID:\s\d+)?\s*(.*)
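This pattern can also be exercised in Python; since the real data was shared only privately, the log line below is invented to fit the documented format (host, timestamp with comma-separated milliseconds, class, bracketed level, thread, optional TID, free text):

```python
import re

# The regular expression above, verbatim.
pattern = re.compile(
    r'(\S+)\s+([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}),([0-9]{3})'
    r'\s+(\S+)\s+\[([^]]+)\]\s+(\S+)\s+:\s+(TID:\s\d+)?\s*(.*)')

# A made-up line in the documented shape (not from the real log data).
line = 'HOST-1 2016-06-21 09:15:02,123 com.example.Worker [INFO] main : TID: 42 Reaping path: /tmp/x'

m = pattern.match(line)
# Group 5 is the bracketed level, group 7 the optional TID token, group 8 the trailing text.
print(m.group(5), m.group(7), m.group(8))
```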
>
>
>
> I’ll send you a screen shot in private, since you don’t want to expose the
> data.
>
>
>
> Dudu
>
>
>
>
>
> *From:* Arun Patel [mailto:arunp.bigd...@gmail.com]
> *Sent:* Friday, June 24, 2016 9:33 PM
>
>
> *To:* user@hive.apache.org
> *Subject:* Re: RegexSerDe with Filters
>
>
>
> Looks like the regex pattern is not working.  I tested it on
> https://regex101.com/ and it does not find any match.
>
>
>
> Any suggestions?
>
>
>
> On Thu, Jun 23, 2016 at 3:01 PM, Markovitz, Dudu <dmarkov...@paypal.com>
> wrote:
>
> My pleasure.
>
> Please feel free to reach me if needed.
>
>
>
> Dudu
>
>
>
> *From:* Arun Patel [mailto:arunp.bigd...@gmail.com]
> *Sent:* Wednesday, June 22, 2016 2:57 AM
> *To:* user@hive.apache.org
> *Subject:* Re: RegexSerDe with Filters
>
>
>
> Thank you very much, Dudu.  This really helps.
>
>
>
> On Tue, Jun 21, 2016 at 7:48 PM, Markovitz, Dudu <dmarkov...@paypal.com>
> wrote:
>
> Hi
>
>
>
> Here is the code (without the log data).
>
>
>
> I’ve created some of the views using different text-processing techniques.
>
> The rest of the views could be created in similar ways.
>
>
>
>
>
> Dudu
>
>
>
>
>
>
>
> bash
> ----------------------------------------------------------------------------------------------------
>
> hdfs dfs -mkdir -p /tmp/log/20160621
> hdfs dfs -put logfile.txt /tmp/log/20160621
>
>
>
>
>
> hive
> ----------------------------------------------------------------------------------------------------
>
> /*
>     External table log
>
>     Defines all common columns + the optional column 'tid', which appears
>     in most log records, + the rest of the log ('txt')
> */
>
> drop table if exists log;
>
> create external table log
> (
>     c1              string
>    ,ts              string
>    ,ts_frac         string
>    ,log_rec_level   string
>    ,c4              string
>    ,c5              string
>    ,tid             string
>    ,txt             string
> )
>     partitioned by (dt date)
>     row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
>     with serdeproperties ('input.regex'='(\\S+)\\s+([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}),([0-9]{3})\\s+(\\S+)\\s+\\[([^]]+)\\]\\s+(\\S+)\\s+:\\s+(TID:\\s\\d+)?\\s*(.*)')
>     stored as textfile
>     location '/tmp/log'
> ;
>
> alter table log add partition (dt=date '2016-06-21') location '/tmp/log/20160621';
>
> select * from log;
>
>
> ----------------------------------------------------------------------------------------------------
>
>
>
> /*
>     View log_v
>
>     Base view for all other views
> */
>
> drop view if exists log_v;
>
> create view log_v
> as
> select      c1
>            ,cast (concat_ws ('.',ts,ts_frac) as timestamp)  as ts
>            ,log_rec_level
>            ,c4
>            ,c5
>            ,cast (ltrim(substr (tid,5)) as bigint)          as tid
>            ,txt
> from        log
> ;
>
> select * from log_v;
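The tid cleanup in the view — substr from position 5, then ltrim, then a cast — strips the literal "TID:" prefix captured by the regex. In Python terms (illustrative only; Hive's substr is 1-based, so position 5 is Python index 4):

```python
raw_tid = 'TID: 101114719017567668'  # as captured into the tid column (hypothetical value)

# Hive: cast(ltrim(substr(tid, 5)) as bigint)
tid = int(raw_tid[4:].lstrip())

print(tid)  # 101114719017567668
```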
>
>
>
>
> ----------------------------------------------------------------------------------------------------
>
>
>
> drop view if exists log_v_reaping_path;
>
> create view log_v_reaping_path
> as
> select      c1
>            ,ts
>            ,log_rec_level
>            ,c4
>            ,c5
>            ,substr (txt,15)     as reaping_path
> from        log_v
> where       txt like 'Reaping path: %'
> ;
>
> select * from log_v_reaping_path;
>
>
>
>
> ----------------------------------------------------------------------------------------------------
>
>
>
> drop view if exists log_v_published_to_kafka;
>
> create view log_v_published_to_kafka
> as
> select      c1
>            ,ts
>            ,log_rec_level
>            ,c4
>            ,c5
>            ,tid
>
>            ,      ltrim (kv [' Key']      )                 as key
>            ,cast (ltrim (kv [' size']     ) as bigint )     as size
>            ,      ltrim (kv [' topic']    )                 as topic
>            ,cast (ltrim (kv [' partition']) as int    )     as partition
>            ,cast (ltrim (kv [' offset']   ) as bigint )     as offset
>
> from       (select      c1
>                        ,ts
>                        ,log_rec_level
>                        ,c4
>                        ,c5
>                        ,tid
>                        ,str_to_map (substr (txt ,locate ('.',txt)+1),',',':')   as kv
>             from        log_v
>             where       txt like 'Published to Kafka. %'
>             )
>             as t
> ;
>
> select * from log_v_published_to_kafka;
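The str_to_map step takes everything after the first '.' and splits it into key/value pairs on ',' and ':'; because a space follows each comma, the map keys keep a leading space (hence kv [' Key'] and the ltrim on the values). A rough Python analogue, with an invented txt value in that shape:

```python
# Invented example of the txt column for a 'Published to Kafka. %' record.
txt = 'Published to Kafka. Key: k1, size: 10, topic: t1, partition: 0, offset: 5'

rest = txt[txt.index('.') + 1:]  # Hive: substr(txt, locate('.', txt) + 1)

# Hive: str_to_map(rest, ',', ':') -- keys keep their leading space.
kv = dict(item.split(':', 1) for item in rest.split(','))

print(kv[' Key'].strip(), int(kv[' size'].strip()))
```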
>
>
>
>
> ----------------------------------------------------------------------------------------------------
>
>
>
> drop view if exists log_v_get_request;
>
> create view log_v_get_request
> as
> select      c1
>            ,ts
>            ,log_rec_level
>            ,c4
>            ,c5
>            ,tid
>            ,substr (txt,31) as path
> from        log_v
> where       txt like 'GET request received for path %'
> ;
>
> select * from log_v_get_request;
>
>
>
>
> ----------------------------------------------------------------------------------------------------
>
>
>
> drop view if exists log_v_unlock_request;
>
> create view log_v_unlock_request
> as
> select      c1
>            ,ts
>            ,log_rec_level
>            ,c4
>            ,c5
>            ,tid
>            ,regexp_extract (txt,'rowkey (\\S+)',1)      as rowkey
>            ,regexp_extract (txt,'lock id (\\S+)',1)     as lock_id
> from        log_v
> where       txt like 'Unlock request for schema DU %'
> ;
>
>
>
>
>
> *From:* Markovitz, Dudu [mailto:dmarkov...@paypal.com]
> *Sent:* Tuesday, June 21, 2016 2:26 PM
> *To:* user@hive.apache.org
> *Subject:* RE: RegexSerDe with Filters
>
>
>
> Hi
>
>
>
> I would suggest creating a single external table with daily partitions and
> multiple views, each with the appropriate filtering.
>
> If you send me a log sample (~100 rows), I’ll send you an example.
>
>
>
> Dudu
>
>
>
> *From:* Arun Patel [mailto:arunp.bigd...@gmail.com
> <arunp.bigd...@gmail.com>]
> *Sent:* Tuesday, June 21, 2016 1:51 AM
> *To:* user@hive.apache.org
> *Subject:* RegexSerDe with Filters
>
>
>
> Hello Hive Experts,
>
>
>
> I use Flume to ingest application-specific logs from syslog to HDFS.
> Currently, I grep the HDFS directory for specific patterns (for multiple
> types of requests) and then create reports.  However, generating weekly
> and monthly reports this way is not scalable.
>
>
>
> I would like to create multiple external tables on the daily HDFS
> directory, partitioned by date, with RegexSerDe, and then create separate
> Parquet tables for every kind of request.
>
>
>
> The question is: how do I create multiple (about 20) RegexSerDe tables on
> the same data, applying filters?  This would be just like the 20 grep
> commands I am running today.
>
>
>
> Example:  hadoop fs -cat /user/ffffprod/2016-06-20/* | grep 'STORE Request Received for APPXXXX' | awk '{print $4, $13, $14, $17, $20}'
>           hadoop fs -cat /user/ffffprod/2016-06-20/* | grep 'SCAN Request Received for APPYYYY' | awk '{print $4, $14, $19, $21, $22}'
>           hadoop fs -cat /user/ffffprod/2016-06-20/* | grep 'TOTAL TIME' | awk '{print $4, $24}'
>
>
>
> I would like to create tables that do this kind of job and then write the
> output to Parquet tables.
>
>
>
> Please let me know how this can be done.  Thank you!
>
>
>
> Regards,
>
> Arun
>
