Dudu, Thanks for the clarification. Looks like I have an issue with my Hive installation. I tried in a different cluster and it works.
Thanks again.

On Fri, Jun 24, 2016 at 4:59 PM, Markovitz, Dudu <dmarkov...@paypal.com> wrote:

This is tested, working code.

If you're using https://regex101.com, first replace the backslash pairs (\\) with a single backslash (\) and also use the 'g' modifier in order to find all of the matches.

The regular expression is:

(\S+)\s+([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}),([0-9]{3})\s+(\S+)\s+\[([^]]+)\]\s+(\S+)\s+:\s+(TID:\s\d+)?\s*(.*)

I'll send you a screenshot in private, since you don't want to expose the data.

Dudu

From: Arun Patel [mailto:arunp.bigd...@gmail.com]
Sent: Friday, June 24, 2016 9:33 PM
To: user@hive.apache.org
Subject: Re: RegexSerDe with Filters

Looks like the regex pattern is not working. I tested the pattern on https://regex101.com/ and it does not find any match.

Any suggestions?

On Thu, Jun 23, 2016 at 3:01 PM, Markovitz, Dudu <dmarkov...@paypal.com> wrote:

My pleasure.
Please feel free to reach me if needed.

Dudu

From: Arun Patel [mailto:arunp.bigd...@gmail.com]
Sent: Wednesday, June 22, 2016 2:57 AM
To: user@hive.apache.org
Subject: Re: RegexSerDe with Filters

Thank you very much, Dudu. This really helps.

On Tue, Jun 21, 2016 at 7:48 PM, Markovitz, Dudu <dmarkov...@paypal.com> wrote:

Hi

Here is the code (without the log data).

I've created some of the views using different text-processing techniques.
The rest of the views could be created in similar ways.

Dudu

bash
----------------------------------------------------------------------------------------------------

hdfs dfs -mkdir -p /tmp/log/20160621
hdfs dfs -put logfile.txt /tmp/log/20160621

hive
----------------------------------------------------------------------------------------------------

/*
    External table log

    Defines all common columns + optional column 'tid' which appears in most log records + the rest of the log ('txt')
*/

drop table if exists log;

create external table log
(
     c1             string
    ,ts             string
    ,ts_frac        string
    ,log_rec_level  string
    ,c4             string
    ,c5             string
    ,tid            string
    ,txt            string
)
partitioned by (dt date)
row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
with serdeproperties ('input.regex'='(\\S+)\\s+([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}),([0-9]{3})\\s+(\\S+)\\s+\\[([^]]+)\\]\\s+(\\S+)\\s+:\\s+(TID:\\s\\d+)?\\s*(.*)')
stored as textfile
location '/tmp/log'
;

alter table log add partition (dt=date '2016-06-21') location '/tmp/log/20160621';

select * from log;

----------------------------------------------------------------------------------------------------

/*
    View log_v

    Base view for all other views
*/

drop view if exists log_v;

create view log_v
as
select  c1
       ,cast (concat_ws ('.',ts,ts_frac) as timestamp)  as ts
       ,log_rec_level
       ,c4
       ,c5
       ,cast (ltrim(substr (tid,5)) as bigint)          as tid
       ,txt

from    log
;

select * from log_v;

----------------------------------------------------------------------------------------------------

drop view if exists log_v_reaping_path;

create view log_v_reaping_path
as
select  c1
       ,ts
       ,log_rec_level
       ,c4
       ,c5
       ,substr (txt,15) as reaping_path

from    log_v

where   txt like 'Reaping path: %'
;

select * from log_v_reaping_path;

----------------------------------------------------------------------------------------------------

drop view if exists log_v_published_to_kafka;

create view log_v_published_to_kafka
as
select  c1
       ,ts
       ,log_rec_level
       ,c4
       ,c5
       ,tid

       ,      ltrim (kv [' Key'])                   as key
       ,cast (ltrim (kv [' size'])      as bigint)  as size
       ,      ltrim (kv [' topic'])                 as topic
       ,cast (ltrim (kv [' partition']) as int)     as partition
       ,cast (ltrim (kv [' offset'])    as bigint)  as offset

from   (select  c1
               ,ts
               ,log_rec_level
               ,c4
               ,c5
               ,tid
               ,str_to_map (substr (txt ,locate ('.',txt)+1),',',':') as kv

        from    log_v

        where   txt like 'Published to Kafka. %'
       ) as t
;

select * from log_v_published_to_kafka;
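The published_to_kafka view relies on str_to_map: the inner query cuts off everything up to and including the first '.' in txt, then splits the remainder on ',' (between pairs) and on ':' (between key and value). Because the split keeps the space that follows each comma and colon, the map keys carry a leading space (' Key', ' size', ...), which is why the outer query looks them up that way and wraps the values in ltrim. A minimal illustration, using a made-up message text since the real log lines were not posted to the list:

-- Illustrative only; ' Key: k1, size: 100, topic: t1, partition: 0, offset: 42'
-- stands in for whatever follows 'Published to Kafka.' in the real log records.
select str_to_map(' Key: k1, size: 100, topic: t1, partition: 0, offset: 42', ',', ':');
-- returns something like {" Key":" k1", " size":" 100", " topic":" t1", " partition":" 0", " offset":" 42"}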
----------------------------------------------------------------------------------------------------

drop view if exists log_v_get_request;

create view log_v_get_request
as
select  c1
       ,ts
       ,log_rec_level
       ,c4
       ,c5
       ,tid
       ,substr (txt,31) as path

from    log_v

where   txt like 'GET request received for path %'
;

select * from log_v_get_request;

----------------------------------------------------------------------------------------------------

drop view if exists log_v_unlock_request;

create view log_v_unlock_request
as
select  c1
       ,ts
       ,log_rec_level
       ,c4
       ,c5
       ,tid
       ,regexp_extract (txt,'rowkey (\\S+)',1)  as rowkey
       ,regexp_extract (txt,'lock id (\\S+)',1) as lock_id

from    log_v

where   txt like 'Unlock request for schema DU %'
;
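Getting from one of these views to a Parquet table, as the original question (quoted further down) asks for, should be possible with a plain CTAS. A sketch; 'get_request_parquet' is only a placeholder name, not something from the thread:

-- Materialize one of the views as a Parquet table (placeholder table name).
drop table if exists get_request_parquet;

create table get_request_parquet
stored as parquet
as
select * from log_v_get_request;

For a recurring daily load, inserting into a table partitioned by dt would probably be the more natural shape, which would also mean exposing dt in the views.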
From: Markovitz, Dudu [mailto:dmarkov...@paypal.com]
Sent: Tuesday, June 21, 2016 2:26 PM
To: user@hive.apache.org
Subject: RE: RegexSerDe with Filters

Hi

I would suggest creating a single external table with daily partitions and multiple views, each with the appropriate filtering.
If you'll send me a log sample (~100 rows), I'll send you an example.

Dudu

From: Arun Patel [mailto:arunp.bigd...@gmail.com]
Sent: Tuesday, June 21, 2016 1:51 AM
To: user@hive.apache.org
Subject: RegexSerDe with Filters

Hello Hive Experts,

I use Flume to ingest application-specific logs from syslog to HDFS. Currently, I grep the HDFS directory for specific patterns (for multiple types of requests) and then create reports. However, generating weekly and monthly reports this way is not scalable.

I would like to create multiple external tables on the daily HDFS directory, partitioned by date, with RegexSerDe, and then create separate Parquet tables for every kind of request.

The question is: how do I create multiple (about 20) RegexSerDe tables on the same data while applying filters? This will be just like the 20 grep commands I am running today.

Example:

hadoop fs -cat /user/ffffprod/2016-06-20/* | grep 'STORE Request Received for APPXXXX' | awk '{print $4, $13, $14, $17, $20}'
hadoop fs -cat /user/ffffprod/2016-06-20/* | grep 'SCAN Request Received for APPYYYY' | awk '{print $4, $14, $19, $21, $22}'
hadoop fs -cat /user/ffffprod/2016-06-20/* | grep 'TOTAL TIME' | awk '{print $4, $24}'

I would like to create tables which do this kind of job and then write the output to Parquet tables.

Please let me know how this can be done. Thank you!

Regards,
Arun
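Under the single-table-plus-views approach suggested above, each of these grep/awk pipelines maps to one more filtered view over log_v, in the same style as the views earlier in the thread. A sketch for the 'TOTAL TIME' case; the LIKE pattern and the selected columns are only guesses, since mapping awk's $4 and $24 onto the parsed columns depends on the actual log layout:

-- Rough, hypothetical equivalent of:
--   hadoop fs -cat /user/ffffprod/2016-06-20/* | grep 'TOTAL TIME' | awk '{print $4, $24}'
-- The column list is illustrative only.
drop view if exists log_v_total_time;

create view log_v_total_time
as
select  c1
       ,ts
       ,log_rec_level
       ,c4
       ,c5
       ,tid
       ,txt

from    log_v

where   txt like '%TOTAL TIME%'
;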