Re: MatchPath UDF usage info ?

Furcy Pin Fri, 05 Sep 2014 07:50:12 -0700

Hi all,

I've just spent some time trying to understand how the 'regex' syntax
works for matchpath.
At first I thought it worked like a usual regex, but that was misleading:
it doesn't. (Perhaps Aster NPath does.)


The first thing it does (as I understand it) is collect the whole set of
rows belonging to the group declared with DISTRIBUTE BY,
sorted according to SORT BY.
It then iterates on that set. I like to think of it as a string (eg: "AABC")

The UDF will then try to match each suffix of the string (eg: "AABC",
"ABC", "BC", "C") one by one, and return one row for each match.

The matching iterates over the symbols of the pattern and, for each of them,
advances as far as it can in the string (there is no backtracking).


Here are some examples to help people understand how it works.

String  > Pattern = Matches
"AAB"  > "A.B" = "AB"
"AAB"  > "A+.B" = "AAB","AB"
"BB"    > "B.A*.B" = "BB"
"BAAB"  > "B.A*.B" = "BAAB"

The next example is trickier: let X be a symbol that is
always true:
"ABABA"  > "A.X*.A" = "ABABA", "ABA"
"ABABAB"  > "A.X*.A" = nothing

To understand what happens more deeply, let's number the letters
"ABABAB"  > "A.X*.A"
"123456"  > "7.8*.9"

The algorithm will proceed as follows:
Trying 123456:
1 (which is an A) is matched by symbol 7 (A)
2345 (BABA) is matched by symbol 8* (X*)
6 (B) is *not* matched by 9 (A)
duh.
Trying 23456:
2 (B) is not matched by symbol 7 (A)
duh.
Trying 3456:
3 (A) is matched by symbol 7 (A)
45 (BA) is matched by symbol 8* (X*)
6 (B) is *not* matched by 9 (A)
duh.
etc.
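
As a rough illustration, the behaviour traced above can be reproduced with a
small Python model. This is only a sketch of what is described in this message,
not the Hive code: the names parse, match_at, matchpath and SYMBOLS are made up,
and the rule that a starred symbol stops on the last row instead of consuming
it is an assumption read off the trace (X* takes 2345 and leaves 6). It has only
been checked against the examples in this message, so patterns that end with a
starred symbol may not be modelled faithfully.

```python
# Toy model of the matching behaviour described in this message.
# Not the Hive implementation; all names are hypothetical.

def parse(pattern, symbols):
    """Turn "A.X*.A" into a list of (predicate, quantifier) steps."""
    steps = []
    for token in pattern.split("."):
        quant = token[-1] if token[-1] in "*+" else ""
        name = token.rstrip("*+")
        steps.append((symbols[name], quant))
    return steps

def match_at(rows, start, steps):
    """Match the suffix rows[start:]; return the matched part or None."""
    pos = start
    last = len(rows) - 1
    for pred, quant in steps:
        if quant == "":
            # Plain symbol: must match exactly one row.
            if pos >= len(rows) or not pred(rows[pos]):
                return None
            pos += 1
        else:
            begin = pos
            # Assumption taken from the trace: a starred symbol advances as
            # far as it can, but stops on the last row without consuming it.
            while pos < last and pred(rows[pos]):
                pos += 1
            if quant == "+" and pos == begin:
                return None
            # No backtracking: the next symbol simply continues from pos.
    return rows[start:pos]

def matchpath(rows, pattern, symbols):
    """Try every suffix in turn and collect one result per match."""
    steps = parse(pattern, symbols)
    results = []
    for start in range(len(rows)):
        matched = match_at(rows, start, steps)
        if matched is not None:
            results.append(matched)
    return results

SYMBOLS = {
    "A": lambda row: row == "A",
    "B": lambda row: row == "B",
    "X": lambda row: True,        # always true
    "Z": lambda row: row != "A",  # anything but A
}

print(matchpath("AAB", "A+.B", SYMBOLS))       # ['AAB', 'AB']
print(matchpath("ABABA", "A.X*.A", SYMBOLS))   # ['ABABA', 'ABA']
print(matchpath("ABABAB", "A.X*.A", SYMBOLS))  # []
```

With a string standing in for the sorted rows, this gives the same results as
the examples listed earlier.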

So, if you want to match people with two events of type A with anything in
between (which the *classic* regex "A.*A" would match):
   - do not use the pattern "A.A", because it looks for two consecutive
     events;
   - do not use the pattern "A.X*.A" with X matching anything (too greedy:
     X* swallows the later A's and never gives them back);
   - you may use the pattern "A.Z*.A" with Z = not A, but matched against the
     string "ABABAB", it will only match ABA (non-greedy match) and not ABABA.
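
For reference, this is what a classic regex engine does with the same string.
The snippet uses Python's standard re module purely as a stand-in for "classic
regex"; the letters play the role of event types, as in the examples above.

```python
import re

events = "ABABAB"

# Greedy: .* first runs to the end of the string, then backtracks
# until the final A can be matched.
greedy = re.search(r"A.*A", events).group()

# Non-greedy: .*? stops at the first A that completes the match.
lazy = re.search(r"A.*?A", events).group()

# Rough counterpart of "A.Z*.A" with Z = not A.
not_a = re.search(r"A[^A]*A", events).group()

print(greedy, lazy, not_a)  # ABABA ABA ABA
```

The greedy form is the one that spans from the first A to the last A.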

I'm still looking for the pattern that does the same thing as the classic
regex "A.*A"
(for instance if you want to measure the duration between the first and the
last event of type A)

I believe greedy matching requires an automaton to be done efficiently,
which is why you can't greedy match correctly with the current MatchPath
implementation.
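
To show what is missing, here is a naive sketch of greedy matching with
backtracking over a list of rows. It is neither an automaton nor anything taken
from MatchPath: match_backtracking and the (predicate, quantifier) step format
are hypothetical, and a recursive version like this can revisit the same rows
many times, which is where the efficiency concern comes from.

```python
def match_backtracking(rows, pos, steps):
    """Return the end index of a greedy match of steps from pos, or None."""
    if not steps:
        return pos
    (pred, quant), rest = steps[0], steps[1:]
    if quant == "":
        # Plain symbol: consume exactly one matching row.
        if pos < len(rows) and pred(rows[pos]):
            return match_backtracking(rows, pos + 1, rest)
        return None
    # Quantified symbol: go as far as possible first...
    end = pos
    while end < len(rows) and pred(rows[end]):
        end += 1
    minimum = pos + 1 if quant == "+" else pos
    # ...then give rows back one at a time until the rest of the pattern fits.
    for stop in range(end, minimum - 1, -1):
        result = match_backtracking(rows, stop, rest)
        if result is not None:
            return result
    return None

is_a = lambda row: row == "A"
anything = lambda row: True

# Same shape as the pattern "A.X*.A" with X always true.
steps = [(is_a, ""), (anything, "*"), (is_a, "")]

rows = "ABABAB"
end = match_backtracking(rows, 0, steps)
print(rows[0:end])  # ABABA
```

On "ABABAB" this returns ABABA, i.e. the span from the first A to the last A,
which is the result the classic regex "A.*A" gives.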


Furcy


2014-09-04 2:32 GMT+02:00 Lefty Leverenz <leftylever...@gmail.com>:

> MatchPath.java still exists in Hive trunk and release 0.13.1
> (ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/MatchPath.java).
>
> -- Lefty
>
>
> On Wed, Sep 3, 2014 at 12:39 PM, Muhammad Asif Abbasi <
> asif.abb...@gmail.com> wrote:
>
>> Hi Furcy,
>>
>> Many thanks for your email :)
>>
>> My latest info was that the rename took place due to objections by
>> Teradata, but didn't know if they had actually requested to take it off the
>> distribution entirely.
>>
>> Does anybody else have an idea on the licensing aspect of this? What
>> exactly has Teradata patented? Is it the technique to parse the rows in
>> such a manner? Any tips/techniques would be highly appreciated.
>>
>> Regards,
>> Asif Abbasi
>>
>>
>>
>>
>> On Wed, Sep 3, 2014 at 5:30 PM, Furcy Pin <furcy....@flaminem.com> wrote:
>>
>>> Hi Muhammad,
>>>
>>> From what I googled a few months ago on the subject, the MatchPath UDF
>>> has been removed from Cloudera and Hortonworks releases because Teradata
>>> claims it violates one of their patents (apparently renaming it did not
>>> suffice).
>>>
>>> I guess that if you really need it, it might be possible to add it
>>> yourself as an external UDF since the code is still available out there,
>>> but I have no idea whether Teradata would have the right to come after
>>> you (or not?) if you do.
>>>
>>> By the way, if anyone has news on the current situation with MatchPath
>>> and Teradata, that would be welcome.
>>>
>>> Furcy
>>>
>>>
>>>
>>>
>>> 2014-09-03 17:18 GMT+02:00 Muhammad Asif Abbasi <asif.abb...@gmail.com>:
>>>
>>>> Hi,
>>>>
>>>> Many thanks for sending these links. Looking forward to more
>>>> documentation around this.
>>>>
>>>> BTW, why does "hive-exec-0.13.0.2.1.1.0-385.jar" not have any class
>>>> files for the MatchPath UDF? Have they been moved out to a separate JAR
>>>> file?
>>>> I can see that "hive-exec-0.13.0.jar" has the appropriate class
>>>> files, and I have tried to use them. They work well with the demo data
>>>> set, but we certainly need more documentation around this.
>>>>
>>>> Regards,
>>>> Asif Abbasi
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Aug 26, 2014 at 6:42 AM, Lefty Leverenz <
>>>> leftylever...@gmail.com> wrote:
>>>>
>>>>> Thanks for pointing out that we still need documentation for this in
>>>>> the wiki.  (I've added a doc comment to HIVE-5087
>>>>> <https://issues.apache.org/jira/browse/HIVE-5087>.)  In the meantime,
>>>>> googling "Hive npath" turned up these sources of information:
>>>>>
>>>>>    - https://github.com/hbutani/SQLWindowing/wiki
>>>>>    - http://www.slideshare.net/Hadoop_Summit/analytical-queries-with-hive
>>>>>      (slides 20-21)
>>>>>    - http://www.justinjworkman.com/big-data/using-npath-with-apache-hive/
>>>>>
>>>>>
>>>>> -- Lefty
>>>>>
>>>>>
>>>>> On Mon, Aug 25, 2014 at 8:27 AM, Muhammad Asif Abbasi <
>>>>> asif.abb...@gmail.com> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I am trying to use MatchPath UDF (Previously called NPath). Does
>>>>>> anybody have a document around its syntax and usage?
>>>>>>
>>>>>> Regards,
>>>>>> Asif
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
