[ https://issues.apache.org/jira/browse/FLINK-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16688356#comment-16688356 ]
ASF GitHub Bot commented on FLINK-10625:
----------------------------------------

alpinegizmo commented on a change in pull request #7070: [FLINK-10625] Documentation for MATCH_RECOGNIZE clause
URL: https://github.com/apache/flink/pull/7070#discussion_r233905360


 ##########
 File path: docs/dev/table/streaming/match_recognize.md
 ##########
 @@ -0,0 +1,843 @@
+---
+title: 'Detecting Patterns in Tables'
+nav-parent_id: streaming_tableapi
+nav-title: 'Detecting Patterns'
+nav-pos: 5
+is_beta: true
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+It is a common use case to search for a set of event patterns, especially in the case of data streams. Flink
+comes with a [complex event processing (CEP) library]({{ site.baseurl }}/dev/libs/cep.html) which allows for pattern detection in event streams. Furthermore, Flink's
+SQL API provides a relational way of expressing queries with a large set of built-in functions and rule-based optimizations that can be used out of the box.
+
+In December 2016, the International Organization for Standardization (ISO) released a new version of the SQL standard which includes _Row Pattern Recognition in SQL_ ([ISO/IEC TR 19075-5:2016](https://standards.iso.org/ittf/PubliclyAvailableStandards/c065143_ISO_IEC_TR_19075-5_2016.zip)). This allows Flink to consolidate CEP and the SQL API using the `MATCH_RECOGNIZE` clause for complex event processing in SQL.
+
+A `MATCH_RECOGNIZE` clause enables the following tasks:
+* Logically partition and order the data that is used with the `PARTITION BY` and `ORDER BY` clauses.
+* Define patterns of rows to seek using the `PATTERN` clause. These patterns use a syntax similar to that of regular expressions.
+* Specify the logical conditions required to map a row to a row pattern variable in the `DEFINE` clause.
+* Define measures, which are expressions usable in other parts of the SQL query, in the `MEASURES` clause.
+
+The following example illustrates the syntax for basic pattern recognition:
+
+{% highlight sql %}
+SELECT T.aid, T.bid, T.cid
+FROM MyTable
+MATCH_RECOGNIZE (
+  PARTITION BY userid
+  ORDER BY proctime
+  MEASURES
+    A.id AS aid,
+    B.id AS bid,
+    C.id AS cid
+  PATTERN (A B C)
+  DEFINE
+    A AS name = 'a',
+    B AS name = 'b',
+    C AS name = 'c'
+) AS T
+{% endhighlight %}
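+For illustration, suppose `MyTable` contains the rows below (the column names `userid`, `proctime`, `id`, and `name` are the ones referenced by the query; the values are made up):
+
+{% highlight text %}
+userid  proctime  id  name
+======  ========  ==  ====
+1       10:00:00   1  'a'
+1       10:00:01   2  'b'
+1       10:00:02   3  'c'
+{% endhighlight %}
+
+The query would then produce a single row for the completed `A B C` match:
+
+{% highlight text %}
+aid  bid  cid
+===  ===  ===
+1    2    3
+{% endhighlight %}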
+This page will explain each keyword in more detail and will illustrate more complex examples.
+
+<span class="label label-danger">Attention</span> Flink's implementation of the `MATCH_RECOGNIZE` clause covers only a subset of the full pattern recognition standard. Only the features that are documented in the following sections are supported so far. Since the development is still in an early phase, please also take a look at the [known limitations](#known-limitations).
+
+* This will be replaced by the TOC
+{:toc}
+
+Introduction and Examples
+-------------------------
+
+### Installation Guide
+
+The pattern recognition feature uses Apache Flink's CEP library internally. In order to use the `MATCH_RECOGNIZE` clause,
+the library needs to be added as a dependency to your Maven project.
+
+{% highlight xml %}
+<dependency>
+  <groupId>org.apache.flink</groupId>
+  <artifactId>flink-cep{{ site.scala_version_suffix }}</artifactId>
+  <version>{{ site.version }}</version>
+</dependency>
+{% endhighlight %}
+
+Alternatively, you can add the dependency to the cluster classpath (see the [dependency section]({{ site.baseurl}}/projectsetup/dependencies.html) for more information).
+
+If you want to use the `MATCH_RECOGNIZE` clause in the [SQL Client]({{ site.baseurl}}/dev/table/sqlClient.html),
+you don't have to do anything as all the dependencies are included by default.
+
+### SQL Semantics
+
+Every `MATCH_RECOGNIZE` query consists of the following clauses:
+
+* [PARTITION BY](#partitioning) - defines the logical partitioning of the table; similar to a `GROUP BY` operation.
+* [ORDER BY](#order-of-events) - specifies how the incoming rows should be ordered; this is essential as patterns depend on an order.
+* [MEASURES](#define--measures) - defines the output of the clause; similar to a `SELECT` clause.
+* [ONE ROW PER MATCH](#output-mode) - output mode which defines how many rows per match should be produced.
+* [AFTER MATCH SKIP](#after-match-strategy) - specifies where the next match should start; this is also a way to control how many distinct matches a single event can belong to.
+* [PATTERN](#defining-pattern) - allows constructing patterns that will be searched for using a _regular expression_-like syntax.
+* [DEFINE](#define--measures) - defines the conditions that a row must meet in order to be mapped to the corresponding pattern variable.
+
+<span class="label label-danger">Attention</span> Currently, the `MATCH_RECOGNIZE` clause can only be applied to an [append table](dynamic_tables.html#update-and-append-queries). Furthermore, it always produces
+an append table as well.
+
+### Examples
+
+For our examples, we assume that a table `Ticker` has been registered. The table contains prices of stocks at a particular point in time. Each row represents an updated characteristic of the ticker.
+
+The table has the following schema:
+
+{% highlight text %}
+Ticker
+     |-- symbol: String                           # symbol of the stock
+     |-- price: Long                              # price of the stock
+     |-- tax: Long                                # tax liability of the stock
+     |-- rowtime: TimeIndicatorTypeInfo(rowtime)  # point in time when the change to those values happened
+{% endhighlight %}
+
+For simplicity, we only consider the incoming data for a single stock `ACME`. A ticker could look similar to the following table, where rows are continuously appended.
+
+{% highlight text %}
+symbol  rowtime               price   tax
+======  ====================  ======  ======
+'ACME'  '01-Apr-11 10:00:00'  12      1
+'ACME'  '01-Apr-11 10:00:01'  17      2
+'ACME'  '01-Apr-11 10:00:02'  19      1
+'ACME'  '01-Apr-11 10:00:03'  21      3
+'ACME'  '01-Apr-11 10:00:04'  25      2
+'ACME'  '01-Apr-11 10:00:05'  18      1
+'ACME'  '01-Apr-11 10:00:06'  15      1
+'ACME'  '01-Apr-11 10:00:07'  14      2
+'ACME'  '01-Apr-11 10:00:08'  24      2
+'ACME'  '01-Apr-11 10:00:09'  25      2
+'ACME'  '01-Apr-11 10:00:10'  19      1
+{% endhighlight %}
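+The following is a minimal sketch of one way such a table could be registered with the Java Table API; the source elements, timestamps, and variable names are made up for illustration:
+
+{% highlight java %}
+import org.apache.flink.api.java.tuple.Tuple4;
+import org.apache.flink.streaming.api.TimeCharacteristic;
+import org.apache.flink.streaming.api.datastream.DataStream;
+import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
+import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
+import org.apache.flink.table.api.Table;
+import org.apache.flink.table.api.TableEnvironment;
+import org.apache.flink.table.api.java.StreamTableEnvironment;
+
+StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
+env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
+StreamTableEnvironment tableEnv = TableEnvironment.getTableEnvironment(env);
+
+// Hypothetical source with (symbol, price, tax, timestampMillis) elements.
+DataStream<Tuple4<String, Long, Long, Long>> input = env
+    .fromElements(
+        Tuple4.of("ACME", 12L, 1L, 1000L),
+        Tuple4.of("ACME", 17L, 2L, 2000L))
+    .assignTimestampsAndWatermarks(
+        new AscendingTimestampExtractor<Tuple4<String, Long, Long, Long>>() {
+            @Override
+            public long extractAscendingTimestamp(Tuple4<String, Long, Long, Long> e) {
+                return e.f3; // event time is taken from the fourth tuple field
+            }
+        });
+
+// The `.rowtime` suffix declares the fourth field as the event-time attribute.
+Table ticker = tableEnv.fromDataStream(input, "symbol, price, tax, rowtime.rowtime");
+tableEnv.registerTable("Ticker", ticker);
+{% endhighlight %}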
+The task is now to find periods of a constantly decreasing price of a single ticker. For this, one could write a query like:
+
+{% highlight sql %}
+SELECT *
+FROM Ticker
+MATCH_RECOGNIZE (
+    PARTITION BY symbol
+    ORDER BY rowtime
+    MEASURES
+        STRT_ROW.rowtime AS start_tstamp,
+        LAST(PRICE_DOWN.rowtime) AS bottom_tstamp,
+        LAST(PRICE_UP.rowtime) AS end_tstamp
+    ONE ROW PER MATCH
+    AFTER MATCH SKIP TO LAST PRICE_UP
+    PATTERN (STRT_ROW PRICE_DOWN+ PRICE_UP)
+    DEFINE
+        PRICE_DOWN AS
+            (LAST(PRICE_DOWN.price, 1) IS NULL AND PRICE_DOWN.price < STRT_ROW.price) OR
+                PRICE_DOWN.price < LAST(PRICE_DOWN.price, 1),
+        PRICE_UP AS
+            PRICE_UP.price > LAST(PRICE_DOWN.price, 1)
+) MR;
+{% endhighlight %}
+
+The query partitions the `Ticker` table by the `symbol` column and orders it by the `rowtime` time attribute.
+
+The `PATTERN` clause specifies that we are interested in a pattern with a starting event `STRT_ROW` that is followed by one or more `PRICE_DOWN` events and concluded with a `PRICE_UP` event. If such a pattern can be found, the search for the next pattern match resumes at the last `PRICE_UP` event, as indicated by the `AFTER MATCH SKIP TO LAST` clause.
+
+While the `STRT_ROW` pattern variable is only defined so that it can be referenced in other calculations, the `DEFINE` clause specifies the conditions that need to be met for a `PRICE_DOWN` and a `PRICE_UP` event.
+
+The pattern variable `PRICE_DOWN` is defined as a row with a price that is smaller than the price of the last row that met the `PRICE_DOWN` condition. For the initial case, i.e., when there is no last row that met the `PRICE_DOWN` condition, the price of the row must be smaller than the price of the preceding row in the pattern (referenced by `STRT_ROW`).
+
+The pattern variable `PRICE_UP` is defined as a row with a price that is larger than the price of the last row that met the `PRICE_DOWN` condition.
+
+The query produces a summary row for each period in which the price was constantly decreasing.
+
+The exact representation of the output rows is defined in the `MEASURES` part of the query. The number of output rows is defined by the `ONE ROW PER MATCH` output mode.
+
+{% highlight text %}
+ symbol       start_tstamp       bottom_tstamp         end_tstamp
+=========  ==================  ==================  ==================
+ACME       01-APR-11 10:00:04  01-APR-11 10:00:07  01-APR-11 10:00:08
+{% endhighlight %}
+
+The resulting row describes a period of a decreasing price for the ticker that started at `01-APR-11 10:00:04`, reached its lowest price at `01-APR-11 10:00:07`, and ended when the price increased again at `01-APR-11 10:00:08`.
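+To make the mapping of rows to pattern variables concrete, the following sketch (derived from the sample data above, not part of the query output) shows which rows form the single match:
+
+{% highlight text %}
+rowtime               price   mapped to
+====================  ======  ==========
+'01-Apr-11 10:00:04'  25      STRT_ROW
+'01-Apr-11 10:00:05'  18      PRICE_DOWN
+'01-Apr-11 10:00:06'  15      PRICE_DOWN
+'01-Apr-11 10:00:07'  14      PRICE_DOWN
+'01-Apr-11 10:00:08'  24      PRICE_UP
+{% endhighlight %}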
+
+Partitioning
+------------
+
+It is possible to look for patterns in partitioned data, e.g., trends for a single ticker or a particular user. This can be expressed using the `PARTITION BY` clause. The clause is similar to using `GROUP BY` for aggregations.
+
+<span class="label label-danger">Attention</span> It is highly advised to partition the incoming data because otherwise the `MATCH_RECOGNIZE` clause will be translated
+into a non-parallel operator to ensure global ordering.
+
+Order of Events
+---------------
+
+Apache Flink allows searching for patterns based on time, either [processing time or event time](time_attributes.html).
+
+In case of event time, this assumption is very important because it enables to sort the events before they are passed to the internal pattern state machine. As a consequence, the

Review comment:
   ```suggestion
   In case of event time, the events are sorted before they are passed to the internal pattern state machine.
   As a consequence, the
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add MATCH_RECOGNIZE documentation
> ---------------------------------
>
>                 Key: FLINK-10625
>                 URL: https://issues.apache.org/jira/browse/FLINK-10625
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Documentation, Table API & SQL
>    Affects Versions: 1.7.0
>            Reporter: Till Rohrmann
>            Assignee: Dawid Wysakowicz
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.7.0
>
> The newly added {{MATCH_RECOGNIZE}} functionality needs to be documented.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)