Re: [HACKERS] [PROPOSAL] Temporal query processing with range types

Peter Moser Wed, 01 Mar 2017 01:58:23 -0800

2017-02-15 20:24 GMT+01:00 Robert Haas <[email protected]>:
> There's no documentation in this patch.  I'm not sure you want to go
> to the trouble of writing SGML documentation until this has been
> reviewed enough that it has a real chance of getting committed, but on
> the other hand we're obviously all struggling to understand what it
> does, so I think if not SGML documentation it at least needs a real
> clear explanation of what the syntax is and does in a README or
> something, even just for initial review.


The attached README explains the NORMALIZE operation step-by-step with
an example. That is, we start from a query input, show how we rewrite
it during parser stage, and show how the final execution generates
result tuples. A similar walkthrough for ALIGN will follow soon.

We are thankful for any suggestion or ideas, to be used to write a
good SGML documentation.

Best regards,
Anton, Michael, Johann, Peter

================================================================================
                                    NORMALIZE
================================================================================

This is an exemplary walkthrough of a temporal NORMALIZE operation, from the
query input, over the query rewrite, until the result tuple outputs during
execution.



EXAMPLE
--------------------------------------------------------------------------------

Question: How many employees work in a department at each time point?

CREATE TABLE empl (name VARCHAR, dept VARCHAR, time INT4RANGE);
INSERT INTO empl VALUES
('Ann', 'DB', '[1,5)'),
('Ann', 'AI', '[3,8)'),
('Bea', 'DB', '[3,9)');


Timeline representation:

    NAME  DEPT      TIME
    Ann   DB        [1,5)       ----        <- Tuple 1, valid from 1 to 5 excl.
    Ann   AI        [3,8)         -----     <- Tuple 2
    Bea   DB        [3,9)         ------    <- Tuple 3
                                123456789   <- Timeline


NORMALIZE query:

    SELECT count(*), dept, time
    FROM (
        empl a NORMALIZE empl b USING(dept) WITH (a.time, b.time)
    ) c
    GROUP BY dept, time;


Result:

    COUNT  DEPT     TIME
    1      DB       [1,3)       --
    1      AI       [3,8)         -----
    2      DB       [3,5)         --
    1      DB       [5,9)           ----
                                123456789  <- Timeline



STEP-BY-STEP EXPLANATION
--------------------------------------------------------------------------------

After receiving the NORMALIZE query (see above) as input, we rewrite it into
the following query:

    SELECT a.*, P1
    FROM (
            SELECT *, row_id() OVER () rn FROM empl
    ) a
    LEFT OUTER JOIN (
            SELECT dept, LOWER(time) P1 FROM empl
            UNION
            SELECT dept, UPPER(time) P1 FROM empl
    ) b
    ON a.dept = b.dept AND P1 <@ a.time
    ORDER BY rn, P1;


Intermediate result: This is the input of our NORMALIZE execution node...
(See Appendix [1] for details)

     TUPID   name | dept | time  | rn | p1
            ------+------+-------+----+----
       A     Ann  | DB   | [1,5) |  1 |  1
       B     Ann  | DB   | [1,5) |  1 |  3
       C     Ann  | AI   | [3,8) |  2 |  3
       D     Bea  | DB   | [3,9) |  3 |  3
       E     Bea  | DB   | [3,9) |  3 |  5



We have three possibilities inside NORMALIZE during execution:
(See Appendix [2] for details)

    1) CURR == PREV
       a) S < CURR.P1   --> generate (CURR, [S, CURR.P1)) and set S = CURR.P1
       b) otherwise     --> fetch next tuple
          if fetched tuple is NULL --> goto(2) forcing CURR != PREV to be true
    2) CURR != PREV
       a) S < PREV.te   --> generate (PREV, [S, PREV.te))
       ... set S = CURR.ts and goto(1) forcing CURR == PREV to be true



Executor steps for our example query:

1) First call of ExecTemporalAdjustment: Fetch tuple A, which is now CURR.
   Set sweepline S = 1 (A.ts), and copy tuple A also into PREV. Goto(1)
   forcing CURR == PREV to be true.

2) (1): CURR == PREV, that is, tuple A is equal to itself.
   (1b): Fetch tuple B; CURR = B, PREV = A

3) (1): A == B
   (1a): S < B.P1 (1 < 3), hence we generate a result tuple:
        (Ann, DB, [1,3))
   ...where the interval is [S, CURR.P1) and all other attributes are
   copied from the current tuple (omitting helper columns: RN and P1)
   We update the sweepline: S = 3 (B.P1).

4) (1): A == B.
   (1b): Fetch tuple C; CURR = C, PREV = B

5) (2): B != C.
   (2a): S < B.te (3 < 5), hence we generate a result tuple:
        (Ann, DB, [3, 5))
   We update the sweepline: S = 3 (C.ts).

6) (1): B == C (forced).
   (1b): Fetch tuple D; CURR = D, PREV = C

7) (2): C != D.
   (2a): S < C.te (3 < 8), hence we generate a result tuple:
        (Ann, AI, [3,8))
   We update the sweepline: S = 3 (D.ts).

8) (1): C == D (forced).
   (1b): Fetch tuple E; CURR = E, PREV = D

9) (1): D == E.
   (1a): S < E.P1 (3 < 5), hence we generate a result tuple:
        (Bea, DB, [3, 5))
   We update the sweepline: S = 5 (E.P1).

10) (1): D == E.
    (1b): Fetch tuple gives NULL

11) (2): D != E (forced).
    (2a): S < E.te (5 < 9), hence we generate a result tuple:
        (Bea, DB, [5, 9))
    We update the sweepline: S = 3 (E.ts).

No new tuple can be fetched, nor is any new result tuple produced, hence we
terminate our executor.

The result of NORMALIZE is as follows:

     name | dept | time
    ------+------+-------
     Ann  | DB   | [1,3)
     Ann  | DB   | [3,5)
     Ann  | AI   | [3,8)
     Bea  | DB   | [3,5)
     Bea  | DB   | [5,9)

The initial goal was to aggregate over "dept" and "time", and output the count
of each. These can now be easily done from this normalized input with standard
(reused) PostgreSQL operators.


Final result is:

     count | dept | time
    -------+------+-------
     1     | DB   | [1,3)
     1     | AI   | [3,8)
     2     | DB   | [3,5)
     1     | DB   | [5,9)



Best regards,
Anton, Michael, Johann, Peter






--------------------------------------------------------------------------------
Appendix:

[1]
After the intermediate result we get an input, which is splitted into so-called
temporal groups. Each of these groups satisfy the USING-condition (a.dept =
b.dept), has overlapping periods, and a row number as ID (rn). The input is
ordered by temporal group ID, and the start and ending timepoints, i.e., P1.
Temporal normalizers do not make a distinction between start and end timepoints
while grouping, therefore we have only one timepoint attribute here (i.e., P1),
which is the union of start and end timepoints. Please note, that each tuple in
a certain rn ID is always equal to any other tuple in that rn ID.

[2]
Additinal information for NORMALIZE:
    - sweepline: to sweep from left-to-right on the time line (S)
    - slots to the current and previous tuple (CURR, PREV)
    - Two tuples A and B are equal, i.e. A == B, if A.rn == B.rn
    - ts means LOWER(time), and te means UPPER(time)

-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PROPOSAL] Temporal query processing with range types

Reply via email to