On Wed, May 15, 2019 at 8:51 PM Kenneth Knowles <[email protected]> wrote:
>
> On Wed, May 15, 2019 at 3:05 AM Robert Bradshaw <[email protected]> wrote:
>>
>> Isn't there an API for concisely computing new fields from old ones?
>> Perhaps these expressions could contain references to metadata value
>> such as timestamp. Otherwise,
>
> Even so, being able to refer to the timestamp implies its presence in a 
> namespace shared with other user-decided names.

I was thinking that functions may live in a different namespace than fields.
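For example (hypothetical syntax, since no such function exists today),
with separate namespaces a user column and a metadata accessor could
share a name without conflict:

```sql
-- Hypothetical: a user field named "element_timestamp" and a metadata
-- function of the same name coexist if fields and functions resolve in
-- separate namespaces.
SELECT element_timestamp,              -- the user's own column
       element_timestamp() AS meta_ts  -- the proposed metadata function
  FROM PCOLLECTION
```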

> And it may be nice for users to use that API within the composite 
> SqlTransform. I think there are a lot of options.
>
>> Rather than withMetadata reifying the value as a nested field, with
>> the timestamp, window, etc. at the top level, one could let it take a
>> field name argument that attaches all the metadata as an extra
>> (struct-like) field. This would be like attachX, but without having to
>> have a separate method for every X.
>
> If you leave the input field names at the top level, then any "attach" style 
> API requires choosing a name that doesn't conflict with input field names. 
> You can't write a generic transform that works with all inputs. I think it is 
> much simpler to move the input field all into a nested row/struct. Putting 
> all the metadata in a second nested row/struct is just as good as top-level, 
> perhaps. But moving the input into the struct/row is important.

Very good point about writing generic transforms. It does mean a lot
of editing if one later decides one wants to access the metadata
field(s) after the fact. (I also don't think we need to put the
metadata in a nested struct if the value is in one.)
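Concretely (assuming the derived schema proposed for withMetadata(),
where the input fields move under a nested "value" row), the editing
cost looks like this:

```sql
-- Without metadata: input fields are at the top level.
SELECT field1 FROM PCOLLECTION

-- With the hypothetical .withMetadata(): the same field now lives under
-- a nested "value" struct, so every field reference must be rewritten.
SELECT event_timestamp, value.field1 FROM PCOLLECTION
```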

>> It seems restrictive to only consider this as a special mode for
>> SqlTransform rather than a more generic operation. (For SQL, my first
>> instinct would be to just make this a special function like
>> element_timestamp(), but there is some ambiguity there when there are
>> multiple tables in the expression.)
>
> I would propose it as both: we already have some Reify transforms, and you 
> could make a general operation that does this small data preparation easily. 
> I think the proposal is just to add a convenience build method on 
> SqlTransform to include the underlying functionality as part of the 
> composite, which we really already have.
>
> I don't think we should extend SQL with built-in functions for 
> element_timestamp() and things like that, because SQL already has TIMESTAMP 
> columns and it is very natural to use SQL on unbounded relations where the 
> timestamp is just part of the data.

That's why I was suggesting a single element_metadata() rather than
exploding each one out.

Do you have a pointer to what the TIMESTAMP columns are? (I'm assuming
this is a special field, but distinct from the metadata timestamp?)
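What I had in mind was a single accessor, roughly (hypothetical syntax,
including the element_metadata() name):

```sql
-- One metadata accessor instead of a function per attribute. Which
-- table's metadata is meant in a multi-table query is the open question.
SELECT f_int,
       element_metadata().event_timestamp AS ts
  FROM PCOLLECTION
```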

>> On Wed, May 15, 2019 at 5:03 AM Reza Rokni <[email protected]> wrote:
>> >
>> > Hi,
>> >
>> > One use case would be when dealing with the windowing functions for 
>> > example:
>> >
>> > SELECT f_int, COUNT(*) , TUMBLE_START(f_timestamp, INTERVAL '1' HOUR) 
>> > tumble_start
>> >   FROM PCOLLECTION
>> >   GROUP BY
>> >     f_int,
>> >     TUMBLE(f_timestamp, INTERVAL '1' HOUR)
>> >
>> > For an element whose EventTime comes from metadata rather than from 
>> > data within the element itself, I would need to create a new schema 
>> > that adds the timestamp as a field. Another example that may be 
>> > interesting is getting the value of the row with the max/min 
>> > timestamp. None of this would be difficult, but it does feel a little 
>> > on the verbose side and also makes the pipeline a little harder to 
>> > read.
>> >
>> > Cheers
>> > Reza
>> >
>> > From: Kenneth Knowles <[email protected]>
>> > Date: Wed, 15 May 2019 at 01:15
>> > To: dev
>> >
>> >> We have support for nested rows so this should be easy. The .withMetadata 
>> >> would reify the struct, moving from Row to WindowedValue<Row> if I 
>> >> understand it...
>> >>
>> >> SqlTransform.query("SELECT field1 from PCOLLECTION"):
>> >>
>> >>     Schema = {
>> >>       field1: type1,
>> >>       field2: type2
>> >>     }
>> >>
>> >>     SqlTransform.query(...)
>> >>
>> >> SqlTransform.withMetadata().query("SELECT event_timestamp, value.field1 
>> >> FROM PCOLLECTION")
>> >>
>> >>     Derived schema = {
>> >>       event_timestamp: TIMESTAMP,
>> >>       pane_info: { ... }
>> >>       value: {
>> >>         field1: type1,
>> >>         field2: type2,
>> >>         ...
>> >>       }
>> >>     }
>> >>
>> >> SqlTransform would expand into a different composite, and it would be a 
>> >> straightforward ParDo to adjust the data, possibly automatic via the new 
>> >> schema conversions.
>> >>
>> >> Embedding the window would be a bit wonky, something like { 
>> >> end_of_window: TIMESTAMP, encoded_window: bytes } which would be 
>> >> expensive due to encoding. But timestamp and pane info not so bad.
>> >>
>> >> Kenn
>> >>
>> >> From: Anton Kedin <[email protected]>
>> >> Date: Tue, May 14, 2019 at 9:17 AM
>> >> To: <[email protected]>
>> >>
>> >>> Reza, can you share more thoughts on how you think this can work 
>> >>> end-to-end?
>> >>>
>> >>> Currently the approach is that populating the rows with the data happens 
>> >>> before the SqlTransform, and within the query you can only use the 
>> >>> things that are already in the rows or in the catalog/schema (or 
>> >>> built-in things). In general case populating the rows with any data can 
>> >>> be solved via a ParDo before SqlTransform. Do you think this approach 
>> >>> lacks something or maybe too verbose?
>> >>>
>> >>> My thoughts on this, lacking more info or concrete examples: in order to 
>> >>> access a timestamp value from within a query there has to be a syntax 
>> >>> for it. Field access expressions or function calls are the only things 
>> >>> that come to mind among existing syntax features that would allow that. 
>> >>> Making timestamp a field of the data row makes more sense to me here 
>> >>> because in Beam it is already a part of the element. It's not a result 
>> >>> of a function call and it's already easily accessible, doesn't make 
>> >>> sense to build extra functions here. One of the problems with both 
>> >>> approaches however is the potential conflicts with the existing schema 
>> >>> of the data elements (or the schema/catalog of the data source in 
>> >>> general). E.g. if we add a magical "event_timestamp" column or 
>> >>> "event_timestamp()" function there may potentially already exist a field 
>> >>> or a function in the schema with this name. This can be solved in a 
>> >>> couple of ways, but we will probably want to provide a configuration 
>> >>> mechanism to assign different field/function names in case of 
>> >>> conflicts.
>> >>>
>> >>> Given that, it may make sense to allow users to attach the whole pane 
>> >>> info or some subset of it to the row (e.g. only the timestamp), and make 
>> >>> that configurable. However I am not sure whether exposing something like 
>> >>> pane info is enough and will cover a lot of useful cases. Plus adding 
>> >>> methods like `attachTimestamp("fieldname")` or 
>> >>> `attachWindowInfo("fieldname")` might open a portal to ever-increasing 
>> >>> collection of these `attachX()`, `attachY()` that can make the API less 
>> >>> usable. If on the other hand we would make it more generic then it will 
>> >>> probably have to look a lot like a ParDo or MapElements.via() anyway. 
>> >>> And at that point the question would be whether it makes sense to build 
>> >>> something extra that probably looks and functions like an existing 
>> >>> feature.
>> >>>
>> >>> Regards,
>> >>> Anton
>> >>>
>> >>>
>> >>>
>> >>> From: Andrew Pilloud <[email protected]>
>> >>> Date: Tue, May 14, 2019 at 7:29 AM
>> >>> To: dev
>> >>>
>> >>>> Hi Reza,
>> >>>>
>> >>>> Where will this metadata be coming from? Beam SQL is tightly coupled 
>> >>>> with the schema of the PCollection, so adding fields not in the data 
>> >>>> would be difficult.
>> >>>>
>> >>>> If what you want is the timestamp out of the DoFn.ProcessContext we 
>> >>>> might be able to add a SQL function to fetch that.
>> >>>>
>> >>>> Andrew
>> >>>>
>> >>>> From: Reza Rokni <[email protected]>
>> >>>> Date: Tue, May 14, 2019, 1:08 AM
>> >>>> To: <[email protected]>
>> >>>>
>> >>>>> Hi,
>> >>>>>
>> >>>>> What are folks' thoughts about adding something like 
>> >>>>> SqlTransform.withMetadata().query(...) to enable users to access 
>> >>>>> things like timestamp information from within the query, without 
>> >>>>> having to reify the information into the element itself?
>> >>>>>
>> >>>>> Cheers
>> >>>>> Reza
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> --
>> >>>>>
>> >>>>> This email may be confidential and privileged. If you received this 
>> >>>>> communication by mistake, please don't forward it to anyone else, 
>> >>>>> please erase all copies and attachments, and please let me know that 
>> >>>>> it has gone to the wrong person.
>> >>>>>
>> >>>>> The above terms reflect a potential business arrangement, are provided 
>> >>>>> solely as a basis for further discussion, and are not intended to be 
>> >>>>> and do not constitute a legally binding obligation. No legally binding 
>> >>>>> obligations will be created, implied, or inferred until an agreement 
>> >>>>> in final form is executed in writing by all parties involved.
>> >
>> >
>> >
