On Wed, May 15, 2019 at 8:51 PM Kenneth Knowles <[email protected]> wrote: > > On Wed, May 15, 2019 at 3:05 AM Robert Bradshaw <[email protected]> wrote: >> >> Isn't there an API for concisely computing new fields from old ones? >> Perhaps these expressions could contain references to metadata value >> such as timestamp. Otherwise, > > Even so, being able to refer to the timestamp implies something about its > presence in a namespace, shared with other user-decided names.
I was thinking that functions may live in a different namespace than fields. > And it may be nice for users to use that API within the composite > SqlTransform. I think there are a lot of options. > >> Rather than withMetadata reifying the value as a nested field, with >> the timestamp, window, etc. at the top level, one could let it take a >> field name argument that attaches all the metadata as an extra >> (struct-like) field. This would be like attachX, but without having to >> have a separate method for every X. > > If you leave the input field names at the top level, then any "attach" style > API requires choosing a name that doesn't conflict with input field names. > You can't write a generic transform that works with all inputs. I think it is > much simpler to move the input field all into a nested row/struct. Putting > all the metadata in a second nested row/struct is just as good as top-level, > perhaps. But moving the input into the struct/row is important. Very good point about writing generic transforms. It does mean a lot of editing if one decides one wants to access the metadata field(s) after-the-fact. (I also don't think we need to put the metadata in a nested struct if the value is.) >> It seems restrictive to only consider this a a special mode for >> SqlTransform rather than a more generic operation. (For SQL, my first >> instinct would be to just make this a special function like >> element_timestamp(), but there is some ambiguity there when there are >> multiple tables in the expression.) > > I would propose it as both: we already have some Reify transforms, and you > could make a general operation that does this small data preparation easily. > I think the proposal is just to add a convenience build method on > SqlTransform to include the underlying functionality as part of the > composite, which we really already have. > > I don't think we should extend SQL with built-in functions for > element_timestamp() and things like that, because SQL already has TIMESTAMP > columns and it is very natural to use SQL on unbounded relations where the > timestamp is just part of the data. That's why I was suggesting a single element_metadata() rather than exploding each one out. Do you have a pointer to what the TIMESTAMP columns are? (I'm assuming this is a special field, but distinct from the metadata timestamp?) >> On Wed, May 15, 2019 at 5:03 AM Reza Rokni <[email protected]> wrote: >> > >> > Hi, >> > >> > One use case would be when dealing with the windowing functions for >> > example: >> > >> > SELECT f_int, COUNT(*) , TUMBLE_START(f_timestamp, INTERVAL '1' HOUR) >> > tumble_start >> > FROM PCOLLECTION >> > GROUP BY >> > f_int, >> > TUMBLE(f_timestamp, INTERVAL '1' HOUR) >> > >> > For an element which is using Metadata to inform the EvenTime of the >> > element, rather than data within the element itself, I would need to >> > create a new schema which added the timestamp as a field. I think other >> > examples which maybe interesting is getting the value of a row with the >> > max/min timestamp. None of this would be difficult but it does feel a >> > little on the verbose side and also makes the pipeline a little harder to >> > read. >> > >> > Cheers >> > Reza >> > >> > >> > >> > >> > >> > From: Kenneth Knowles <[email protected]> >> > Date: Wed, 15 May 2019 at 01:15 >> > To: dev >> > >> >> We have support for nested rows so this should be easy. The .withMetadata >> >> would reify the struct, moving from Row to WindowedValue<Row> if I >> >> understand it... >> >> >> >> SqlTransform.query("SELECT field1 from PCOLLECTION"): >> >> >> >> Schema = { >> >> field1: type1, >> >> field2: type2 >> >> } >> >> >> >> SqlTransform.query(...) >> >> >> >> SqlTransform.withMetadata().query("SELECT event_timestamp, value.field1 >> >> FROM PCOLLECTION") >> >> >> >> Derived schema = { >> >> event_timestamp: TIMESTAMP, >> >> pane_info: { ... } >> >> value: { >> >> field1: type1, >> >> field2: type2, >> >> ... >> >> } >> >> } >> >> >> >> SqlTransform would expand into a different composite, and it would be a >> >> straightforward ParDo to adjust the data, possibly automatic via the new >> >> schema conversions. >> >> >> >> Embedding the window would be a bit wonky, something like { >> >> end_of_window: TIMESTAMP, encoded_window: bytes } which would be >> >> expensive due to encoding. But timestamp and pane info not so bad. >> >> >> >> Kenn >> >> >> >> From: Anton Kedin <[email protected]> >> >> Date: Tue, May 14, 2019 at 9:17 AM >> >> To: <[email protected]> >> >> >> >>> Reza, can you share more thoughts on how you think this can work >> >>> end-to-end? >> >>> >> >>> Currently the approach is that populating the rows with the data happens >> >>> before the SqlTransform, and within the query you can only use the >> >>> things that are already in the rows or in the catalog/schema (or >> >>> built-in things). In general case populating the rows with any data can >> >>> be solved via a ParDo before SqlTransform. Do you think this approach >> >>> lacks something or maybe too verbose? >> >>> >> >>> My thoughts on this, lacking more info or concrete examples: in order to >> >>> access a timestamp value from within a query there has to be a syntax >> >>> for it. Field access expressions or function calls are the only things >> >>> that come to mind among existing syntax features that would allow that. >> >>> Making timestamp a field of the data row makes more sense to me here >> >>> because in Beam it is already a part of the element. It's not a result >> >>> of a function call and it's already easily accessible, doesn't make >> >>> sense to build extra functions here. One of the problems with both >> >>> approaches however is the potential conflicts with the existing schema >> >>> of the data elements (or the schema/catalog of the data source in >> >>> general). E.g. if we add a magical "event_timestamp" column or >> >>> "event_timestamp()" function there may potentially already exist a field >> >>> or a function in the schema with this name. This can be solved in couple >> >>> of ways, but we will probably want to provide a configuration mechanism >> >>> to assign a different field/function names in case of conflicts. >> >>> >> >>> Given that, it may make sense to allow users to attach the whole pane >> >>> info or some subset of it to the row (e.g. only the timestamp), and make >> >>> that configurable. However I am not sure whether exposing something like >> >>> pane info is enough and will cover a lot of useful cases. Plus adding >> >>> methods like `attachTimestamp("fieldname")` or >> >>> `attachWindowInfo("fieldname")` might open a portal to ever-increasing >> >>> collection of these `attachX()`, `attachY()` that can make the API less >> >>> usable. If on the other hand we would make it more generic then it will >> >>> probably have to look a lot like a ParDo or MapElements.via() anyway. >> >>> And at that point the question would be whether it makes sense to build >> >>> something extra that probably looks and functions like an existing >> >>> feature. >> >>> >> >>> Regards, >> >>> Anton >> >>> >> >>> >> >>> >> >>> From: Andrew Pilloud <[email protected]> >> >>> Date: Tue, May 14, 2019 at 7:29 AM >> >>> To: dev >> >>> >> >>>> Hi Reza, >> >>>> >> >>>> Where will this metadata be coming from? Beam SQL is tightly coupled >> >>>> with the schema of the PCollection, so adding fields not in the data >> >>>> would be difficult. >> >>>> >> >>>> If what you want is the timestamp out of the DoFn.ProcessContext we >> >>>> might be able to add a SQL function to fetch that. >> >>>> >> >>>> Andrew >> >>>> >> >>>> From: Reza Rokni <[email protected]> >> >>>> Date: Tue, May 14, 2019, 1:08 AM >> >>>> To: <[email protected]> >> >>>> >> >>>>> Hi, >> >>>>> >> >>>>> What are folks thoughts about adding something like >> >>>>> SqlTransform.withMetadata().query(...)to enable users to be able to >> >>>>> access things like Timestamp information from within the query without >> >>>>> having to refiy the information into the element itself? >> >>>>> >> >>>>> Cheers >> >>>>> Reza >> >>>>> >> >>>>> >> >>>>> >> >>>>> -- >> >>>>> >> >>>>> This email may be confidential and privileged. If you received this >> >>>>> communication by mistake, please don't forward it to anyone else, >> >>>>> please erase all copies and attachments, and please let me know that >> >>>>> it has gone to the wrong person. >> >>>>> >> >>>>> The above terms reflect a potential business arrangement, are provided >> >>>>> solely as a basis for further discussion, and are not intended to be >> >>>>> and do not constitute a legally binding obligation. No legally binding >> >>>>> obligations will be created, implied, or inferred until an agreement >> >>>>> in final form is executed in writing by all parties involved. >> > >> > >> > >> > -- >> > >> > This email may be confidential and privileged. If you received this >> > communication by mistake, please don't forward it to anyone else, please >> > erase all copies and attachments, and please let me know that it has gone >> > to the wrong person. >> > >> > The above terms reflect a potential business arrangement, are provided >> > solely as a basis for further discussion, and are not intended to be and >> > do not constitute a legally binding obligation. No legally binding >> > obligations will be created, implied, or inferred until an agreement in >> > final form is executed in writing by all parties involved.
