+1 to `ReifyWindowingInfo` (or maybe `ExtractWindowingInfo` or `GetWindowing`, which may be a little more understandable to the average user). I definitely prefer something that doesn't require extending the set of concepts/advanced usages we're exposing through YAML, especially for a feature that I think will not be heavily used (but if you need it, you need it).
As a rule, I think we should prefer a simple base language here, with higher-level capabilities available through transforms when possible. It will be a little more verbose, but more readable/searchable/learnable, and it will preserve the base simplicity for the bulk of use cases.

Thanks,
Danny

On Thu, Feb 20, 2025 at 3:21 PM Robert Bradshaw via dev <dev@beam.apache.org> wrote:

> Currently our YAML API supports basic streaming, including setting
> windowing for aggregations, but there's no way to retrieve the
> windowing/timestamp metadata (short of stepping out of YAML proper and
> using a Python, Java, etc. DoFn). It would probably be quite useful to
> have a more native way of getting this.
>
> One option would be to add a built-in transform to extract this
> information, e.g. something like
>
>     - type: ReifyWindowingInfo
>       config:
>         new_field1: timestamp
>         new_field2: window
>         new_field3: window.end
>         ...
>
> The possible values on the RHS of the map would be a fixed list;
> supporting things like window.end or pane_info.index would be desirable as
> their types are schema-compatible (unlike a raw Window or PaneInfo object).
> One could then use this information in downstream transforms.
>
> A second option would be to enhance MapToFields to make this information
> available. Currently this transform looks like
>
>     - type: MapToFields
>       config:
>         language: python  # java is also supported, javascript, etc. conceivable
>         fields:
>           output_field1: input_field + another_input_field
>           output_field2:
>             callable: |
>               def my_inline_function(row):
>                 return row.input_field + row.another_input_field
>           ...
>
> The first case, called the "expression" case, is syntactic sugar that
> roughly reifies all[1] input fields as locals and translates to the second.
>
> For the second case, one could treat this similarly to the process method
> of a DoFn and allow additional annotated arguments (e.g.
> DoFn.TimestampParam in Python, @Timestamp annotation for Java). We would
> detect and propagate this up to the generated DoFn.
>
> We could consider supporting the "expression" case via some magic
> variables (or a special namespace) or require the second form for this
> capability.
>
> We could, of course, offer both options as well.
>
> Anyone have any opinions or other ideas here?
>
> - Robert
>
> [1] As an optimization we only capture those locals that appear textually
> in the body of the expression.
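
For anyone following along, here is a rough sketch of what the "expression" sugar and footnote [1] describe: turning an expression string into a callable on a row while capturing only the fields the expression actually mentions. This is not Beam's implementation. The helper names `referenced_fields` and `expression_to_callable` are made up for illustration, the namedtuple stands in for a Beam row, and the sketch uses Python's `ast` module to find identifiers rather than a plain textual search.

```python
import ast
from collections import namedtuple


def referenced_fields(expression, field_names):
    """Return the input fields that appear as identifiers in the expression."""
    tree = ast.parse(expression, mode="eval")
    names = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    # Keep schema order so the result is deterministic.
    return [name for name in field_names if name in names]


def expression_to_callable(expression, field_names):
    """Translate an "expression" field into a callable that takes a row."""
    used = referenced_fields(expression, field_names)
    code = compile(expression, "<expression>", "eval")

    def fn(row):
        # Expose only the referenced fields as names visible to the expression.
        namespace = {name: getattr(row, name) for name in used}
        return eval(code, namespace)

    return fn


# Hypothetical row type standing in for a schema'd Beam row.
Row = namedtuple("Row", ["input_field", "another_input_field", "unused_field"])

add_fields = expression_to_callable(
    "input_field + another_input_field", Row._fields
)
print(add_fields(Row(1, 2, "ignored")))  # prints 3
```

Under this reading, supporting windowing metadata in the expression case would amount to adding a few reserved names (or a namespace) to what gets exposed to the expression, which is where the "magic variables" question comes from.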