+1 to ExtractWindowingInfo

On Fri, Feb 21, 2025 at 10:55 AM Danny McCormick via dev <
dev@beam.apache.org> wrote:

> +1 to `ReifyWindowingInfo` (or maybe `ExtractWindowingInfo` or
> `GetWindowing` is a little more understandable to the average user). I
> definitely prefer something which doesn't require extending the set of
> concepts/advanced usages we're exposing through Yaml, especially for a
> feature that I think will not be heavily used (but if you need it, you need
> it).
>
> As a rule, I think we should prefer a simple base language here with
> higher level capabilities available through transforms when possible. It
> will be a little more verbose, but more readable/searchable/learnable, and
> it will preserve the base simplicity for the bulk of use cases.
>
> Thanks,
> Danny
>
> On Thu, Feb 20, 2025 at 3:21 PM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
>
>> Currently our YAML API supports basic streaming, including setting
>> windowing for aggregations, but there's no way to retrieve the
>> windowing/timestamp metadata (short of stepping out of YAML proper and
>> using Python, Java, etc. DoFn). It would probably be quite useful to have a
>> more native way of getting this.
>>
>> One option would be to add a built-in transform to extract this
>> information, e.g. something like
>>
>> - type: ReifyWindowingInfo
>>   config:
>>     new_field1: timestamp
>>     new_field2: window
>>     new_field3: window.end
>>     ...
>>
>> The possible values on the RHS of the map would be a fixed list;
>> supporting things like window.end or pane_info.index would be desirable as
>> their types are schema-compatible (unlike a raw Window or PaneInfo object).
>> One could then use this information in downstream transforms.
>>
>> A second option would be to enhance MapToFields to make this information
>> available. Currently this transform looks like
>>
>> - type: MapToFields
>>   config:
>>     language: python  # java is also supported, javascript, etc.
>> conceivable
>>     fields:
>>       output_field1: input_field + another_input_field
>>       output_field2:
>>         callable: |
>>             def my_inline_function(row):
>>                row.input_field + another_input_field
>>         ...
>>
>> The first case, called the "expression" case, is syntactic sugar that
>> roughly reifies all[1] input fields as locals and translates to the second.
>>
>> For the second case, one could treat this similar to the process method
>> of a DoFn and allow additional annotated arguments (e.g.
>> ParDo.TimestampParam in Python, @Timestamp annotation for Java). We would
>> detect and propagate this up to the generated DoFn.
>>
>> We could consider supporting the "expression" case via some magic
>> variables (or a special namespace) or require the second form for this
>> capability.
>>
>> We could, of course, offer both options as well.
>>
>> Anyone have any opinions or other ideas here?
>>
>> - Robert
>>
>>
>>
>> [1] As an optimization we only capture those locals that appear textually
>> in the body of the expression.
>>
>

Reply via email to