Re: [YAML] Reprocessing failed records

XQ Hu via dev Sat, 19 Oct 2024 08:01:34 -0700

I probably missed something. Tried this toy example:

pipeline:
  transforms:
    - type: Create
      config:
        elements: [1, 2, 3, 4, -1]
    - type: MapToFields
      input: Create
      name: MapToFields_1
      config:
        language: python
        fields:
          element:
            callable: |
              import math
              def process_num(row):
                return math.sqrt(row.element)
        error_handling:
          output: my_error_output
    - type: LogForTesting
      input: MapToFields_1
    - type: MapToFields
      input: MapToFields_1.my_error_output
      name: MapToFields_2
      config:
        language: python
        fields:
          element:
            callable: |
              # return the raw element
              def process_error_row(row):
                return row.element[0]
    - type: LogForTesting
      input: MapToFields_2


It looks like MapToFields is already enough to extract any information
returned by error_handling, including the original element.
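
As a sketch of that point, the same error output could also be mapped to a
flat row that keeps the original value next to the error details. This
assumes the error rows carry `msg` and `stack` fields alongside `element`,
which is not shown in the example above; the transform name and the output
field names are made up for illustration.

```yaml
    - type: MapToFields
      input: MapToFields_1.my_error_output
      name: MapToFields_errors  # hypothetical name
      config:
        language: python
        fields:
          # Original value, extracted the same way as in MapToFields_2.
          original:
            callable: |
              def original_value(row):
                return row.element[0]
          # Assumed metadata fields on the error row.
          error_message: msg
          stack_trace: stack
```

This still requires naming each output field, so it does not remove the
need to know the original fields when the failed element has more than one.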

On Sat, Oct 19, 2024 at 2:55 AM Ahmed Abualsaud via dev <dev@beam.apache.org>
wrote:

> Another option is to add a second DLQ that outputs just the original rows,
> i.e. the user has the option to fetch failed rows with or without metadata.
> It would take some work on our side to add this second DLQ to
> existing transforms, but that seems pretty straightforward.
>
> On Sat, Oct 19, 2024 at 1:03 AM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
>
>> I came across an interesting user report at
>> https://github.com/apache/beam/issues/32866 which made me realize that,
>> while providing metadata about a bad element in the "bad records" output
>> is useful, we don't make it easy to extract the output into a PCollection
>> of the original elements. The output schema contains the original
>> element as well as metadata about what error occurred, and in an
>> ordinary Beam pipeline one could easily apply a Map(lambda error_row:
>> error_row.element) but YAML doesn't have Map, just MapToFields
>> (primarily to be more schema friendly).
>>
>> There are a couple of options:
>>
>> (0) Leave things as they are. One can write
>>
>> type: MapToFields
>> config:
>>   fields:
>>     fld1: element.fld1
>>     fld2: element.fld2
>>     ...
>>
>>
>> This is of course a bit ugly as one needs to enumerate (and know) the
>> set of original fields.
>>
>> (1a) Provide a special operation "Unnest" that takes a single field
>> and emits it as the top-level element. This can of course result in
>> unschema'd PCollections (which are supported, but generally don't play
>> as well with the other operations, including xlang ones).
>>
>> (1b) Just provide a Map. This is a generalization of 1a, but on the
>> other hand would be more prone to abuse.
>>
>> (1c) We could name this
>>
>> type: MapToFields
>> config:
>>   fields:
>>     *: element
>>
>> IIRC, we already have the special case of "*" in our join syntax, and
>> we could re-use a bunch of the MapToFields infrastructure. But maybe
>> it's too obscure?
>>
>> (2) Add an optional argument to error_handling to omit the metadata.
>> This would require a bit of a hack to support ubiquitously, and
>> wouldn't solve the more general problem.
>>
>> Maybe there are some other ideas as well?
>>
>
