Re: [YAML] Reprocessing failed records

Danny McCormick via dev Tue, 22 Oct 2024 11:46:31 -0700

> (1a) Provide a special operation "Unnest" that takes a single field
> and emits it as the top-level element. This can of course result in
> unschema'd PCollections (which are supported, but generally don't play
> as well with the other operations, including xlang ones).


I like this the most out of the options - why does it have to be unschema'd
though? Couldn't we retain that information from previous steps? If not, I
don't see a way around losing schema info.

On Tue, Oct 22, 2024 at 1:50 PM Robert Bradshaw via dev <[email protected]>
wrote:

> On Sat, Oct 19, 2024 at 8:01 AM XQ Hu via dev <[email protected]> wrote:
> >
> > I probably missed something. Tried this toy example:
> >
> > pipeline:
> >   transforms:
> >     - type: Create
> >       config:
> >         elements: [1, 2, 3, 4, -1]
> >     - type: MapToFields
> >       input: Create
> >       name: MapToFields_1
> >       config:
> >         language: python
> >         fields:
> >           element:
> >             callable: |
> >               import math
> >               def process_num(row):
> >                 return math.sqrt(row.element)
> >         error_handling:
> >           output: my_error_output
> >     - type: LogForTesting
> >       input: MapToFields_1
> >     - type: MapToFields
> >       input: MapToFields_1.my_error_output
> >       name: MapToFields_2
> >       config:
> >         language: python
> >         fields:
> >           element:
> >             callable: |
> >               # return the raw element
> >               def process_error_row(row):
> >                 return row.element[0]
> >     - type: LogForTesting
> >       input: MapToFields_2
> >
> > It looks like MapToFields is good enough to get any information returned
> by error_handling.
>
> Yes, it's possible, but your MapToFields_2 needs to know (and
> reproduce) the structure of the original element to reconstruct it.
>
> > On Sat, Oct 19, 2024 at 2:55 AM Ahmed Abualsaud via dev <
> [email protected]> wrote:
> >>
> >> Another option is to add a second DLQ that outputs just the original
> rows, i.e. the user has the option to fetch failed rows with or without
> metadata.
> >> It would take some work on our side to add this second DLQ to existing
> transforms, but that seems pretty straightforward.
>
> Yeah. I would prefer to do it in such a way that one didn't have to
> modify all existing (and future) transforms. Another downside is that
> having two error outputs doesn't play as nicely with error handlers
> (
> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/transforms/errorhandling/ErrorHandler.html
> ).
>
> Yet another option would be to add a yaml StripErrorMetadata
> transform, as this is the place where it's not convenient to just do a
> map.
>
> >> On Sat, Oct 19, 2024 at 1:03 AM Robert Bradshaw via dev <
> [email protected]> wrote:
> >>>
> >>> I came across an interesting user report at
> >>> https://github.com/apache/beam/issues/32866 which made me realize that
> >>> providing metadata about a bad element in the "bad records" output is
> >>> useful, we don't make it easy to extract the output into a PCollection
> >>> of the original elements. The output schema contains the original
> >>> element as well as metadata about what error occurred, and in an
> >>> ordinary Beam pipeline one could easily apply a Map(lambda error_row:
> >>> error_row.element) but YAML doesn't have Map, just MapToFields
> >>> (primarily to be more schema friendly).
> >>>
> >>> There are a couple of options:
> >>>
> >>> (0) Leave things as they are. One can write
> >>>
> >>> type: MapToFields
> >>> config:
> >>>   fields:
> >>>     fld1: element.fld1
> >>>     fld2: element.fld2
> >>>     ...
> >>>
> >>>
> >>> This is of course a bit ugly as one needs to enumerate (and know) the
> >>> set of original fields.
> >>>
> >>> (1a) Provide a special operation "Unnest" that takes a single field
> >>> and emits it as the top-level element. This can of course result in
> >>> unschema'd PCollections (which are supported, but generally don't play
> >>> as well with the other operations, including xlang ones).
> >>>
> >>> (1b) Just provide a Map. This is a generalization of 1a, but on the
> >>> other hand would be more prone to abuse.
> >>>
> >>> (1c) We could name this
> >>>
> >>> type: MapToFields
> >>> config:
> >>>   fields:
> >>>     *: element
> >>>
> >>> IIRC, we already have the special case of "*" in our join syntax, and
> >>> we could re-use a bunch of the MapToFields infrastructure. But maybe
> >>> it's too obscure?
> >>>
> >>> (2) Add an optional argument to error_handling to omit the metadata.
> >>> This would require a bit of a hack to support ubiquitously, and
> >>> wouldn't solve the more general problem.
> >>>
> >>> Maybe there are some other ideas as well?
>

Re: [YAML] Reprocessing failed records

Reply via email to