[YAML] Reprocessing failed records

Robert Bradshaw via dev Fri, 18 Oct 2024 15:04:11 -0700

I came across an interesting user report at
https://github.com/apache/beam/issues/32866 which made me realize that,
while providing metadata about a bad element in the "bad records"
output is useful, we don't make it easy to extract that output into a
PCollection of the original elements. The output schema contains the
original element as well as metadata about what error occurred, and in
an ordinary Beam pipeline one could easily apply a Map(lambda
error_row: error_row.element), but YAML doesn't have Map, just
MapToFields (primarily to be more schema friendly).
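To make the desired transformation concrete, here is a minimal plain-Python stand-in (no Beam required) for what Map(lambda error_row: error_row.element) would do. The names Record, ErrorRow, msg, and unnest_element are hypothetical and chosen only for illustration; the real error output schema may carry different metadata fields.

```python
from typing import NamedTuple


class Record(NamedTuple):
    # Hypothetical original element with two fields.
    fld1: int
    fld2: str


class ErrorRow(NamedTuple):
    # Stand-in for a "bad records" row: the original element plus
    # metadata. The metadata field name "msg" is an assumption.
    element: Record
    msg: str


def unnest_element(error_rows):
    # Plain-Python equivalent of Map(lambda error_row: error_row.element):
    # drop the metadata and keep only the original element.
    return [error_row.element for error_row in error_rows]


bad_records = [
    ErrorRow(element=Record(fld1=1, fld2="a"), msg="division by zero"),
    ErrorRow(element=Record(fld1=2, fld2="b"), msg="missing key"),
]
originals = unnest_element(bad_records)
```

Note that nothing here needs to know the field names of the original element, which is exactly what the MapToFields workaround cannot avoid.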


There are a couple of options:

(0) Leave things as they are. One can write

type: MapToFields
config:
  fields:
    fld1: element.fld1
    fld2: element.fld2
    ...


This is of course a bit ugly as one needs to enumerate (and know) the
set of original fields.

(1a) Provide a special operation "Unnest" that takes a single field
and emits it as the top-level element. This can of course result in
unschema'd PCollections (which are supported, but generally don't play
as well with the other operations, including xlang ones).

(1b) Just provide a Map. This is a generalization of 1a, but on the
other hand would be more prone to abuse.

(1c) We could spell this as

type: MapToFields
config:
  fields:
    *: element

IIRC, we already have the special case of "*" in our join syntax, and
we could re-use a bunch of the MapToFields infrastructure. But maybe
it's too obscure?
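One possible reading of the "*" spelling, sketched in plain Python over dict-shaped rows: a "*" key splats the fields of the referenced field into the top level, while any other key behaves as an ordinary renamed field. This is only an assumed semantics for the proposal, not existing MapToFields behavior; the helper map_to_fields is hypothetical, and it takes bare field names rather than the expressions that the real transform accepts.

```python
def map_to_fields(row, fields):
    # Hypothetical helper: `fields` maps output names to input field names.
    out = {}
    for name, source in fields.items():
        value = row[source]
        if name == "*":
            # Assumed "*" semantics: promote every sub-field of the
            # referenced field to a top-level output field.
            out.update(value)
        else:
            out[name] = value
    return out


error_row = {"element": {"fld1": 1, "fld2": "a"}, "msg": "boom"}

# Only the original element, with its schema preserved.
flattened = map_to_fields(error_row, {"*": "element"})

# The splat can also be combined with explicitly named fields.
annotated = map_to_fields(error_row, {"*": "element", "error": "msg"})
```

Under this reading the output stays schema'd, unlike a general Map or an Unnest of a non-row field.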

(2) Add an optional argument to error_handling to omit the metadata.
This would require a bit of a hack to support ubiquitously, and
wouldn't solve the more general problem.

Maybe there are some other ideas as well?
