Re: [YAML] Reprocessing failed records

Robert Bradshaw via dev Tue, 22 Oct 2024 12:38:05 -0700

On Tue, Oct 22, 2024 at 11:46 AM Danny McCormick
<dannymccorm...@google.com> wrote:
>
> > (1a) Provide a special operation "Unnest" that takes a single field
> > and emits it as the top-level element. This can of course result in
> > unschema'd PCollections (which are supported, but generally don't play
> > as well with the other operations, including xlang ones).
>
> I like this the most out of the options - why does it have to be unschema'd 
> though? Couldn't we retain that information from previous steps? If not, I 
> don't see a way around losing schema info.


Yes, if the unnested element itself is schema'd, that is preserved. If
it's, say, an int, it will be a bare PCollection of ints. (Which isn't
the end of the world...)

Naming is also still TBD. I just realized that unnest has the meaning
of iteration/flatten in some SQL dialects. For our dynamic
destinations we chose the keyword "only" to indicate that we want to
only write a specified field (as a top level record) rather than the
entire record.
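
To make the schema point concrete, here is a minimal plain-Python sketch (not Beam code). The names Order, ErrorRow, and extract_only are made up for illustration, NamedTuple merely stands in for a schema'd row, and the single msg field is an assumed placeholder for the error metadata.

```python
from typing import Any, Iterable, Iterator, NamedTuple


class Order(NamedTuple):
    # Hypothetical original element that has a schema (named fields).
    order_id: int
    amount: float


class ErrorRow(NamedTuple):
    # Hypothetical stand-in for an error-output row: the original
    # element plus (assumed) metadata about the failure.
    element: Any
    msg: str


def extract_only(rows: Iterable[Any], field: str) -> Iterator[Any]:
    """Emit just the named field of each row as the top-level element."""
    for row in rows:
        yield getattr(row, field)


# Case 1: the nested element is itself a row, so its fields survive.
schemad = list(extract_only([ErrorRow(Order(1, 9.5), "bad amount")], "element"))

# Case 2: the nested element is an int, so the result is a bare int.
bare = list(extract_only([ErrorRow(-1, "math domain error")], "element"))

print(schemad)
print(bare)
```

In the first case the output elements still carry their field names; in the second there are no fields left to name, which is the unschema'd situation discussed above.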

> On Tue, Oct 22, 2024 at 1:50 PM Robert Bradshaw via dev <dev@beam.apache.org> 
> wrote:
>>
>> On Sat, Oct 19, 2024 at 8:01 AM XQ Hu via dev <dev@beam.apache.org> wrote:
>> >
>> > I probably missed something. Tried this toy example:
>> >
>> > pipeline:
>> >   transforms:
>> >     - type: Create
>> >       config:
>> >         elements: [1, 2, 3, 4, -1]
>> >     - type: MapToFields
>> >       input: Create
>> >       name: MapToFields_1
>> >       config:
>> >         language: python
>> >         fields:
>> >           element:
>> >             callable: |
>> >               import math
>> >               def process_num(row):
>> >                 return math.sqrt(row.element)
>> >         error_handling:
>> >           output: my_error_output
>> >     - type: LogForTesting
>> >       input: MapToFields_1
>> >     - type: MapToFields
>> >       input: MapToFields_1.my_error_output
>> >       name: MapToFields_2
>> >       config:
>> >         language: python
>> >         fields:
>> >           element:
>> >             callable: |
>> >               # return the raw element
>> >               def process_error_row(row):
>> >                 return row.element[0]
>> >     - type: LogForTesting
>> >       input: MapToFields_2
>> >
>> > It looks like MapToFields is good enough to get any information returned 
>> > by error_handling.
>>
>> Yes, it's possible, but your MapToFields_2 needs to know (and
>> reproduce) the structure of the original element to reconstruct it.
>>
>> > On Sat, Oct 19, 2024 at 2:55 AM Ahmed Abualsaud via dev 
>> > <dev@beam.apache.org> wrote:
>> >>
>> >> Another option is to add a second DLQ that outputs just the original 
>> >> rows, i.e. the user has the option to fetch failed rows with or without 
>> >> metadata.
>> >> It would take some work on our side to add this second DLQ to existing 
>> >> transforms, but that seems pretty straightforward.
>>
>> Yeah. I would prefer to do it in such a way that one didn't have to
>> modify all existing (and future) transforms. Another downside is that
>> having two error outputs doesn't play as nicely with error handlers
>> (https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/transforms/errorhandling/ErrorHandler.html
>> ).
>>
>> Yet another option would be to add a yaml StripErrorMetadata
>> transform, as this is the place where it's not convenient to just do a
>> map.
>>
>> >> On Sat, Oct 19, 2024 at 1:03 AM Robert Bradshaw via dev 
>> >> <dev@beam.apache.org> wrote:
>> >>>
>> >>> I came across an interesting user report at
>> >>> https://github.com/apache/beam/issues/32866 which made me realize that
>> >>> while providing metadata about a bad element in the "bad records" output is
>> >>> useful, we don't make it easy to extract the output into a PCollection
>> >>> of the original elements. The output schema contains the original
>> >>> element as well as metadata about what error occurred, and in an
>> >>> ordinary Beam pipeline one could easily apply a Map(lambda error_row:
>> >>> error_row.element) but YAML doesn't have Map, just MapToFields
>> >>> (primarily to be more schema friendly).
>> >>>
>> >>> There are a couple of options:
>> >>>
>> >>> (0) Leave things as they are. One can write
>> >>>
>> >>> type: MapToFields
>> >>> config:
>> >>>   fields:
>> >>>     fld1: element.fld1
>> >>>     fld2: element.fld2
>> >>>     ...
>> >>>
>> >>>
>> >>> This is of course a bit ugly as one needs to enumerate (and know) the
>> >>> set of original fields.
>> >>>
>> >>> (1a) Provide a special operation "Unnest" that takes a single field
>> >>> and emits it as the top-level element. This can of course result in
>> >>> unschema'd PCollections (which are supported, but generally don't play
>> >>> as well with the other operations, including xlang ones).
>> >>>
>> >>> (1b) Just provide a Map. This is a generalization of 1a, but on the
>> >>> other hand would be more prone to abuse.
>> >>>
>> >>> (1c) We could name this
>> >>>
>> >>> type: MapToFields
>> >>> config:
>> >>>   fields:
>> >>>     *: element
>> >>>
>> >>> IIRC, we already have the special case of "*" in our join syntax, and
>> >>> we could re-use a bunch of the MapToFields infrastructure. But maybe
>> >>> it's too obscure?
>> >>>
>> >>> (2) Add an optional argument to error_handling to omit the metadata.
>> >>> This would require a bit of a hack to support ubiquitously, and
>> >>> wouldn't solve the more general problem.
>> >>>
>> >>> Maybe there are some other ideas as well?
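
For comparison, options (0) and (1c) from the original message can be sketched over plain Python dicts. This is only an illustration under stated assumptions: error_row, enumerate_fields, and expand_star are hypothetical names, the msg key stands in for whatever error metadata is attached, and reading "*: element" as "promote every field of element" is one interpretation of the proposal, not an existing Beam YAML feature.

```python
from typing import Any, Dict, List

# Hypothetical error row: the original element plus assumed metadata.
error_row: Dict[str, Any] = {
    "element": {"fld1": "a", "fld2": 2},
    "msg": "something failed",
}


def enumerate_fields(row: Dict[str, Any], names: List[str]) -> Dict[str, Any]:
    # Option (0): the caller must know and list every original field.
    return {name: row["element"][name] for name in names}


def expand_star(row: Dict[str, Any]) -> Dict[str, Any]:
    # Option (1c), as interpreted here: promote all fields of the
    # nested element to the top level without naming them.
    return dict(row["element"])


print(enumerate_fields(error_row, ["fld1", "fld2"]))
print(expand_star(error_row))
```

Both calls rebuild the same record here, but only the first breaks (or silently drops fields) when the original element's field set changes.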
