There is no bug in the Coder itself, so that wouldn't catch it. We could insert CoderProperties.coderDecodeEncodeEqual in a subsequent ParDo, but if the Direct runner already does an encode/decode before that ParDo, then that would have fixed the problem before we could see it.
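The point above, that the coder round trip itself is lossless while only the in-memory object is inconsistent, can be sketched with a toy model. This is plain Java for illustration, not Beam's actual Row/RowCoder classes: a schema-based coder serializes only raw values, so field names never cross the boundary.

```java
import java.util.*;

public class RoundTripFix {
    // Encode: a schema-based coder writes only the raw values, in field
    // order; field names are never serialized.
    static List<Object> encode(Map<String, ?> row, List<String> schema) {
        List<Object> raw = new ArrayList<>();
        for (String f : schema) raw.add(row.get(f));
        return raw;
    }

    // Decode: the row is rebuilt against the authoritative PCollection
    // schema, so any stale per-object naming is discarded.
    static Map<String, Object> decode(List<Object> raw, List<String> schema) {
        Map<String, Object> row = new LinkedHashMap<>();
        for (int i = 0; i < schema.size(); i++) row.put(schema.get(i), raw.get(i));
        return row;
    }

    public static void main(String[] args) {
        // An object built with the stale, pre-rename field name...
        Map<String, String> stale = Map.of("field1_0", "value1_0");
        // ...comes back consistent after one encode/decode round trip,
        // which is why an encode/decode step masks the bug.
        Map<String, Object> fixed =
            decode(encode(stale, List.of("field1_0")), List.of("nestedStringField"));
        System.out.println(fixed); // {nestedStringField=value1_0}
    }
}
```

Because names never cross the boundary, any round trip is trivially "equal" at the coder level, consistent with the observation that coderDecodeEncodeEqual would not catch this.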
On Wed, Jun 2, 2021 at 11:53 AM Kenneth Knowles <k...@apache.org> wrote:

> Would it be caught by CoderProperties?
>
> Kenn
>
> On Wed, Jun 2, 2021 at 8:16 AM Reuven Lax <re...@google.com> wrote:
>
>> I don't think this bug is schema specific - we created a Java object that
>> is inconsistent with its encoded form, which could happen to any transform.
>>
>> This does seem to be a gap in DirectRunner testing though. It also makes
>> it hard to test using PAssert, as I believe that puts everything in a side
>> input, forcing an encoding/decoding.
>>
>> On Wed, Jun 2, 2021 at 8:12 AM Brian Hulette <bhule...@google.com> wrote:
>>
>>> +dev <d...@beam.apache.org>
>>>
>>> > I bet the DirectRunner is encoding and decoding in between, which
>>> fixes the object.
>>>
>>> Do we need better testing of schema-aware (and potentially other
>>> built-in) transforms in the face of fusion to root out issues like this?
>>>
>>> Brian
>>>
>>> On Wed, Jun 2, 2021 at 5:13 AM Matthew Ouyang <matthew.ouy...@gmail.com>
>>> wrote:
>>>
>>>> I have some other work-related things I need to do this week, so I will
>>>> likely report back on this over the weekend. Thank you for the
>>>> explanation. It makes perfect sense now.
>>>>
>>>> On Tue, Jun 1, 2021 at 11:18 PM Reuven Lax <re...@google.com> wrote:
>>>>
>>>>> Some more context - the problem is that RenameFields outputs (in this
>>>>> case) Java Row objects that are inconsistent with the actual schema.
>>>>> For example if you have the following schema:
>>>>>
>>>>> Row {
>>>>>   field1: Row {
>>>>>     field2: string
>>>>>   }
>>>>> }
>>>>>
>>>>> And rename field1.field2 -> renamed, you'll get the following schema:
>>>>>
>>>>> Row {
>>>>>   field1: Row {
>>>>>     renamed: string
>>>>>   }
>>>>> }
>>>>>
>>>>> However, the Java object for the _nested_ row will return the old
>>>>> schema if getSchema() is called on it. This is because we only update the
>>>>> schema on the top-level row.
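The situation described above can be sketched with a toy model (plain Java for illustration, NOT Beam's actual Row class): each row object carries its own idea of its schema, and only the top-level object learns about the rename.

```java
import java.util.List;

// Toy stand-in for a schema-carrying row: each object holds the field
// names it believes it has, plus its values.
class ToyRow {
    List<String> schemaFieldNames; // the schema this object believes it has
    List<Object> values;

    ToyRow(List<String> schemaFieldNames, List<Object> values) {
        this.schemaFieldNames = schemaFieldNames;
        this.values = values;
    }
}

public class StaleNestedSchema {
    // Mimics the behavior described above: a renamed schema is attached to
    // the top-level row, but nested row objects are reused as-is.
    static ToyRow renameTopLevelOnly(ToyRow row, List<String> renamedSchema) {
        return new ToyRow(renamedSchema, row.values); // nested rows untouched
    }

    public static void main(String[] args) {
        ToyRow nested = new ToyRow(List.of("field2"), List.<Object>of("hello"));
        ToyRow top = new ToyRow(List.of("field1"), List.<Object>of(nested));

        // Rename field1.field2 -> renamed: only the top-level object is
        // rebuilt; the nested object still reports its old field name.
        ToyRow renamed = renameTopLevelOnly(top, List.of("field1"));
        ToyRow nestedAfter = (ToyRow) renamed.values.get(0);
        System.out.println(nestedAfter.schemaFieldNames); // [field2]
    }
}
```

Any downstream code that asks the nested object for its schema (rather than consulting the top-level schema) will see the stale name.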
>>>>>
>>>>> I think this explains why your test works in the direct runner. If the
>>>>> row ever goes through an encode/decode path, it will come back correct. The
>>>>> original incorrect Java objects are no longer around, and new (consistent)
>>>>> objects are constructed from the raw data and the PCollection schema.
>>>>> Dataflow tends to fuse ParDos together, so the following ParDo will see the
>>>>> incorrect Row object. I bet the DirectRunner is encoding and decoding in
>>>>> between, which fixes the object.
>>>>>
>>>>> You can validate this theory by forcing a shuffle after RenameFields
>>>>> using Reshuffle. It should fix the issue. If it does, let me know and I'll
>>>>> work on a fix to RenameFields.
>>>>>
>>>>> On Tue, Jun 1, 2021 at 7:39 PM Reuven Lax <re...@google.com> wrote:
>>>>>
>>>>>> Aha, yes, this is indeed another bug in the transform. The schema is set
>>>>>> on the top-level Row but not on any nested rows.
>>>>>>
>>>>>> On Tue, Jun 1, 2021 at 6:37 PM Matthew Ouyang <
>>>>>> matthew.ouy...@gmail.com> wrote:
>>>>>>
>>>>>>> Thank you everyone for your input. I believe it will be easiest to
>>>>>>> respond to all feedback in a single message rather than messages per
>>>>>>> person.
>>>>>>>
>>>>>>> - NeedsRunner - The tests are run eventually, so obviously all
>>>>>>> good on my end. I was trying to run the smallest subset of test cases
>>>>>>> possible and didn't venture beyond `gradle test`.
>>>>>>> - Stack Trace - There wasn't one, unfortunately, because no
>>>>>>> exception was thrown in the code. The Beam Row was translated into a BQ
>>>>>>> TableRow and an insertion was attempted. The error "message" was part of
>>>>>>> the response JSON that came back as a result of a request against the BQ
>>>>>>> API.
>>>>>>> - Desired Behaviour - (field0_1.field1_0, nestedStringField) ->
>>>>>>> field0_1.nestedStringField is what I am looking for.
>>>>>>> - Info Logging Findings (In Lieu of a Stack Trace):
>>>>>>>   - The Beam Schema was as expected, with all renames applied.
>>>>>>>   - The example I provided was heavily stripped down in order to
>>>>>>>     isolate the problem. My work example, which is a bit impractical
>>>>>>>     because it's part of some generic tooling, has 4 levels of nesting
>>>>>>>     and also produces the correct output.
>>>>>>>   - BigQueryUtils.toTableRow(Row) returns the expected TableRow in
>>>>>>>     DirectRunner. In DataflowRunner, however, only the top-level renames
>>>>>>>     were reflected in the TableRow; the renames in the nested fields
>>>>>>>     weren't.
>>>>>>>   - BigQueryUtils.toTableRow(Row) recurses on the Row values and uses
>>>>>>>     the Row.schema to get the field names. This makes sense to me, but
>>>>>>>     if a value is actually a Row then its schema appears to be
>>>>>>>     inconsistent with the top-level schema.
>>>>>>> - My Current Workaround - I forked RenameFields and replaced the
>>>>>>> attachValues in the expand method with a "deep" rename. This is obviously
>>>>>>> inefficient and I will not be submitting a PR for that.
>>>>>>> - JIRA ticket - https://issues.apache.org/jira/browse/BEAM-12442
>>>>>>>
>>>>>>> On Tue, Jun 1, 2021 at 5:51 PM Reuven Lax <re...@google.com> wrote:
>>>>>>>
>>>>>>>> This transform is the same across all runners. A few comments on
>>>>>>>> the test:
>>>>>>>>
>>>>>>>> - Using attachValues directly is error prone (per the comment on
>>>>>>>> the method). I recommend using the withFieldValue builders instead.
>>>>>>>> - I recommend capturing the RenameFields PCollection into a local
>>>>>>>> variable of type PCollection<Row> and printing out the schema (which you
>>>>>>>> can get using the PCollection.getSchema method) to ensure that the output
>>>>>>>> schema looks like you expect.
>>>>>>>> - RenameFields doesn't flatten.
So renaming field0_1.field1_0 ->
>>>>>>>> nestedStringField results in field0_1.nestedStringField; if you wanted
>>>>>>>> to flatten, then the better transform would be
>>>>>>>> Select.fieldNameAs("field0_1.field1_0", "nestedStringField").
>>>>>>>>
>>>>>>>> This all being said, eyeballing the implementation of RenameFields
>>>>>>>> makes me think that it is buggy in the case where you specify a top-level
>>>>>>>> field multiple times like you do. I think it is simply adding the top-level
>>>>>>>> field into the output schema multiple times, and the second time is with
>>>>>>>> the field0_1 base name; I have no idea why your test doesn't catch this in
>>>>>>>> the DirectRunner, as it's equally broken there. Could you file a JIRA about
>>>>>>>> this issue and assign it to me?
>>>>>>>>
>>>>>>>> Reuven
>>>>>>>>
>>>>>>>> On Tue, Jun 1, 2021 at 12:47 PM Kenneth Knowles <k...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> On Tue, Jun 1, 2021 at 12:42 PM Brian Hulette <bhule...@google.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Matthew,
>>>>>>>>>>
>>>>>>>>>> > The unit tests also seem to be disabled for this as well and so
>>>>>>>>>> I don't know if the PTransform behaves as expected.
>>>>>>>>>>
>>>>>>>>>> The exclusion for NeedsRunner tests is just a quirk in our
>>>>>>>>>> testing framework. NeedsRunner indicates that a test suite can't be
>>>>>>>>>> executed with the SDK alone; it needs a runner. So that exclusion just
>>>>>>>>>> makes sure we don't run the test when we're verifying the SDK by itself in
>>>>>>>>>> the :sdks:java:core:test task. The test is still run in other tasks where
>>>>>>>>>> we have a runner, most notably in the Java PreCommit [1], where we run it
>>>>>>>>>> as part of the :runners:direct-java:test task.
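Reuven's hypothesis above, that the transform adds the same top-level field to the output schema once per rename entry, can be sketched with a toy comparison. This is an illustration of the suspected failure mode and a grouped alternative, not Beam's actual implementation:

```java
import java.util.*;

public class DuplicateFieldSketch {
    // Suspected buggy shape: emit one output field per rename entry, so the
    // two renames touching "field0_1" ("field0_1" itself and
    // "field0_1.field1_0") produce that top-level field twice.
    static List<String> naiveOutputFields(List<String> renamedPaths) {
        List<String> out = new ArrayList<>();
        for (String path : renamedPaths) {
            out.add(path.split("\\.")[0]); // top-level component only
        }
        return out;
    }

    // Grouping rename entries by their top-level field first yields each
    // output field exactly once.
    static List<String> groupedOutputFields(List<String> renamedPaths) {
        LinkedHashSet<String> out = new LinkedHashSet<>();
        for (String path : renamedPaths) {
            out.add(path.split("\\.")[0]);
        }
        return new ArrayList<>(out);
    }

    public static void main(String[] args) {
        List<String> renames =
            List.of("field0_0", "field0_1", "field0_1.field1_0", "field0_2");
        System.out.println(naiveOutputFields(renames));
        // [field0_0, field0_1, field0_1, field0_2]  <- field0_1 duplicated
        System.out.println(groupedOutputFields(renames));
        // [field0_0, field0_1, field0_2]
    }
}
```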
>>>>>>>>>>
>>>>>>>>>> That being said, we may only run these tests continuously with
>>>>>>>>>> the DirectRunner; I'm not sure if we test them on all the runners like we
>>>>>>>>>> do with ValidatesRunner tests.
>>>>>>>>>>
>>>>>>>>> That is correct. The tests are tests _of the transform_, so they
>>>>>>>>> run only on the DirectRunner. They are not tests of the runner, which is
>>>>>>>>> only responsible for correctly implementing Beam's primitives. The
>>>>>>>>> transform should not behave differently on different runners, except for
>>>>>>>>> fundamental differences in how they schedule work and checkpoint.
>>>>>>>>>
>>>>>>>>> Kenn
>>>>>>>>>
>>>>>>>>>> > The error message I'm receiving, "Error while reading data,
>>>>>>>>>> error message: JSON parsing error in row starting at position 0: No such
>>>>>>>>>> field: nestedField.field1_0", suggests that BigQuery is trying to
>>>>>>>>>> use the original name for the nested field and not the substitute name.
>>>>>>>>>>
>>>>>>>>>> Is there a stack trace associated with this error? It would be
>>>>>>>>>> helpful to see where the error is coming from.
>>>>>>>>>>
>>>>>>>>>> Brian
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> https://ci-beam.apache.org/job/beam_PreCommit_Java_Cron/4101/testReport/org.apache.beam.sdk.schemas.transforms/RenameFieldsTest/
>>>>>>>>>>
>>>>>>>>>> On Mon, May 31, 2021 at 5:02 PM Matthew Ouyang <
>>>>>>>>>> matthew.ouy...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I'm trying to use the RenameFields transform prior to inserting
>>>>>>>>>>> into BigQuery on nested fields.
Insertion into BigQuery is successful with
>>>>>>>>>>> DirectRunner, but DataflowRunner has an issue with renamed nested fields.
>>>>>>>>>>> The error message I'm receiving, "Error while reading data,
>>>>>>>>>>> error message: JSON parsing error in row starting at position 0: No such
>>>>>>>>>>> field: nestedField.field1_0", suggests that BigQuery is trying to
>>>>>>>>>>> use the original name for the nested field and not the substitute name.
>>>>>>>>>>>
>>>>>>>>>>> The code for RenameFields seems simple enough, but does it behave
>>>>>>>>>>> differently in different runners? Will a deep attachValues be necessary
>>>>>>>>>>> in order to get the nested renames to work across all runners? Is there
>>>>>>>>>>> something wrong in my code?
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/apache/beam/blob/243128a8fc52798e1b58b0cf1a271d95ee7aa241/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/RenameFields.java#L186
>>>>>>>>>>>
>>>>>>>>>>> The unit tests also seem to be disabled for this as well and so
>>>>>>>>>>> I don't know if the PTransform behaves as expected.
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/apache/beam/blob/243128a8fc52798e1b58b0cf1a271d95ee7aa241/sdks/java/core/build.gradle#L67
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/apache/beam/blob/243128a8fc52798e1b58b0cf1a271d95ee7aa241/sdks/java/core/src/test/java/org/apache/beam/sdk/schemas/transforms/RenameFieldsTest.java
>>>>>>>>>>>
>>>>>>>>>>>> package ca.loblaw.cerebro.PipelineControl;
>>>>>>>>>>>>
>>>>>>>>>>>> import com.google.api.services.bigquery.model.TableReference;
>>>>>>>>>>>> import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
>>>>>>>>>>>> import org.apache.beam.sdk.Pipeline;
>>>>>>>>>>>> import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
>>>>>>>>>>>> import org.apache.beam.sdk.options.PipelineOptionsFactory;
>>>>>>>>>>>> import org.apache.beam.sdk.schemas.Schema;
>>>>>>>>>>>> import org.apache.beam.sdk.schemas.transforms.RenameFields;
>>>>>>>>>>>> import org.apache.beam.sdk.transforms.Create;
>>>>>>>>>>>> import org.apache.beam.sdk.values.Row;
>>>>>>>>>>>>
>>>>>>>>>>>> import java.io.File;
>>>>>>>>>>>> import java.util.Arrays;
>>>>>>>>>>>> import java.util.HashSet;
>>>>>>>>>>>> import java.util.stream.Collectors;
>>>>>>>>>>>>
>>>>>>>>>>>> import static java.util.Arrays.asList;
>>>>>>>>>>>>
>>>>>>>>>>>> public class BQRenameFields {
>>>>>>>>>>>>     public static void main(String[] args) {
>>>>>>>>>>>>         PipelineOptionsFactory.register(DataflowPipelineOptions.class);
>>>>>>>>>>>>         DataflowPipelineOptions options =
>>>>>>>>>>>>             PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
>>>>>>>>>>>>         options.setFilesToStage(
>>>>>>>>>>>>             Arrays.stream(System.getProperty("java.class.path")
>>>>>>>>>>>>                     .split(File.pathSeparator))
>>>>>>>>>>>>                 .map(entry -> (new File(entry)).toString())
>>>>>>>>>>>>                 .collect(Collectors.toList()));
>>>>>>>>>>>>
>>>>>>>>>>>>         Pipeline pipeline = Pipeline.create(options);
>>>>>>>>>>>>
>>>>>>>>>>>>         Schema nestedSchema = Schema.builder()
>>>>>>>>>>>>             .addField(Schema.Field.nullable("field1_0", Schema.FieldType.STRING))
>>>>>>>>>>>>             .build();
>>>>>>>>>>>>         Schema.Field field = Schema.Field.nullable("field0_0", Schema.FieldType.STRING);
>>>>>>>>>>>>         Schema.Field nested =
>>>>>>>>>>>>             Schema.Field.nullable("field0_1", Schema.FieldType.row(nestedSchema));
>>>>>>>>>>>>         Schema.Field runner = Schema.Field.nullable("field0_2", Schema.FieldType.STRING);
>>>>>>>>>>>>         Schema rowSchema = Schema.builder()
>>>>>>>>>>>>             .addFields(field, nested, runner)
>>>>>>>>>>>>             .build();
>>>>>>>>>>>>         Row testRow = Row.withSchema(rowSchema).attachValues(
>>>>>>>>>>>>             "value0_0",
>>>>>>>>>>>>             Row.withSchema(nestedSchema).attachValues("value1_0"),
>>>>>>>>>>>>             options.getRunner().toString());
>>>>>>>>>>>>
>>>>>>>>>>>>         pipeline
>>>>>>>>>>>>             .apply(Create.of(testRow).withRowSchema(rowSchema))
>>>>>>>>>>>>             .apply(RenameFields.<Row>create()
>>>>>>>>>>>>                 .rename("field0_0", "stringField")
>>>>>>>>>>>>                 .rename("field0_1", "nestedField")
>>>>>>>>>>>>                 .rename("field0_1.field1_0", "nestedStringField")
>>>>>>>>>>>>                 .rename("field0_2", "runner"))
>>>>>>>>>>>>             .apply(BigQueryIO.<Row>write()
>>>>>>>>>>>>                 .to(new TableReference()
>>>>>>>>>>>>                     .setProjectId("lt-dia-lake-exp-raw")
>>>>>>>>>>>>                     .setDatasetId("prototypes")
>>>>>>>>>>>>                     .setTableId("matto_renameFields"))
>>>>>>>>>>>>                 .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
>>>>>>>>>>>>                 .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
>>>>>>>>>>>>                 .withSchemaUpdateOptions(new HashSet<>(
>>>>>>>>>>>>                     asList(BigQueryIO.Write.SchemaUpdateOption.ALLOW_FIELD_ADDITION)))
>>>>>>>>>>>>                 .useBeamSchema());
>>>>>>>>>>>>         pipeline.run();
>>>>>>>>>>>>     }
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>
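The "deep" rename workaround mentioned earlier in the thread can be sketched on a toy row model. This is plain Java for illustration, not Beam's actual Row/attachValues API: instead of reusing nested value objects that still carry old names, every nested row is rebuilt against its renamed schema.

```java
import java.util.*;

public class DeepRenameSketch {
    // Toy stand-in for a schema-carrying row.
    static class ToyRow {
        List<String> schema;
        List<Object> values;
        ToyRow(List<String> schema, List<Object> values) {
            this.schema = schema;
            this.values = values;
        }
    }

    // Deep rename: recurse into nested rows, rebuilding each one so its
    // own schema reflects the rename map (keyed by dotted field path).
    static ToyRow deepRename(ToyRow row, Map<String, String> renames, String prefix) {
        List<String> newSchema = new ArrayList<>();
        List<Object> newValues = new ArrayList<>();
        for (int i = 0; i < row.schema.size(); i++) {
            String path = prefix + row.schema.get(i);
            newSchema.add(renames.getOrDefault(path, row.schema.get(i)));
            Object v = row.values.get(i);
            newValues.add(v instanceof ToyRow
                ? deepRename((ToyRow) v, renames, path + ".")
                : v);
        }
        return new ToyRow(newSchema, newValues);
    }

    public static void main(String[] args) {
        ToyRow nested = new ToyRow(List.of("field1_0"), List.<Object>of("value1_0"));
        ToyRow top = new ToyRow(List.of("field0_1"), List.<Object>of(nested));
        Map<String, String> renames = Map.of(
            "field0_1", "nestedField",
            "field0_1.field1_0", "nestedStringField");
        ToyRow out = deepRename(top, renames, "");
        System.out.println(out.schema);                          // [nestedField]
        System.out.println(((ToyRow) out.values.get(0)).schema); // [nestedStringField]
    }
}
```

Rebuilding every nested object is what makes this approach correct under fusion, and also what makes it inefficient relative to a fix inside RenameFields itself.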