Hello - I've had no joy investigating this further, nor finding the specific change in v2.30 that caused this break in behaviour. I have tested with v2.40, to no avail.
Would it be useful if I put together an example project and submitted it to this mailing list? It would need a BigQuery database (or a mock, perhaps) to see this in action - any recommendations on how I could achieve this? I've put a rough sketch of what I have in mind at the very bottom of this email.

Thanks again

On Thu, 23 Jun 2022 at 13:30, Jimmy Headdon <jimmy.head...@mojiworks.com> wrote:

> Hello everyone
>
> Thanks for your continued efforts on the project, and for your help with
> my previous request! I've run into an issue where upgrading to Beam v2.30
> (or any version up to and including v2.39) causes a breaking error when I
> attempt to write an array of Map objects to Google's BigQuery:
>
>     "Error while reading data, error message: JSON parsing error in row
>     starting at position 0: Nested arrays not allowed."
>
> I'll explain my setup below, but for reference this behaviour worked with
> Beam's Java library versions 2.19 through 2.29 (I checked them all, and
> it definitely breaks from v2.30 onwards).
>
> *Pipeline Setup*
> I have a Java pipeline that reads TableRow objects from BigQuery,
> performs some fairly basic aggregation through Beam SQL and some
> in-memory transformations, and writes the results back to a different
> table in BigQuery.
>
> The initial BQ TableRow schema is a handful of STRING, INTEGER and
> BOOLEAN fields, plus a single (nullable) RECORD field. This RECORD field
> is the one that causes problems: it is a REPEATED record with simple
> STRING fields for each instance.
>
> When converting the BQ TableRow to Beam Row objects, I'm setting a row
> schema for the above BQ RECORD as follows (I've also tried a Row field,
> shown at the very bottom of this email for reference):
>
>     .addNullableField("experiments",
>         FieldType.array(FieldType.map(FieldType.STRING, FieldType.STRING)))
>
> I have a custom ParDo function to convert the TableRow objects to Row
> (the relevant helper is at the very bottom of this email, for reference).
> It creates a Map instance for every repeated record from BQ, and adds it
> to an ArrayList which is then added to the new Row through
> Row.withSchema(x).addValue (I've also tried .addArray).
>
> When writing back to BigQuery (I use
> org.apache.beam.sdk.io.gcp.bigquery.BigQueryUtils.toTableRow()) the
> TableRow is created, but BigQuery then rejects it with the aforementioned
> "Nested arrays not allowed" error.
>
> I'm fairly confident the issue occurs in the initial conversion from
> TableRow to Row, as I can print row.toString() and the outputs differ
> between versions. It's worth noting that printing the TableRow before
> conversion shows no difference between versions.
>
>     System.out.println(row.toString());
>
> Here is the output for this RECORD field with Beam v2.29:
>
>     [{"name": "forename", "type": "person"}]
>
> And the same input data printed with Beam v2.30 (or higher):
>
>     [[{"key":"name","value":"forename"},{"key":"type","value":"person"}]]
>
> It has become a nested structure with key and value fields.
>
> Nothing obvious stands out in the v2.30 release notes; I'm thinking it
> could be an upstream library change, or possibly an indirect Coder
> change? Any ideas?
>
> Thanks again - let me know if I can provide any other snippets, input
> data, etc.
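Stepping out of the quote for a moment: the custom ParDo mentioned above isn't reproduced in full below, so here is a trimmed sketch of its shape. The class name, the scalar "userId" field and the schema constant are illustrative rather than the production code - only the "experiments" handling matches what I described:

    import java.util.ArrayList;
    import java.util.Map;

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.schemas.Schema;
    import org.apache.beam.sdk.schemas.Schema.FieldType;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.Row;

    public class TableRowToRowFn extends DoFn<TableRow, Row> {

        // Illustrative schema - the real one carries several more scalar fields.
        public static final Schema ROW_SCHEMA = Schema.builder()
            .addNullableField("userId", FieldType.STRING)
            .addNullableField("experiments",
                FieldType.array(FieldType.map(FieldType.STRING, FieldType.STRING)))
            .build();

        @ProcessElement
        public void processElement(@Element TableRow tableRow, OutputReceiver<Row> out) {
            // BigQuery surfaces the REPEATED RECORD as a list of maps.
            @SuppressWarnings("unchecked")
            ArrayList<Map<String, Object>> rawRecords =
                (ArrayList<Map<String, Object>>) tableRow.get("experiments");

            // Values are added in schema field order.
            out.output(Row.withSchema(ROW_SCHEMA)
                .addValue((String) tableRow.get("userId"))
                .addValue(getRecords(rawRecords))
                .build());
        }

        // getRecords(...) is the helper shown under *TableRow to Row
        // Conversion* at the bottom of this email.
    }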
> *TableRow to Row Conversion*
>
>     public ArrayList<Map<String, String>> getRecords(ArrayList<Map<String, Object>> records) {
>         if (records == null) {
>             return null;
>         }
>
>         ArrayList<Map<String, String>> recordRows =
>             new ArrayList<Map<String, String>>();
>
>         for (Map<String, Object> record : records) {
>             Map<String, String> e = new HashMap<String, String>();
>             e.put("name", (String) record.getOrDefault("name", ""));
>             e.put("type", (String) record.getOrDefault("type", ""));
>
>             recordRows.add(e);
>         }
>
>         return recordRows;
>     }
>
> *Alternate Row Schema*
>
>     public static Schema recordsRowSchema =
>         Schema.builder()
>             .addNullableField("name", FieldType.STRING)
>             .addNullableField("type", FieldType.STRING)
>             .build();
>
>     .addNullableField("records",
>         FieldType.array(FieldType.row(recordsRowSchema)))
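Back to the example-project question above: my working assumption (untested against a real load job) is that a live BigQuery dataset may not even be needed to show the difference, since both the Row construction and BigQueryUtils.toTableRow() run locally. Something along these lines - class and field names are just illustrative - is what I have in mind:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.Map;

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryUtils;
    import org.apache.beam.sdk.schemas.Schema;
    import org.apache.beam.sdk.schemas.Schema.FieldType;
    import org.apache.beam.sdk.values.Row;

    public class NestedArrayRepro {

        public static void main(String[] args) {
            Schema schema = Schema.builder()
                .addNullableField("experiments",
                    FieldType.array(FieldType.map(FieldType.STRING, FieldType.STRING)))
                .build();

            // One repeated record, shaped as it arrives from BigQuery.
            Map<String, String> experiment = new HashMap<>();
            experiment.put("name", "forename");
            experiment.put("type", "person");

            ArrayList<Map<String, String>> experiments = new ArrayList<>();
            experiments.add(experiment);

            Row row = Row.withSchema(schema).addValue(experiments).build();

            // I'd expect this to differ between v2.29 and v2.30+, per the
            // outputs quoted above.
            System.out.println(row.toString());

            // This TableRow is what BigQuery later rejects with
            // "Nested arrays not allowed" when loading.
            TableRow tableRow = BigQueryUtils.toTableRow(row);
            System.out.println(tableRow);
        }
    }

If that reproduces the difference without any GCP credentials, I'll wrap it up as a small project and share it here.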