[jira] [Updated] (PIG-5115) Builtin AvroStorage generates incorrect avro schema when the same pig field name appears in the alias

Anyi Li (JIRA) Wed, 25 Jan 2017 14:05:51 -0800

     [ 
https://issues.apache.org/jira/browse/PIG-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Anyi Li updated PIG-5115:
-------------------------
    Description: 
Pig ResourceSchema allows to use same field names but different types when they 
are not in the same level. The pig schema like
{quote}
data: {col1: (col2: (col1_data: chararray)),col2: {col2: (col2_data: 
chararray)}}
{quote}

Although _col2_ has been redefined, they are not appeared in the same level, it 
is a totally valid pig schema. 

However, once it is translated by AvroStorage, it will throw exception 
{noformat}
Can't redefine: col2
        at 
org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:64)
        at 
org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:66)
        at 
org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64)
        at 
org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
        at 
org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
        at 
org.apache.pig.newplan.DepthFirstWalker.walk(DepthFirstWalker.java:53)
        at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
        at 
org.apache.pig.newplan.logical.relational.LogicalPlan.validate(LogicalPlan.java:212)
        at org.apache.pig.PigServer$Graph.compile(PigServer.java:1767)
        at org.apache.pig.PigServer$Graph.access$300(PigServer.java:1443)
        at org.apache.pig.PigServer.execute(PigServer.java:1356)
        at org.apache.pig.PigServer.executeBatch(PigServer.java:415)
        at org.apache.pig.PigServer.executeBatch(PigServer.java:398)
        at 
org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171)
        at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234)
        at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
        at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
        at org.apache.pig.Main.run(Main.java:631)
        at org.apache.pig.Main.main(Main.java:177)
Caused by: org.apache.avro.SchemaParseException: Can't redefine: col2
        at org.apache.avro.Schema$Names.put(Schema.java:1042)
        at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:511)
        at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:626)
        at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:737)
        at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:814)
        at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:648)
        at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:635)
        at org.apache.avro.Schema.toString(Schema.java:297)
        at org.apache.avro.Schema.toString(Schema.java:287)
        at 
org.apache.pig.builtin.AvroStorage.setOutputAvroSchema(AvroStorage.java:442)
        at org.apache.pig.builtin.AvroStorage.checkSchema(AvroStorage.java:433)
        at 
org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:54)
        ... 18 more
{noformat}

It is caused by a bug in AvroStorageSchemaConversionUtilities class which uses 
tuple name as GenericRecord name as well as the fieldname that wraps the 
record. 

So it would like to  produces the avro schema like the following 
{noformat}
{
  "type": "record",
  "name": "data",
  "fields": [
    {
      "name": "col1",
      "type": {
        "type": "record",
        "name": "col1_1",
        "fields": [
          {
            "name": "col2",
            "type": {
              "type": "record",
              "name": "col2",
              "fields": [
                {
                  "name": "col1_data",
                  "type": "string"
                }
              ]
            }
          }
        ]
      }
    },
    {
      "name": "col2",
      "type": {
        "type": "array",
        "items": {
          "type": "record",
          "name": "col2",
          "fields": [
            {
              "name": "col2_data",
              "type": "string"
            }
          ]
        }
      }
    }
  ]
}

{noformat}
But according to the avro 1.7.7  specs 
([https://avro.apache.org/docs/1.7.7/spec.html#Names]), _col2_ has been defined 
as record and redefined as array later, it is an invalid schema, unless the 
fullname (namespace + name) is unique. 

Since AvroStorageSchemaConversionUtilities will generate avro record if the pig 
schema is a tuple, we need a way to generate unique _recordName_. 

{code}
public static Schema resourceSchemaToAvroSchema(final ResourceSchema rs,
      String recordName, final String recordNameSpace,
      final Map<String, List<Schema>> definedRecordNames,
      final Boolean doubleColonsToDoubleUnderscores) throws IOException {

    if (rs == null) {
      return null;
    }

    recordName = toAvroName(recordName, doubleColonsToDoubleUnderscores);

    List<Schema.Field> fields = new ArrayList<Schema.Field>();
    Schema newSchema = Schema.createRecord(
            recordName, null, recordNameSpace, false);

{code}

The AvroStorage from piggybank solved this problem by defining a static method 
and generate unique recordname. We can implement the similar method for the 
builtin AvroStorage 

 

  was:
Pig ResourceSchema allows to use same field name but different type when they 
are not in the same level. The pig schema like
{quote}
data: {col1: (col2: (col1_data: chararray)),col2: {col2: (col2_data: 
chararray)}}
{quote}

Although _col2_ has been redefined, they are not appeared in the same level, it 
is a totally valid pig schema. 

However, once it is translated by AvroStorage, it will throw exception 
{noformat}
Can't redefine: col2
        at 
org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:64)
        at 
org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:66)
        at 
org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64)
        at 
org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
        at 
org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
        at 
org.apache.pig.newplan.DepthFirstWalker.walk(DepthFirstWalker.java:53)
        at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
        at 
org.apache.pig.newplan.logical.relational.LogicalPlan.validate(LogicalPlan.java:212)
        at org.apache.pig.PigServer$Graph.compile(PigServer.java:1767)
        at org.apache.pig.PigServer$Graph.access$300(PigServer.java:1443)
        at org.apache.pig.PigServer.execute(PigServer.java:1356)
        at org.apache.pig.PigServer.executeBatch(PigServer.java:415)
        at org.apache.pig.PigServer.executeBatch(PigServer.java:398)
        at 
org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171)
        at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234)
        at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
        at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
        at org.apache.pig.Main.run(Main.java:631)
        at org.apache.pig.Main.main(Main.java:177)
Caused by: org.apache.avro.SchemaParseException: Can't redefine: col2
        at org.apache.avro.Schema$Names.put(Schema.java:1042)
        at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:511)
        at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:626)
        at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:737)
        at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:814)
        at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:648)
        at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:635)
        at org.apache.avro.Schema.toString(Schema.java:297)
        at org.apache.avro.Schema.toString(Schema.java:287)
        at 
org.apache.pig.builtin.AvroStorage.setOutputAvroSchema(AvroStorage.java:442)
        at org.apache.pig.builtin.AvroStorage.checkSchema(AvroStorage.java:433)
        at 
org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:54)
        ... 18 more
{noformat}

It is caused by a bug in AvroStorageSchemaConversionUtilities class which uses 
tuple name as GenericRecord name as well as the fieldname that wraps the 
record. 

So it would like to  produces the avro schema like the following 
{noformat}
{
  "type": "record",
  "name": "data",
  "fields": [
    {
      "name": "col1",
      "type": {
        "type": "record",
        "name": "col1_1",
        "fields": [
          {
            "name": "col2",
            "type": {
              "type": "record",
              "name": "col2",
              "fields": [
                {
                  "name": "col1_data",
                  "type": "string"
                }
              ]
            }
          }
        ]
      }
    },
    {
      "name": "col2",
      "type": {
        "type": "array",
        "items": {
          "type": "record",
          "name": "col2",
          "fields": [
            {
              "name": "col2_data",
              "type": "string"
            }
          ]
        }
      }
    }
  ]
}

{noformat}
But according to the avro 1.7.7  specs 
([https://avro.apache.org/docs/1.7.7/spec.html#Names]), _col2_ has been defined 
as record and redefined as array later, it is an invalid schema, unless the 
fullname (namespace + name) is unique. 

Since AvroStorageSchemaConversionUtilities will generate avro record if the pig 
schema is a tuple, we need a way to generate unique _recordName_. 

{code}
public static Schema resourceSchemaToAvroSchema(final ResourceSchema rs,
      String recordName, final String recordNameSpace,
      final Map<String, List<Schema>> definedRecordNames,
      final Boolean doubleColonsToDoubleUnderscores) throws IOException {

    if (rs == null) {
      return null;
    }

    recordName = toAvroName(recordName, doubleColonsToDoubleUnderscores);

    List<Schema.Field> fields = new ArrayList<Schema.Field>();
    Schema newSchema = Schema.createRecord(
            recordName, null, recordNameSpace, false);

{code}

The AvroStorage from piggybank solved this problem by defining a static method 
and generate unique recordname. We can implement the similar method for the 
builtin AvroStorage 

 


> Builtin AvroStorage generates incorrect avro schema when the same pig field 
> name appears in the alias
> -----------------------------------------------------------------------------------------------------
>
>                 Key: PIG-5115
>                 URL: https://issues.apache.org/jira/browse/PIG-5115
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.17.0
>            Reporter: Anyi Li
>
> Pig ResourceSchema allows to use same field names but different types when 
> they are not in the same level. The pig schema like
> {quote}
> data: {col1: (col2: (col1_data: chararray)),col2: {col2: (col2_data: 
> chararray)}}
> {quote}
> Although _col2_ has been redefined, they are not appeared in the same level, 
> it is a totally valid pig schema. 
> However, once it is translated by AvroStorage, it will throw exception 
> {noformat}
> Can't redefine: col2
>         at 
> org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:64)
>         at 
> org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:66)
>         at 
> org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64)
>         at 
> org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
>         at 
> org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
>         at 
> org.apache.pig.newplan.DepthFirstWalker.walk(DepthFirstWalker.java:53)
>         at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
>         at 
> org.apache.pig.newplan.logical.relational.LogicalPlan.validate(LogicalPlan.java:212)
>         at org.apache.pig.PigServer$Graph.compile(PigServer.java:1767)
>         at org.apache.pig.PigServer$Graph.access$300(PigServer.java:1443)
>         at org.apache.pig.PigServer.execute(PigServer.java:1356)
>         at org.apache.pig.PigServer.executeBatch(PigServer.java:415)
>         at org.apache.pig.PigServer.executeBatch(PigServer.java:398)
>         at 
> org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171)
>         at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234)
>         at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
>         at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
>         at org.apache.pig.Main.run(Main.java:631)
>         at org.apache.pig.Main.main(Main.java:177)
> Caused by: org.apache.avro.SchemaParseException: Can't redefine: col2
>         at org.apache.avro.Schema$Names.put(Schema.java:1042)
>         at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:511)
>         at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:626)
>         at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:737)
>         at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:814)
>         at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:648)
>         at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:635)
>         at org.apache.avro.Schema.toString(Schema.java:297)
>         at org.apache.avro.Schema.toString(Schema.java:287)
>         at 
> org.apache.pig.builtin.AvroStorage.setOutputAvroSchema(AvroStorage.java:442)
>         at 
> org.apache.pig.builtin.AvroStorage.checkSchema(AvroStorage.java:433)
>         at 
> org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:54)
>         ... 18 more
> {noformat}
> It is caused by a bug in AvroStorageSchemaConversionUtilities class which 
> uses tuple name as GenericRecord name as well as the fieldname that wraps the 
> record. 
> So it would like to  produces the avro schema like the following 
> {noformat}
> {
>   "type": "record",
>   "name": "data",
>   "fields": [
>     {
>       "name": "col1",
>       "type": {
>         "type": "record",
>         "name": "col1_1",
>         "fields": [
>           {
>             "name": "col2",
>             "type": {
>               "type": "record",
>               "name": "col2",
>               "fields": [
>                 {
>                   "name": "col1_data",
>                   "type": "string"
>                 }
>               ]
>             }
>           }
>         ]
>       }
>     },
>     {
>       "name": "col2",
>       "type": {
>         "type": "array",
>         "items": {
>           "type": "record",
>           "name": "col2",
>           "fields": [
>             {
>               "name": "col2_data",
>               "type": "string"
>             }
>           ]
>         }
>       }
>     }
>   ]
> }
> {noformat}
> But according to the avro 1.7.7  specs 
> ([https://avro.apache.org/docs/1.7.7/spec.html#Names]), _col2_ has been 
> defined as record and redefined as array later, it is an invalid schema, 
> unless the fullname (namespace + name) is unique. 
> Since AvroStorageSchemaConversionUtilities will generate avro record if the 
> pig schema is a tuple, we need a way to generate unique _recordName_. 
> {code}
> public static Schema resourceSchemaToAvroSchema(final ResourceSchema rs,
>       String recordName, final String recordNameSpace,
>       final Map<String, List<Schema>> definedRecordNames,
>       final Boolean doubleColonsToDoubleUnderscores) throws IOException {
>     if (rs == null) {
>       return null;
>     }
>     recordName = toAvroName(recordName, doubleColonsToDoubleUnderscores);
>     List<Schema.Field> fields = new ArrayList<Schema.Field>();
>     Schema newSchema = Schema.createRecord(
>             recordName, null, recordNameSpace, false);
> {code}
> The AvroStorage from piggybank solved this problem by defining a static 
> method and generate unique recordname. We can implement the similar method 
> for the builtin AvroStorage 
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (PIG-5115) Builtin AvroStorage generates incorrect avro schema when the same pig field name appears in the alias

Reply via email to