[
https://issues.apache.org/jira/browse/PIG-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Anyi Li updated PIG-5115:
-------------------------
Description:
Pig ResourceSchema allows to use same field name but different type when they
are not in the same level. The pig schema like
{quote}
data: {col1: (col2: (col1_data: chararray)),col2: {col2: (col2_data:
chararray)}}
{quote}
Although _col2_ has been redefined, they are not appeared in the same level, it
is a totally valid pig schema.
However, once it is translated by AvroStorage, it will throw exception
{quote}
Can't redefine: col2
at
org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:64)
at
org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:66)
at
org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64)
at
org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at
org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at
org.apache.pig.newplan.DepthFirstWalker.walk(DepthFirstWalker.java:53)
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
at
org.apache.pig.newplan.logical.relational.LogicalPlan.validate(LogicalPlan.java:212)
at org.apache.pig.PigServer$Graph.compile(PigServer.java:1767)
at org.apache.pig.PigServer$Graph.access$300(PigServer.java:1443)
at org.apache.pig.PigServer.execute(PigServer.java:1356)
at org.apache.pig.PigServer.executeBatch(PigServer.java:415)
at org.apache.pig.PigServer.executeBatch(PigServer.java:398)
at
org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171)
at
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234)
at
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
at org.apache.pig.Main.run(Main.java:631)
at org.apache.pig.Main.main(Main.java:177)
Caused by: org.apache.avro.SchemaParseException: Can't redefine: col2
at org.apache.avro.Schema$Names.put(Schema.java:1042)
at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:511)
at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:626)
at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:737)
at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:814)
at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:648)
at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:635)
at org.apache.avro.Schema.toString(Schema.java:297)
at org.apache.avro.Schema.toString(Schema.java:287)
at
org.apache.pig.builtin.AvroStorage.setOutputAvroSchema(AvroStorage.java:442)
at org.apache.pig.builtin.AvroStorage.checkSchema(AvroStorage.java:433)
at
org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:54)
... 18 more
{quote}
It is caused by a bug in AvroStorageSchemaConversionUtilities class which uses
tuple name as GenericRecord name as well as the fieldname that wraps the
record.
So it would like to produces the avro schema like the following
{quote}
{
"type": "record",
"name": "data",
"fields": [
{
"name": "col1",
"type": {
"type": "record",
"name": "col1_1",
"fields": [
{
"name": "col2",
"type": {
"type": "record",
"name": "col2",
"fields": [
{
"name": "col1_data",
"type": "string"
}
]
}
}
]
}
},
{
"name": "col2",
"type": {
"type": "array",
"items": {
"type": "record",
"name": "col2",
"fields": [
{
"name": "col2_data",
"type": "string"
}
]
}
}
}
]
}
{quote}
But according to the avro 1.7.7 specs
([https://avro.apache.org/docs/1.7.7/spec.html#Names]), _col2_ has been defined
as record and redefined as array later, it is an invalid schema, unless the
fullname (namespace + name) is unique.
Since AvroStorageSchemaConversionUtilities will generate avro record if the pig
schema is a tuple, we need a way to generate unique _recordName_.
{code:java}
public static Schema resourceSchemaToAvroSchema(final ResourceSchema rs,
String recordName, final String recordNameSpace,
final Map<String, List<Schema>> definedRecordNames,
final Boolean doubleColonsToDoubleUnderscores) throws IOException {
if (rs == null) {
return null;
}
recordName = toAvroName(recordName, doubleColonsToDoubleUnderscores);
List<Schema.Field> fields = new ArrayList<Schema.Field>();
Schema newSchema = Schema.createRecord(
recordName, null, recordNameSpace, false);
{code:java}
The AvroStorage from piggybank solved this problem by defining a static method
and generate unique recordname. We can implement the similar method for the
builtin AvroStorage
was:
Pig ResourceSchema allows to use same field name but different type when they
are not in the same level. The pig schema like
{quote}
data: {col1: (col2: (col1_data: chararray)),col2: {col2: (col2_data:
chararray)}}
{quote}
Although _col2_ has been redefined, they are not appeared in the same level, it
is a totally valid pig schema.
However, once it is translated by AvroStorage, it will throw exception
{quote}
Can't redefine: col2
at
org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:64)
at
org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:66)
at
org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64)
at
org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at
org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at
org.apache.pig.newplan.DepthFirstWalker.walk(DepthFirstWalker.java:53)
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
at
org.apache.pig.newplan.logical.relational.LogicalPlan.validate(LogicalPlan.java:212)
at org.apache.pig.PigServer$Graph.compile(PigServer.java:1767)
at org.apache.pig.PigServer$Graph.access$300(PigServer.java:1443)
at org.apache.pig.PigServer.execute(PigServer.java:1356)
at org.apache.pig.PigServer.executeBatch(PigServer.java:415)
at org.apache.pig.PigServer.executeBatch(PigServer.java:398)
at
org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171)
at
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234)
at
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
at org.apache.pig.Main.run(Main.java:631)
at org.apache.pig.Main.main(Main.java:177)
Caused by: org.apache.avro.SchemaParseException: Can't redefine: col2
at org.apache.avro.Schema$Names.put(Schema.java:1042)
at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:511)
at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:626)
at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:737)
at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:814)
at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:648)
at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:635)
at org.apache.avro.Schema.toString(Schema.java:297)
at org.apache.avro.Schema.toString(Schema.java:287)
at
org.apache.pig.builtin.AvroStorage.setOutputAvroSchema(AvroStorage.java:442)
at org.apache.pig.builtin.AvroStorage.checkSchema(AvroStorage.java:433)
at
org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:54)
... 18 more
{quote}
It is caused by a bug in AvroStorageSchemaConversionUtilities class which uses
tuple name as GenericRecord name as well as the fieldname that wraps the
record.
So it would like to produces the avro schema like the following
{code:json}
{
"type": "record",
"name": "data",
"fields": [
{
"name": "col1",
"type": {
"type": "record",
"name": "col1_1",
"fields": [
{
"name": "col2",
"type": {
"type": "record",
"name": "col2",
"fields": [
{
"name": "col1_data",
"type": "string"
}
]
}
}
]
}
},
{
"name": "col2",
"type": {
"type": "array",
"items": {
"type": "record",
"name": "col2",
"fields": [
{
"name": "col2_data",
"type": "string"
}
]
}
}
}
]
}
{code:json}
But according to the avro 1.7.7 specs
([https://avro.apache.org/docs/1.7.7/spec.html#Names]), _col2_ has been defined
as record and redefined as array later, it is an invalid schema, unless the
fullname (namespace + name) is unique.
Since AvroStorageSchemaConversionUtilities will generate avro record if the pig
schema is a tuple, we need a way to generate unique _recordName_.
{code: java}
public static Schema resourceSchemaToAvroSchema(final ResourceSchema rs,
String recordName, final String recordNameSpace,
final Map<String, List<Schema>> definedRecordNames,
final Boolean doubleColonsToDoubleUnderscores) throws IOException {
if (rs == null) {
return null;
}
recordName = toAvroName(recordName, doubleColonsToDoubleUnderscores);
List<Schema.Field> fields = new ArrayList<Schema.Field>();
Schema newSchema = Schema.createRecord(
recordName, null, recordNameSpace, false);
{code: java}
The AvroStorage from piggybank solved this problem by defining a static method
and generate unique recordname. We can implement the similar method for the
builtin AvroStorage
> Builtin AvroStorage generates the incorrect avro schema when same pig field
> names appears in the alias
> ------------------------------------------------------------------------------------------------------
>
> Key: PIG-5115
> URL: https://issues.apache.org/jira/browse/PIG-5115
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.17.0
> Reporter: Anyi Li
>
> Pig ResourceSchema allows to use same field name but different type when they
> are not in the same level. The pig schema like
> {quote}
> data: {col1: (col2: (col1_data: chararray)),col2: {col2: (col2_data:
> chararray)}}
> {quote}
> Although _col2_ has been redefined, they are not appeared in the same level,
> it is a totally valid pig schema.
> However, once it is translated by AvroStorage, it will throw exception
> {quote}
> Can't redefine: col2
> at
> org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:64)
> at
> org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:66)
> at
> org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64)
> at
> org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
> at
> org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
> at
> org.apache.pig.newplan.DepthFirstWalker.walk(DepthFirstWalker.java:53)
> at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
> at
> org.apache.pig.newplan.logical.relational.LogicalPlan.validate(LogicalPlan.java:212)
> at org.apache.pig.PigServer$Graph.compile(PigServer.java:1767)
> at org.apache.pig.PigServer$Graph.access$300(PigServer.java:1443)
> at org.apache.pig.PigServer.execute(PigServer.java:1356)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:415)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:398)
> at
> org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171)
> at
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234)
> at
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
> at org.apache.pig.Main.run(Main.java:631)
> at org.apache.pig.Main.main(Main.java:177)
> Caused by: org.apache.avro.SchemaParseException: Can't redefine: col2
> at org.apache.avro.Schema$Names.put(Schema.java:1042)
> at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:511)
> at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:626)
> at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:737)
> at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:814)
> at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:648)
> at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:635)
> at org.apache.avro.Schema.toString(Schema.java:297)
> at org.apache.avro.Schema.toString(Schema.java:287)
> at
> org.apache.pig.builtin.AvroStorage.setOutputAvroSchema(AvroStorage.java:442)
> at
> org.apache.pig.builtin.AvroStorage.checkSchema(AvroStorage.java:433)
> at
> org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:54)
> ... 18 more
> {quote}
> It is caused by a bug in AvroStorageSchemaConversionUtilities class which
> uses tuple name as GenericRecord name as well as the fieldname that wraps the
> record.
> So it would like to produces the avro schema like the following
> {quote}
> {
> "type": "record",
> "name": "data",
> "fields": [
> {
> "name": "col1",
> "type": {
> "type": "record",
> "name": "col1_1",
> "fields": [
> {
> "name": "col2",
> "type": {
> "type": "record",
> "name": "col2",
> "fields": [
> {
> "name": "col1_data",
> "type": "string"
> }
> ]
> }
> }
> ]
> }
> },
> {
> "name": "col2",
> "type": {
> "type": "array",
> "items": {
> "type": "record",
> "name": "col2",
> "fields": [
> {
> "name": "col2_data",
> "type": "string"
> }
> ]
> }
> }
> }
> ]
> }
> {quote}
> But according to the avro 1.7.7 specs
> ([https://avro.apache.org/docs/1.7.7/spec.html#Names]), _col2_ has been
> defined as record and redefined as array later, it is an invalid schema,
> unless the fullname (namespace + name) is unique.
> Since AvroStorageSchemaConversionUtilities will generate avro record if the
> pig schema is a tuple, we need a way to generate unique _recordName_.
> {code:java}
> public static Schema resourceSchemaToAvroSchema(final ResourceSchema rs,
> String recordName, final String recordNameSpace,
> final Map<String, List<Schema>> definedRecordNames,
> final Boolean doubleColonsToDoubleUnderscores) throws IOException {
> if (rs == null) {
> return null;
> }
> recordName = toAvroName(recordName, doubleColonsToDoubleUnderscores);
> List<Schema.Field> fields = new ArrayList<Schema.Field>();
> Schema newSchema = Schema.createRecord(
> recordName, null, recordNameSpace, false);
> {code:java}
> The AvroStorage from piggybank solved this problem by defining a static
> method and generate unique recordname. We can implement the similar method
> for the builtin AvroStorage
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)