Hi Cheolsoo/Pig User Group,
I am using the Pig 0.11 piggybank AvroStorage. When merging multiple
schemas where default values have been specified in the Avro schema,
AvroStorage puts nulls in the merged data set.
Is this a known bug in the current implementation of AvroStorage? Using an
example provided by one of my colleagues: the final dataset should contain
the declared defaults ("NU", 0, "OU", etc.) for all values where the columns
do not exist.
==> Employee3.avro <==
{
"type" : "record",
"name" : "employee",
"fields":[
{"name" : "name", "type" : "string", "default" : "NU"},
{"name" : "age", "type" : "int", "default" : 0 },
{"name" : "dept", "type": "string", "default" : "DU"}
]
}
==> Employee4.avro <==
{
"type" : "record",
"name" : "employee",
"fields":[
{"name" : "name", "type" : "string", "default" : "NU"},
{"name" : "age", "type" : "int", "default" : 0},
{"name" : "dept", "type": "string", "default" : "DU"},
{"name" : "office", "type": "string", "default" : "OU"}
]
}
==> Employee6.avro <==
{
"type" : "record",
"name" : "employee",
"fields":[
{"name" : "name", "type" : "string", "default" : "NU"},
{"name" : "lastname", "type": "string", "default" : "LNU"},
{"name" : "age", "type" : "int","default" : 0},
{"name" : "salary", "type": "int", "default" : 0},
{"name" : "dept", "type": "string","default" : "DU"},
{"name" : "office", "type": "string","default" : "OU"}
]
}
The Pig script:
employee = load '$input' using
org.apache.pig.piggybank.storage.avro.AvroStorage('multiple_schemas');
describe employee;
dump employee;
The call:
dump_employees.pig employee{3,4,6}.ser
The output:
employee: {name: chararray,age: int,dept: chararray,lastname: chararray,salary:
int,office: chararray}
(Milo,30,DH,,,)
(Asmya,34,PQ,,,)
(Baljit,23,RS,,,)
(Pune,60,Astrophysics,Warriors,5466,UTA)
(Rajsathan,20,Biochemistry,Royals,1378,Stanford)
(Chennai,50,Microbiology,Superkings,7338,Hopkins)
(Mumbai,20,Applied Math,Indians,4468,UAH)
(Praj,54,RMX,,,Champaign)
(Buba,767,HD,,,Sunnyvale)
(Manku,375,MS,,,New York)
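For reference, this is the behavior I expected, sketched as a hand-rolled Python simulation (not AvroStorage itself): when a record written with an older schema is read against the merged schema, the fields it lacks should be filled with the defaults declared in the schemas above, not with nulls. The field list and defaults below are copied from the employee schemas; the resolve helper is hypothetical.

```python
# Merged field order and defaults, taken from the Employee3/4/6 schemas above.
MERGED_FIELDS = [
    ("name", "NU"),
    ("age", 0),
    ("dept", "DU"),
    ("lastname", "LNU"),
    ("salary", 0),
    ("office", "OU"),
]

def resolve(record):
    """Fill fields missing from an old-schema record with the declared defaults."""
    return tuple(record.get(name, default) for name, default in MERGED_FIELDS)

# A record written with the Employee3 schema (no lastname/salary/office):
old_record = {"name": "Milo", "age": 30, "dept": "DH"}
print(resolve(old_record))  # -> ('Milo', 30, 'DH', 'LNU', 0, 'OU')
```

That is, the first row of the dump above would come out as (Milo,30,DH,LNU,0,OU) rather than (Milo,30,DH,,,).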
Regards
Viraj
-----Original Message-----
From: Cheolsoo Park [mailto:[email protected]]
Sent: Tuesday, April 30, 2013 9:10 PM
To: [email protected]
Cc: Qi, Runping
Subject: Re: Override input schema in AvroStorage
Hi Steven,
The new AvroStorage will let you specify the input schema:
https://issues.apache.org/jira/browse/PIG-3015
In fact, somebody made the same request in a comment on the jira, which I am
copying and pasting below:
Furthermore, we occasionally have issues with pig jobs picking the old
> schema when we have a schema update. Manually specifying the schema
> would fix this and give us more flexibility in defining the data we
> want pig to pull from a file.
This jira is a work in progress, but hopefully it will be in the next major release.
Thanks,
Cheolsoo
On Sat, Apr 27, 2013 at 3:24 PM, Enns, Steven <[email protected]> wrote:
> Resending now that I am subscribed :)
>
> On 4/25/13 4:01 PM, "Enns, Steven" <[email protected]> wrote:
>
> >Hi everyone,
> >
> >I would like to override the input schema in AvroStorage to make a
> >pig script robust to schema evolution. For example, suppose a new
> >field is added to an avro schema with a default value of null. If
> >the input to a pig script using this field includes both old and new
> >data, AvroStorage will merge the input schemas from the old and new
> >data. However, if the input includes only old data, the new schema
> >will not be available to AvroStorage and pig will fail to interpret
> >the script with an error such as "projected field [newField] does not
> >exist in schema". If AvroStorage accepted an input schema, the
> >script would be valid for both the new and old data. Is there any plan to
> >implement this?
> >
> >Thanks,
> >Steve
> >
>
>