I'm not sure this is the best way to do it, but you can run "describe
dirtydata" to see the schema that Pig defines for your Avro data.
If you already have an Avro schema stored in a .avsc file, use that;
otherwise you can use the Avro command-line tool to generate one first.
Once you have the schema, you can pass the schema file to AvroStorage():
dirtydata = LOAD '/data/0120422' USING
    AvroStorage('no_schema_check', 'schema_file',
                'hdfs path of your avsc avro schema file');
describe dirtydata;
You should now see the schema (columns) of your relation. Once you
have the schema for your Pig relation, you can refer to the columns of a
relation used in a join statement with the :: operator.
So, let's say your dirtydata has two columns (name, salary); after the
join you can refer to them as dirtydata::name and dirtydata::salary.
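As a rough sketch of that (only dirtydata with name and salary comes from
the example above; the second relation "depts", its fields, the CSV paths
and the use of PigStorage are made up for illustration):

-- Hypothetical inputs; in your case dirtydata would come from AvroStorage.
dirtydata = LOAD '/tmp/dirtydata.csv' USING PigStorage(',')
    AS (name:chararray, salary:int);
depts = LOAD '/tmp/depts.csv' USING PigStorage(',')
    AS (name:chararray, department:chararray);
-- After the join, each field is prefixed with the relation it came from.
joined = JOIN dirtydata BY name, depts BY name;
DESCRIBE joined;
-- Keep only the fields that originated in dirtydata.
onlydirty = FOREACH joined GENERATE dirtydata::name AS name,
    dirtydata::salary AS salary;
DESCRIBE onlydirty;

The first DESCRIBE should list the fields with their dirtydata:: and
depts:: prefixes, which are the names to use in the FOREACH.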
It's best to run describe on any relation whenever you are unsure
how to refer to or project its fields. Hope that helps.
Regards
Prav
On Tue, Oct 14, 2014 at 12:02 PM, Jakub Stransky wrote:
> Hello experienced users,
>
> I am new to Pig and I have what is probably a beginner's question: is it
> possible to get the original fields from a relation after a join?
>
> Suppose I have a relation A which I want to filter by data from relation B.
> In order to find matching records I join the relations and then perform a
> filter. Then I would like to get just the fields from relation A.
>
> Practical example:
> dirtydata = load '/data/0120422' using AvroStorage();
>
> sodtr = filter dirtydata by TransactionBlockNumber == 1;
> sto = foreach sodtr generate Dob.Value as Dob,StoreId,
> Created.UnixUtcTime;
> g = GROUP sto BY (Dob,StoreId);
> sodtime = FOREACH g GENERATE group.Dob AS Dob, group.StoreId as StoreId,
> MAX(sto.UnixUtcTime) AS latestStartOfDayTime;
>
> joined = join dirtydata by (Dob.Value, StoreId) LEFT OUTER, sodtime by
> (Dob, StoreId);
>
> cleandata = filter joined by dirtydata::Created.UnixUtcTime >=
> sodtime::latestStartOfDayTime;
> finaldata = FOREACH cleandata generate dirtydata:: ; -- <-- HERE I would
> like to get just the columns which belonged to the original relation. The
> Avro schema is rather complicated, so it is not feasible to name all the
> columns here.
>
> What is the best practice in that case? Is there any function? Or Is there
> a completely different approach to solve this kind of tasks?
>
> Thanks a lot for any help
> Jakub