Thanks. That works perfectly. However I am still left with the disambiguation
operator :: from the original relation, e.g. dirtydata::Version. Is there a
way to get rid of this? It causes me problems when trying to store the data
as Avro data files using AvroStorage().
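One way to sidestep the :: prefixes in the stored files may be to hand AvroStorage the target schema at store time, the same way the earlier reply did for the load. This is only a sketch: the output path and .avsc location are placeholders, and it assumes the piggybank AvroStorage takes the output field names from the supplied schema file rather than from the relation's aliases, which is worth verifying against your version.

```
-- store against the original .avsc so the field names come from
-- the schema file instead of the dirtydata::-prefixed aliases
-- (both paths below are hypothetical)
STORE finaldata INTO '/data/0120422_clean'
    USING AvroStorage('no_schema_check',
                      'schema_file', 'hdfs:///schemas/dirtydata.avsc');
```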



On 14 October 2014 15:27, praveenesh kumar <[email protected]> wrote:

> If you know the first and the last column, you can use the pig range
> operator, something like "foreach <relation> generate
> <first_col>..<last_col>;"
> Pig will automatically take all the columns that come in between those
> two.
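A minimal sketch of that range projection applied to the joined relation from the original post below; Dob and Version are stand-ins for whatever the first and last columns of the original schema actually are:

```
-- keep only the columns that came from dirtydata; the :: prefix
-- disambiguates them from the sodtime columns added by the join
finaldata = FOREACH cleandata GENERATE dirtydata::Dob .. dirtydata::Version;
```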
>
> On Tue, Oct 14, 2014 at 1:40 PM, Jakub Stransky <[email protected]>
> wrote:
>
> > Hi Prav,
> >
> > thanks for your answer. I already have an Avro schema for the input
> > data - dirtydata in the example. The sample works just fine. Essentially
> > I am loading data, then I eliminate some rows given by a condition, and
> > then I would like to store them as clean data. The only issue here is
> > that the join columns are added during the elimination when performing
> > the join. So the task I am facing is to remove those columns after the
> > row elimination. The schema is rather complex, so naming all columns is
> > not an option, as I want to store the clean data with the same, original
> > schema.
> >
> > Does anybody know if this is possible, or a simpler way of performing
> > this activity?
> >
> > Many thanks
> > Jakub
> >
> > On 14 October 2014 13:46, praveenesh kumar <[email protected]> wrote:
> >
> > > Not sure if it's the best way to do it, but what you can do is run
> > > "describe dirtydata" to see what schema Pig defines for your Avro
> > > data. You may already have an Avro schema stored somewhere in a .avsc
> > > file, or you can use the Avro command line tool to generate the schema
> > > in a .avsc file first.
> > >
> > > Once you have the schema, you can pass the schema file using
> > AvroStorage()
> > > -
> > >
> > > dirtydata = LOAD '/data/0120422' USING
> > >     AvroStorage('no_schema_check', 'schema_file',
> > >                 'hdfs path of your avsc avro schema file');
> > > describe dirtydata;
> > >
> > > You should be able to see the schema/columns of your relation. Once
> > > you have the schema for your pig relation, you can refer to the
> > > columns of the relation used in the join statement by the :: operator.
> > >
> > > So let's say your dirtydata has 2 columns (name, salary); you can
> > > refer to them (after the join) using dirtydata::name, dirtydata::salary
> > >
> > > It's preferable to use the describe statement on any relation if you
> > > are confused about how to refer to or project from a given relation.
> > > Hope that helps.
> > >
> > > Regards
> > > Prav
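A small illustration of the :: referencing described above; the relations, loader, and columns here are made up for the sketch:

```
-- two hypothetical relations that both carry a 'name' column
dirtydata = LOAD 'dirty' AS (name:chararray, salary:double);
other     = LOAD 'other' AS (name:chararray);
joined    = JOIN dirtydata BY name, other BY name;
-- after the join, the fields are dirtydata::name, dirtydata::salary,
-- and other::name, so the :: prefix picks the copy you mean
projected = FOREACH joined GENERATE dirtydata::name, dirtydata::salary;
```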
> > >
> > > On Tue, Oct 14, 2014 at 12:02 PM, Jakub Stransky <
> [email protected]>
> > > wrote:
> > >
> > > > Hello experienced users,
> > > >
> > > > I am new to Pig and I probably have a beginner's question: is it
> > > > possible to get the original fields from the relation after a join?
> > > >
> > > > Suppose I have a relation A which I want to filter by data from
> > > > relation B. In order to find matching records I join the relations
> > > > and then perform a filter. Then I would like to get just the fields
> > > > from relation A.
> > > >
> > > > Practical example:
> > > > dirtydata = load '/data/0120422' using AvroStorage();
> > > >
> > > > sodtr = filter dirtydata by TransactionBlockNumber == 1;
> > > > sto   = foreach sodtr generate Dob.Value as Dob, StoreId,
> > > >         Created.UnixUtcTime;
> > > > g     = GROUP sto BY (Dob,StoreId);
> > > > sodtime = FOREACH g GENERATE group.Dob AS Dob, group.StoreId as
> > > >           StoreId, MAX(sto.UnixUtcTime) AS latestStartOfDayTime;
> > > >
> > > > joined = join dirtydata by (Dob.Value, StoreId) LEFT OUTER, sodtime
> > > >          by (Dob, StoreId);
> > > >
> > > > cleandata = filter joined by dirtydata::Created.UnixUtcTime >=
> > > >             sodtime::latestStartOfDayTime;
> > > > finaldata = FOREACH cleandata generate dirtydata:: ;  -- <-- HERE I
> > > > would like to get just the columns which belonged to the original
> > > > relation. The Avro schema is rather complicated so it is not
> > > > feasible to name all columns here.
> > > >
> > > > What is the best practice in that case? Is there any function? Or is
> > > > there a completely different approach to solving this kind of task?
> > > >
> > > > Thanks a lot for any help
> > > > Jakub
> > > >
> > > >
> > > >



-- 
Jakub Stransky
cz.linkedin.com/in/jakubstransky
