Re: Using PIG with complex SQL Statement

Vineet Mishra Wed, 05 Nov 2014 04:08:09 -0800

Hi Dan,

In your above mentioned snippet of script at last line


D = FOREACH D1 GENERATE FLATTEN(group) AS (category_id, category_name),
(int)COUNT(D0) AS cat_id_1_count:int;

I wanted to perform some group level operation, such as conditional based
additive operation for each grouped rows. How can I achieve the same
preserving in constant time.

Currently I have applied some filters inside FOREACH of group and
processing accordingly but that is making the job to be very tedious,
taking almost 25 times more than the expected run time. Looking for any
work around for it or any alternate approach on urgent basis. PIG version
used is 0.11.

Thanks!




On Thu, Oct 30, 2014 at 7:16 PM, Dan DeCapria, CivicScience <
[email protected]> wrote:

> Hi Vineet,
>
> Not entirely sure I'm understanding the problem correctly, but perhaps the
> error you are getting can be fixed by:
> sum_up_count = FILTER cat_ids BY (category_id == 1);
>
> I think that having a more clear description of your use case and input
> data sets along with your current pig script in development will help us in
> debugging this better.
> As such, this might not help, but here's a crack at reverse-engineering
> your problem space:
>
> A0 = LOAD '/input_01' USING PigStorage('\t','-noschema') AS (id:int,
> category_id:int, category_name: chararray, meta:chararray);
> A1 = FOREACH A0 GENERATE id, category_id, category_name;
> A = DISTINCT A1;
>
> B0 = LOAD '/input_02' USING PigStorage('\t','-noschema') AS (category_id:
> int, more_meta:chararray);
> B1 = FOREACH B0 GENERATE category_id;
> B = DISTINCT B1;
>
> C0 = JOIN A BY (category_id), B BY (category_id); -- 0,1,2, 3
> C1 = FOREACH C0 GENERATE $0 AS id, $1 AS category_id, $2 AS category_name;
> C = DISTINCT C1;
>
> D0 = FILTER C BY (category_id == 1);
> D1 = GROUP D0 BY (category_id, category_name);
> D = FOREACH D1 GENERATE FLATTEN(group) AS (category_id, category_name),
> (int)COUNT(D0) AS cat_id_1_count:int;
>
>
> Hope this helps,  -Dan
>
> On Thu, Oct 30, 2014 at 7:10 AM, Vineet Mishra <[email protected]>
> wrote:
>
> > Hi Dan,
> >
> > I am trying to put Filter inside a Foreach, the description of the
> group(on
> > which the FOREACH iteration is happening) is mentioned below. I am trying
> > to get counts of which all are passing the filter,
> >
> > Describe grp:
> > grp: {group: (a::category_id: int,a::category_name: chararray),joind:
> > {(a::id: int,a::category_id: int, b::category_id: int)}}
> >
> > Script:
> > purified = FOREACH grp {
> >  cat_ids_bag = FOREACH joind generate b::category_id;
> >
> > cat_ids = foreach cat_ids_bag generate flatten(category_id);
> >
> > sum_up_count = FILTER cat_ids BY category_id IN (1);
> > }
> >
> > Its throwing error,
> >   Syntax error, unexpected symbol at or near 'IN'
> >
> > Looking out for urgent response.
> > Thanks!
> >
> > On Tue, Oct 28, 2014 at 7:49 PM, Dan DeCapria, CivicScience <
> > [email protected]> wrote:
> >
> > > Hi Vineet,
> > >
> > > Expanding upon Lorand's resources, please note this all really depends
> on
> > > your actual use case.  When blocking out code to transform from SQL to
> > Pig
> > > latin, it's usually a good idea to just flow-chart plan the logical
> > process
> > > of what you want to do - just like you would for SQL queries.  Then
> it's
> > > just a matter of optimizing said queries - again, just like you would
> > with
> > > SQL queries on the DBA layer.  the 'under-the-hood' optimizations to MR
> > is
> > > done by Pig.
> > >
> > > Generically, this follows a simple paradigm, ie):
> > >
> > > --  optional runner: nohup pig -p REDUCERS=180 -f
> > /home/hadoop/my_file.pig
> > > 2>&1 > /tmp/my_file.out &
> > >
> > > --  some example configurations, ie) gzip compress the output
> > > SET output.compression.enabled true;
> > > SET output.compression.codec org.apache.hadoop.io.compress.GzipCodec;
> > > --SET default_parallel $REDUCERS;
> > >
> > > A0 = LOAD '/path/to/hdfs/data.dat' USING some.load.func() AS (the typed
> > > schema); -- loader data source A
> > > A1 = FOREACH A0 GENERATE stuff; -- projection steps
> > > A = FILTER A1 BY (stuff); -- filter prior to JOIN
> > >
> > > B0 = LOAD '/path/to/hdfs/data.dat' USING some.load.func() AS (the typed
> > > schema); -- loader data source B
> > > B1 = FOREACH B0 GENERATE stuff; -- projection steps
> > > B = FILTER B1 BY (stuff); -- filter prior to JOIN
> > >
> > > C0 = JOIN A BY (pk), B BY (pk) PARALLEL $REDUCERS; -- where size(A) >
> > > size(B), PARALLEL to force use of all MR capacity
> > > C = FOREACH C0 GENERATE stuff; -- re-alias the JOIN step fields to what
> > you
> > > want, projection
> > >
> > > D0 = GROUP C BY (cks); -- perform your grouping operation
> > > D = FOREACH D0 GENERATE FLATTEN(group) AS (cks), (int)COUNT(C) AS
> > > example_count:int; -- whatever aggregation stats you wanted to perform
> > wrt
> > > the GROUP BY operation
> > >
> > > STORE D INTO '/path/to/hdfs/storage/file' USING PigStorage(); -- flat,
> > > tab-delimited file output of typed schema fields from [D]; here I used
> > > PigStorage() store.func
> > >
> > > Hope this helps,  -Dan
> > >
> > >
> > > On Tue, Oct 28, 2014 at 10:09 AM, Lorand Bendig <[email protected]>
> > wrote:
> > >
> > > > Hi Vineet,
> > > >
> > > > I'd recommend you have a look at these excellent resources:
> > > >
> > > > http://hortonworks.com/blog/pig-eye-for-the-sql-guy/
> > > > http://mortar-public-site-content.s3-website-us-east-1.
> > > > amazonaws.com/Mortar-Pig-Cheat-Sheet.pdf
> > > > http://www.slideshare.net/trihug/practical-pig/11
> > > >
> > > > --Lorand
> > > >
> > > >
> > > > On 28/10/14 14:34, Vineet Mishra wrote:
> > > >
> > > >> Hi,
> > > >>
> > > >> I was looking out to transform SQL statement which is consisting of
> > > >> multiple clause in the same query specifically, a JOIN followed by
> > some
> > > >> condition(WHERE) and finally grouping on some fields(GROUP BY).
> > > >> Can I have a link or some briefing which can guide me how can I
> > > implement
> > > >> this k/o of complex SQL statement in PIG.
> > > >>
> > > >> Thanks!
> > > >>
> > > >>
> > > >
> > >
> > >
> > > --
> > > Dan DeCapria
> > > CivicScience, Inc.
> > > Back-End Data IS/BI/DM/ML Specialist
> > >
> >
>
>
>
> --
> Dan DeCapria
> CivicScience, Inc.
> Back-End Data IS/BI/DM/ML Specialist
>

Re: Using PIG with complex SQL Statement

Reply via email to