Hi, I'm working alongside Ha on this.

You were both right and wrong about the PigStorage format. It *is* a
tab-delimited format (that was our mistake), but the tab-separated fields
*can* contain tuples and bags (using the parenthesis and curly-brace
notation). Anyway, your comments helped us figure out the problem, so thank
you for taking the time to offer the suggestion!
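
For anyone following along, here is a minimal sketch of the layout that
loads for us, using the data4 example from earlier in the thread. The record
is not wrapped in outer parentheses, and the two fields are separated by a
literal tab rather than a comma (shown as <TAB> in the comment):

```pig
-- data4 contains one record; <TAB> stands for a literal tab character:
--   3<TAB>{(5),(6)}
-- With the schema supplied, PigStorage parses the second field as a bag.
X1 = LOAD 'data4' USING PigStorage() AS (a:int, b:bag{(c:int)});
DUMP X1;
```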

Now we've got the example data loading correctly, and we can create a simple
example of the flatten, join, and re-group method you suggested. We added a
small improvement so that we don't have to re-group on many fields at the end.
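
Roughly, the flatten, join, and re-group steps look like the sketch below.
This is an illustration only, not our exact script: the relation names (AF,
J, P, G, R) are made up for this example, it assumes data1 is loaded as two
flat fields rather than one wrapping tuple, and it leaves out the small
improvement mentioned above.

```pig
-- Assumed schemas, based on the data1/data2 example earlier in the thread.
A  = LOAD 'data1' AS (a1:chararray, a2:bag{(a2t:chararray)});
B  = LOAD 'data2' AS (a2t:chararray, x:chararray);

-- 1) Flatten: one row per (a1, a2t) pair instead of one row per bag.
AF = FOREACH A GENERATE a1, FLATTEN(a2) AS a2t;

-- 2) Join the flattened rows to data2 on the a2t key.
J  = JOIN AF BY a2t, B BY a2t;

-- 3) Re-group: keep only the needed fields, then collect the x values
--    for each (a1, a2t) pair back into a bag.
P  = FOREACH J GENERATE AF::a1 AS a1, AF::a2t AS a2t, B::x AS x;
G  = GROUP P BY (a1, a2t);
R  = FOREACH G GENERATE FLATTEN(group) AS (a1, a2t), P.x AS xs;
```

This stops one level short of the fully nested result; a further GROUP on a1
would be needed to collect the a2t entries back under each a1.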

I do have one further question: when we GROUP everything back together at
the end, I notice that the group field is also included in the tuples.
Example:

A = (x, a1, b1)
B = (x, a2, b2)
X = GROUP A on x, B on x

We get: ( x, {(x,a1,b1), (x, a2, b2)} )

This is essentially our desired result, but we don't need the duplicate x
in the inner tuples. Is there an efficient way to produce just this?

( x, {(a1,b1), (a2,b2)} )



-----Original Message-----
From: Pradeep Gollakota [mailto:[email protected]] 
Sent: Thursday, May 23, 2013 10:05 AM
To: [email protected]
Subject: Re: Complex joins

As far as I know, PigStorage cannot handle complex data types such as bags
(it's just a delimiter-separated file). You might have to restructure your
data, use a different storage function, or write a custom storage function.
Since your datamodel is modeled after OO, you might be able to leverage Avro
to maintain your datamodel.


On Wed, May 22, 2013 at 10:51 PM, Ho Duc Ha <[email protected]> wrote:

> We changed the load statement to:
>
> X = load 'data3' using PigStorage() as ( a:chararray, 
> b:bag{(c:chararray)} );
>
> But we get the same results with your statement:
>
> Y = FOREACH X GENERATE b;
> dump Y;
>
> output (of above command)
> -----------------------------------------
> ()
>
> What we really want to create is a set of the tuples in the bag b
> ('5'),('6')
>
> Another example which seems to fail to load properly is this (using 
> ints instead of strings):
>
> file: data4
> -------------
> ( 3, {(5),(6)} )
>
> X1 = load 'data4' using PigStorage() as ( a:int, b:bag{(c:int)} ); 
> dump X1;
>
> result:
> ---------
> (,)
>
> We also tried formatting the data like this, with the extra tuple 
> around it like I see in the output often, no luck:
> ((3, {(5),(6)} ))
>
>
>
>
> > On Wed, May 22, 2013 at 11:32 PM, Sergey Goder <[email protected]> wrote:
>
> > Looks like you're probably not reading the data in correctly. 
> > Perhaps you need to specify the USING PigStorage() syntax and 
> > specify the correct delimiter as an argument.
> >
> > > Also, if you want Y to just be the bag, then you can simply write it
> > > as:
> >
> > Y = FOREACH X GENERATE b;
> >
> >
> > On Wed, May 22, 2013 at 8:51 AM, Ho Duc Ha <[email protected]> wrote:
> >
> > > Actually I think you're right, the process in map/reduce isn't so 
> > > different.
> > >
> > > > However, after trying to do this, we can't understand the output
> > > > we see below. We expected to see only '3' in alias Z, and '5' and
> > > > '6' in alias Y; neither result was as expected.
> > >
> > > > X = load 'data3' as ( a:chararray, b:bag{(c:chararray)} );
> > > > Y = foreach X { W = foreach b generate *; generate W; };
> > > > Z = foreach X generate a;
> > >
> > > data3
> > > ( '3', {( '5' ),('6')} )
> > >
> > > dump X
> > > (( '3', {( '5' ),('6')} ),)
> > >
> > > dump Y
> > > ({})
> > >
> > > dump Z
> > > (( '3', {( '5' ),('6')} ))
> > >
> > >
> > >
> > >
> > > > On Wed, May 22, 2013 at 8:25 PM, Pradeep Gollakota <[email protected]> wrote:
> > >
> > > > Hi All,
> > > >
> > > > > I'm a beginner Pig user and this is my first post to the Pig mailing
> > > > > list.
> > > > >
> > > > > Anyway, to answer your question, the first thing that comes to my mind
> > > > > is that Pig may not be able to do a complex join like that.
> > > > >
> > > > > However, you can first flatten the bag in A, then do your join, and
> > > > > then do a group by to get the result in the format you are looking for.
> > > > > This may not be an ideal solution, but it should work.
> > > >
> > > > Pradeep
> > > >
> > > >
> > > > > On Wed, May 22, 2013 at 8:49 AM, Ho Duc Ha <[email protected]> wrote:
> > > >
> > > > > > We've got a data type that is modeled after a typical object-oriented
> > > > > > data-model format (simple fields, and collections of other objects).
> > > > > > We're trying to accomplish the following join:
> > > > > >
> > > > > > Here's our example input:
> > > > > > -------------------------------------
> > > > > > data1 = {  ( 'a1', { ('a2-thing1'), ('a2-thing2') } )  }
> > > > > > data2 = {  ( 'a2-thing1', 'x-value1' ), ( 'a2-thing1', 'x-value2' )  }
> > > > >
> > > > > Here's what we want to get:
> > > > > --------------------------------------
> > > > > ( 'a1', { ('a2-thing1', {
> > > > > ('x-value1'), ('x-value2') }
> > > > > ) }
> > > > > )
> > > > >
> > > > > > Notice that we are trying to join the collection of a2 fields of the
> > > > > > 1st data set, on the first field in the 2nd data set.
> > > > >
> > > > > We tried this:
> > > > > --------------------
> > > > > > A = load 'data1' as ( a:tuple(a1:chararray, a2:bag{(a2t:chararray)}) );
> > > > > > B = load 'data2' as ( a2t:chararray, x:chararray );
> > > > > > X = join A by a2.a2t, B by a2t;
> > > > >
> > > > > We get this error:
> > > > > ---------------------------
> > > > > ERROR 1128: Cannot find field a2t in 
> > > > > a1:chararray,a2:bag{:tuple(a2t:chararray)}
> > > > >
> > > > > > Try as we might, we cannot find the right way to do this complex
> > > > > > join.
> > > > > > Questions:
> > > > > >   1) Should we be simplifying our data format into a more SQL
> > > > > >      table-like structure and doing more joins to reduce the complexity?
> > > > > >   2) How can we accomplish joining data2's data into the data1
> > > > > >      "objects"?
> > > > >
> > > > > --
> > > > > Ho Duc Ha
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Ho Duc Ha
> > >
> >
>
>
>
> --
> Ho Duc Ha
>
