Re: Holding onto info when doing a udf on a bag

Dmitriy Ryaboy Mon, 10 Jan 2011 18:03:38 -0800

I think it's interesting to see what motivates different companies to choose
Pig, what issues they have encountered and how they solved them, the general
architecture, etc.


There are a few slide decks floating around the internet about how Pig is
being used in production at Yahoo, Twitter, LinkedIn, Mendeley, Meebo, and a
bunch of others; you can try looking at them for inspiration.

Curious about what you mean when you say "serious data" :)

D

On Mon, Jan 10, 2011 at 5:41 PM, Jonathan Coveney <[email protected]> wrote:

> If we ever do anything really worth writing about, maybe I'll ask the
> higher-ups if we can do a case study... I'm not sure what sort of use
> information would best benefit the Pig community, any thoughts?
>
> But I would love to give back, and show that Pig can handle some serious
> data.
>
> 2011/1/10 Dmitriy Ryaboy <[email protected]>
>
> > Absolutely.
> > Would love to hear what you are doing once it goes in production by the
> > way.
> >
> > D
> >
> > On Mon, Jan 10, 2011 at 2:59 PM, Jonathan Coveney <[email protected]> wrote:
> >
> > > Thank you Julien.
> > >
> > > > Once again I want to thank everyone for their help... I know that I use
> > > > the listserv a lot, but you guys have really helped me turn Pig into a
> > > > powerful tool in my workplace, and I know that Pig benefits from being
> > > > used on large production systems.
> > >
> > > Jon
> > >
> > > 2011/1/10 Julien Le Dem <[email protected]>
> > >
> > > > Hi Jonathan,
> > > > It's input.getField(1).schema
> > > > You can get the schema of your input by overriding Schema
> > > > outputSchema(Schema) but it looks like you figured that out.
> > > > > outputSchema is called on the client side, so if you want to make use
> > > > > of the input schema in exec(Tuple) you need to pass it in the UDF
> > > > > context:
> > > > > Properties properties =
> > > > >     UDFContext.getUDFContext().getUDFProperties(this.getClass());
> > > > > properties.put("inputSchema", inputSchema);
> > > > Julien
> > > >
> > > > On 1/10/11 1:25 PM, "Jonathan Coveney" <[email protected]> wrote:
> > > >
> > > > > I was able to get it to work (I just didn't override the schema), but
> > > > > I'd rather like it to have the schema so that describes and whatnot
> > > > > work.
> > > >
> > > > > Is there no way, given a Schema with fields, to get the Schema of one
> > > > > of those fields? I can try to make a hack or something, but is there a
> > > > > limitation as to why you can't do Schema inner = input.getSchema(1)
> > > > > (instead of getField, which returns a Schema.FieldSchema, a getSchema
> > > > > function which gave the actual schema of the given object?).
> > > >
> > > > As always, I appreciate the help.
> > > >
> > > > 2011/1/10 Jonathan Coveney <[email protected]>
> > > >
> > > > > > I was under the impression that for Bag->Bag functions, providing
> > > > > > the schema made things much faster?
> > > > >
> > > > >
> > > > > 2011/1/10 Dmitriy Ryaboy <[email protected]>
> > > > >
> > > > > >> Heck, if you know the schema at runtime, you could pass in a
> > > > > >> string describing the schema as another argument.
> > > > > >> Or pass it in during initialization:
> > > > > >>
> > > > > >> define udfWithSchema myUdf('a:int, b:chararray');
> > > > >>
> > > > >> What do you need the schema for, exactly?
> > > > >>
> > > > >> D
> > > > >>
> > > > > >> On Mon, Jan 10, 2011 at 10:36 AM, Jonathan Coveney <[email protected]> wrote:
> > > > >>
> > > > > >> > I thought about that, but I do not know how long the tuple is.
> > > > > >> > This isn't an issue from a calculation perspective, I suppose, as
> > > > > >> > long as you make sure that prop is the first thing in the bag.
> > > > > >> > But from a schema...hmm, I guess you could just grab the schema
> > > > > >> > of the other elements and build it accordingly?
> > > > >> >
> > > > >> > 2011/1/10 Dmitriy Ryaboy <[email protected]>
> > > > >> >
> > > > >> > > Jonathan, can't you just pass the bag A in?
> > > > >> > >
> > > > > >> > > On Mon, Jan 10, 2011 at 9:56 AM, Jonathan Coveney <[email protected]> wrote:
> > > > >> > >
> > > > > >> > > > So I have a udf, let's call it myudf.bag2bag, which takes a
> > > > > >> > > > bag which contains "prop," and creates a new bag of tuples
> > > > > >> > > > based on that.
> > > > >> > > >
> > > > >> > > > I have data in the form of
> > > > >> > > >
> > > > >> > > > id    prop    other1    other2
> > > > >> > > >
> > > > >> > > > If all I care about is running the udf, obviously I can do
> > > > >> > > >
> > > > >> > > > A = LOAD 'file' AS (id, prop, other1, other2);
> > > > >> > > > B = GROUP A BY id;
> > > > > >> > > > C = FOREACH B GENERATE group, FLATTEN(myudf.bag2bag(A.prop));
> > > > >> > > >
> > > > >> > > > And all is fine
> > > > >> > > >
> > > > > >> > > > But what do I do if I want to hold on to the other data,
> > > > > >> > > > especially if you don't know how much there will be (from a
> > > > > >> > > > bag2bag perspective)
> > > > >> > > >
> > > > > >> > > > My thought is that in bag2bag, you can pass in a tuple of
> > > > > >> > > > "extras," which you then pass back, i.e.
> > > > > >> > > >
> > > > > >> > > > C = FOREACH B GENERATE group, FLATTEN(myudf.bag2bag(A.prop, (A.other1, A.other2)));
> > > > >> > > >
> > > > > >> > > > I'm just not sure how I would specify the schema for this, in
> > > > > >> > > > such a way that any number of entries could be in the tuple,
> > > > > >> > > > and then you could just sort of reference them later.
> > > > >> > > >
> > > > >> > > > Is this possible?
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > > >
> > > >
> > > >
> > >
> >
>
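
For readers following the thread, the approach discussed above (pass the whole
bag A in, transform "prop", and carry every other field through unchanged, then
build the output schema from the input schema the same way) can be sketched in
plain Java. This is only a model of the idea under stated assumptions, not Pig
API code: a bag is modeled as a list of tuples and a tuple as a list of values.
The names Bag2BagSketch, transformProp, outputFields, and the "_out" suffix are
hypothetical and do not come from the thread or from Pig.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the "pass the whole bag in" idea; no Pig classes are used.
// A bag is modeled as a list of tuples, and a tuple as a list of field values.
class Bag2BagSketch {

    // Hypothetical stand-in for whatever the real UDF computes from "prop".
    static Object transformProp(Object prop) {
        return String.valueOf(prop).toUpperCase();
    }

    // Models exec(): emit the transformed prop first, then every other field
    // of the tuple unchanged, however many there are.
    static List<List<Object>> bag2bag(List<List<Object>> bag, int propIndex) {
        List<List<Object>> out = new ArrayList<>();
        for (List<Object> tuple : bag) {
            List<Object> result = new ArrayList<>();
            result.add(transformProp(tuple.get(propIndex)));
            for (int i = 0; i < tuple.size(); i++) {
                if (i != propIndex) {
                    result.add(tuple.get(i));
                }
            }
            out.add(result);
        }
        return out;
    }

    // Models outputSchema(): build the output field names from the input
    // field names, mirroring the layout produced by bag2bag above.
    static List<String> outputFields(List<String> inputFields, int propIndex) {
        List<String> out = new ArrayList<>();
        out.add(inputFields.get(propIndex) + "_out"); // hypothetical naming
        for (int i = 0; i < inputFields.size(); i++) {
            if (i != propIndex) {
                out.add(inputFields.get(i));
            }
        }
        return out;
    }
}
```

With propIndex = 1, a tuple (1, "x", "a", "b") comes back as ("X", 1, "a", "b"),
and input fields (id, prop, other1, other2) map to (prop_out, id, other1,
other2). Because neither method hard-codes the number of extra fields, the same
code handles tuples of any width, which is the property asked about in the
original question.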
