Re: Holding onto info when doing a udf on a bag

Jonathan Coveney Mon, 10 Jan 2011 15:00:02 -0800

Thank you Julien.

Once again I want to thank everyone for their help... I know that I use the
listserv a lot, but you guys have really helped me turn Pig into a powerful
tool in my workplace, and I know that Pig benefits from being used on large
production systems.


Jon

2011/1/10 Julien Le Dem <[email protected]>

> Hi Jonathan,
> It's input.getField(1).schema
> You can get the schema of your input by overriding Schema
> outputSchema(Schema) but it looks like you figured that out.
> outputSchema is called on the client side so if you want to make use of the
> input schema in exec(Tuple) you need to pass it in the UDF context:
> Properties properties =
> UDFContext.getUDFContext().getUDFProperties(this.getClass());
> properties.put("inputSchema", inputSchema);
> Julien
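
For the archive, a minimal sketch of what Julien describes, under a few assumptions that are not from this thread: the class name Bag2Bag is hypothetical, the UDF is assumed to be called as bag2bag(A.prop, A) so that argument 1 is a bag, and exec() is only a pass-through placeholder. The "inputSchema" key is the one from Julien's snippet.

```java
import java.io.IOException;
import java.util.Properties;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.FrontendException;
import org.apache.pig.impl.logicalLayer.schema.Schema;
import org.apache.pig.impl.util.UDFContext;

// Hypothetical bag-to-bag UDF; assumes it is invoked as bag2bag(A.prop, A).
public class Bag2Bag extends EvalFunc<DataBag> {

    // Called on the client side while the script is being compiled.
    @Override
    public Schema outputSchema(Schema input) {
        try {
            if (input == null || input.size() < 2) {
                return null; // no usable schema information
            }
            // getField(1) returns a Schema.FieldSchema; its public
            // "schema" member holds the nested schema of that argument.
            Schema.FieldSchema extras = input.getField(1);
            Schema inner = extras.schema;

            // Stash the input schema so exec() can read it on the backend.
            Properties props =
                UDFContext.getUDFContext().getUDFProperties(this.getClass());
            props.put("inputSchema", input);

            // Declare the output as a field with the same nested schema.
            return new Schema(
                new Schema.FieldSchema("extras", inner, extras.type));
        } catch (FrontendException e) {
            throw new RuntimeException("Could not read the input schema", e);
        }
    }

    // Called on the backend, once per input tuple.
    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input == null || input.size() < 2) {
            return null;
        }
        Properties props =
            UDFContext.getUDFContext().getUDFProperties(this.getClass());
        // May be null if outputSchema() never stored it.
        Schema inputSchema = (Schema) props.get("inputSchema");

        // Placeholder: real logic would use inputSchema to interpret the
        // fields; here the second argument is returned unchanged.
        return (DataBag) input.get(1);
    }
}
```

With something like this in place, DESCRIBE on the relation that calls the UDF should show the nested fields instead of an empty bag schema.
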
>
> On 1/10/11 1:25 PM, "Jonathan Coveney" <[email protected]> wrote:
>
> I was able to get it to work (I just didn't override the schema), but I'd
> rather it have the schema so that DESCRIBE and whatnot work.
>
> Is there no way, given a Schema with fields, to get the Schema of one of
> those fields? I can try to make a hack or something, but is there a
> reason you can't do Schema inner = input.getSchema(1)? That is, instead
> of getField, which returns a Schema.FieldSchema, a getSchema function that
> gives the actual schema of the given field.
>
> As always, I appreciate the help.
>
> 2011/1/10 Jonathan Coveney <[email protected]>
>
> > I was under the impression that for Bag->Bag functions, providing the
> > schema made things much faster?
> >
> >
> > 2011/1/10 Dmitriy Ryaboy <[email protected]>
> >
> >> Heck, if you know the schema at runtime, you could pass in a string
> >> describing the schema as another argument.
> >> Or pass it in during initialization:
> >>
> >> define udfWithSchema myUdf('a:int, b:chararray')
> >>
> >> What do you need the schema for, exactly?
> >>
> >> D
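
For the archive, a rough sketch of the constructor route Dmitriy suggests. Assumptions that are not from this thread: the class name MyUdf is hypothetical, Utils.getSchemaFromString is assumed to be available in the Pig version in use, and exec() is a pass-through placeholder.

```java
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;
import org.apache.pig.impl.util.Utils;

// Hypothetical UDF that receives its output schema as a constructor
// argument, e.g. DEFINE udfWithSchema MyUdf('a:int, b:chararray');
public class MyUdf extends EvalFunc<Tuple> {

    private final String schemaString;

    // Pig passes the string from the DEFINE statement to this constructor.
    public MyUdf(String schemaString) {
        this.schemaString = schemaString;
    }

    @Override
    public Schema outputSchema(Schema input) {
        try {
            // Assumed helper: parses a string such as "a:int, b:chararray".
            Schema parsed = Utils.getSchemaFromString(schemaString);
            // Wrap the parsed fields in a single tuple-typed field, since
            // this UDF returns one Tuple per call.
            return new Schema(
                new Schema.FieldSchema(null, parsed, DataType.TUPLE));
        } catch (Exception e) {
            throw new RuntimeException(
                "Could not parse schema: " + schemaString, e);
        }
    }

    @Override
    public Tuple exec(Tuple input) throws IOException {
        // Placeholder: a real UDF would build a tuple that matches the
        // declared schema; here the input is returned unchanged.
        return input;
    }
}
```

The trade-off is that the schema has to be known when the script is written, which is why this only helps if the number of extra fields is fixed.
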
> >>
> >> On Mon, Jan 10, 2011 at 10:36 AM, Jonathan Coveney <[email protected]> wrote:
> >>
> >> > I thought about that, but I do not know how long the tuple is. This isn't an
> >> > issue from a calculation perspective, I suppose, as long as you make sure
> >> > that prop is the first thing in the bag. But from a schema...hmm, I guess
> >> > you could just grab the schema of the other elements and build it
> >> > accordingly?
> >> >
> >> > 2011/1/10 Dmitriy Ryaboy <[email protected]>
> >> >
> >> > > Jonathan, can't you just pass the bag A in?
> >> > >
> >> > > On Mon, Jan 10, 2011 at 9:56 AM, Jonathan Coveney <[email protected]> wrote:
> >> > >
> >> > > > So I have a udf, let's call it myudf.bag2bag, which takes a bag which
> >> > > > contains "prop," and creates a new bag of tuples based on that.
> >> > > >
> >> > > > I have data in the form of
> >> > > >
> >> > > > id    prop    other1    other2
> >> > > >
> >> > > > If all I care about is running the udf, obviously I can do
> >> > > >
> >> > > > A = LOAD 'file' AS (id, prop, other1, other2);
> >> > > > B = GROUP A BY id;
> >> > > > C = FOREACH B GENERATE group, FLATTEN(myudf.bag2bag(A.prop));
> >> > > >
> >> > > > And all is fine
> >> > > >
> >> > > > But what do I do if I want to hold on to the other data, especially if
> >> > > > I don't know how much there will be (from a bag2bag perspective)?
> >> > > >
> >> > > > My thought is that in bag2bag, you can pass in a tuple of "extras,"
> >> > > > which you then pass back, i.e.
> >> > > >
> >> > > > C = FOREACH B GENERATE group, FLATTEN(myudf.bag2bag(A.prop, (A.other1, A.other2)));
> >> > > >
> >> > > > I'm just not sure how I would specify the schema for this, in such a way
> >> > > > that any number of entries could be in the tuple, and then you could just
> >> > > > sort of reference them later.
> >> > > >
> >> > > > Is this possible?
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>
>
