I think it's interesting to see what motivates different companies to choose Pig, what issues they have encountered and how they solved them, the general architecture, etc.
There are a few slide decks floating around the internet about how Pig is being used in production at Yahoo, Twitter, LinkedIn, Mendeley, Meebo, and a bunch of others; you can try looking at them for inspiration.

Curious about what you mean when you say "serious data" :)

D

On Mon, Jan 10, 2011 at 5:41 PM, Jonathan Coveney <[email protected]> wrote:

> If we ever do anything really worth writing about, maybe I'll ask the
> higher-ups if we can do a case study... I'm not sure what sort of use
> information would best benefit the Pig community, any thoughts?
>
> But I would love to give back, and show that Pig can handle some serious
> data.
>
> 2011/1/10 Dmitriy Ryaboy <[email protected]>
>
> > Absolutely.
> > Would love to hear what you are doing once it goes in production, by the
> > way.
> >
> > D
> >
> > On Mon, Jan 10, 2011 at 2:59 PM, Jonathan Coveney <[email protected]> wrote:
> >
> > > Thank you Julien.
> > >
> > > Once again I want to thank everyone for their help... I know that I use
> > > the listserv a lot, but you guys have really helped me turn Pig into a
> > > powerful tool in my workplace, and I know that Pig benefits from being
> > > used on large production systems.
> > >
> > > Jon
> > >
> > > 2011/1/10 Julien Le Dem <[email protected]>
> > >
> > > > Hi Jonathan,
> > > > It's input.getField(1).schema
> > > > You can get the schema of your input by overriding Schema
> > > > outputSchema(Schema) but it looks like you figured that out.
> > > > outputSchema is called on the client side, so if you want to make use
> > > > of the input schema in exec(Tuple) you need to pass it in the UDF
> > > > context:
> > > > Properties properties =
> > > >     UDFContext.getUDFContext().getUDFProperties(this.getClass());
> > > > properties.put("inputSchema", inputSchema);
> > > > Julien
> > > >
> > > > On 1/10/11 1:25 PM, "Jonathan Coveney" <[email protected]> wrote:
> > > >
> > > > I was able to get it to work (I just didn't override the schema), but
> > > > I'd rather like it to have the schema so that describes and whatnot
> > > > work.
> > > >
> > > > Is there no way, given a Schema with fields, to get the Schema of one
> > > > of those fields? I can try to make a hack or something, but is there a
> > > > limitation as to why you can't do Schema inner = input.getSchema(1)
> > > > (instead of getField, which returns a Schema.FieldSchema, a getSchema
> > > > function which gave the actual schema of the given object?).
> > > >
> > > > As always, I appreciate the help.
> > > >
> > > > 2011/1/10 Jonathan Coveney <[email protected]>
> > > >
> > > > > I was under the impression that for Bag->Bag functions, providing
> > > > > the schema made things much faster?
> > > > >
> > > > > 2011/1/10 Dmitriy Ryaboy <[email protected]>
> > > > >
> > > > >> Heck, if you know the schema at runtime, you could pass in a string
> > > > >> describing the schema as another argument.
> > > > >> Or pass it in during initialization:
> > > > >>
> > > > >> define udfWithSchema myUdf('a:int, b:chararray')
> > > > >>
> > > > >> What do you need the schema for, exactly?
> > > > >>
> > > > >> D
> > > > >>
> > > > >> On Mon, Jan 10, 2011 at 10:36 AM, Jonathan Coveney <[email protected]> wrote:
> > > > >>
> > > > >> > I thought about that, but I do not know how long the tuple is.
> > > > >> > This isn't an issue from a calculation perspective, I suppose, as
> > > > >> > long as you make sure that prop is the first thing in the bag.
> > > > >> > But from a schema...hmm, I guess you could just grab the schema
> > > > >> > of the other elements and build it accordingly?
> > > > >> >
> > > > >> > 2011/1/10 Dmitriy Ryaboy <[email protected]>
> > > > >> >
> > > > >> > > Jonathan, can't you just pass the bag A in?
> > > > >> > >
> > > > >> > > On Mon, Jan 10, 2011 at 9:56 AM, Jonathan Coveney <[email protected]> wrote:
> > > > >> > >
> > > > >> > > > So I have a udf, let's call it myudf.bag2bag, which takes a
> > > > >> > > > bag which contains "prop," and creates a new bag of tuples
> > > > >> > > > based on that.
> > > > >> > > >
> > > > >> > > > I have data in the form of
> > > > >> > > >
> > > > >> > > > id prop other1 other2
> > > > >> > > >
> > > > >> > > > If all I care about is running the udf, obviously I can do
> > > > >> > > >
> > > > >> > > > A = LOAD 'file' AS (id, prop, other1, other2);
> > > > >> > > > B = GROUP A BY id;
> > > > >> > > > C = FOREACH B GENERATE group, FLATTEN(myudf.bag2bag(A.prop));
> > > > >> > > >
> > > > >> > > > And all is fine.
> > > > >> > > >
> > > > >> > > > But what do I do if I want to hold on to the other data,
> > > > >> > > > especially if you don't know how much there will be (from a
> > > > >> > > > bag2bag perspective)?
> > > > >> > > >
> > > > >> > > > My thought is that in bag2bag, you can pass in a tuple of
> > > > >> > > > "extras," which you then pass back, i.e.
> > > > >> > > >
> > > > >> > > > C = FOREACH B GENERATE group, FLATTEN(myudf.bag2bag(A.prop, (A.other1, A.other2)));
> > > > >> > > >
> > > > >> > > > I'm just not sure how I would specify the schema for this, in
> > > > >> > > > such a way that any number of entries could be in the tuple,
> > > > >> > > > and then you could just sort of reference them later.
> > > > >> > > >
> > > > >> > > > Is this possible?
