Pig generally takes CSV-style flat files as input; you then run join/group-by/sum/count operations and the like on the loaded data sets (aka relations).
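The script below filters by date with a conversion UDF (my_udf_convert_in_secs in the example), which is not a Pig built-in and would have to be written by hand. A minimal Java EvalFunc sketch of such a UDF could look like this (the class name, package, and yyyy-MM-dd input format are assumptions, not something from this thread):

package com.example.pig;                        // hypothetical package

import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Converts a yyyy-MM-dd date string into Unix time in seconds.
public class ConvertInSecs extends EvalFunc<Long> {
    // SimpleDateFormat is not thread-safe, so keep one instance per UDF object.
    private final SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");

    @Override
    public Long exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;                        // missing input -> null output
        }
        try {
            return fmt.parse((String) input.get(0)).getTime() / 1000L;
        } catch (ParseException e) {
            throw new IOException("Cannot parse date: " + input.get(0), e);
        }
    }
}

The jar would then be hooked into the script with something like REGISTER myudfs.jar; and DEFINE my_udf_convert_in_secs com.example.pig.ConvertInSecs(); (jar name and package again being placeholders).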
For Michel's example with the following data:

Affiliate / SaleDate / SaleAmount
mike / 2010-03-01 / 10.00
john / 2010-03-01 / 10.00

one can write a Pig script along these lines:

r1 = LOAD 'data.csv' USING PigStorage(',') AS (affiliate:chararray, saledate:chararray, amount:double);
r1 = FILTER r1 BY my_udf_convert_in_secs(saledate) > my_udf_convert_in_secs('2010-03-01')
     AND my_udf_convert_in_secs(saledate) < my_udf_convert_in_secs('2010-03-06');
r2 = GROUP r1 BY affiliate;
r3 = FOREACH r2 GENERATE group AS affiliate, SUM(r1.amount) AS totalrevenue;
r3 = ORDER r3 BY totalrevenue;
STORE r3 INTO 'output.csv';

Remember that it's only because of the additional sorting requirement that we are forced to use Pig; otherwise Lucene can do the job (except the sorting) much faster.

-Prasen

On Thu, Apr 1, 2010 at 9:40 PM, Jason Eacott <ja...@hardlight.com.au> wrote:
> Thanks for the ref - didn't know about Pig before.
> The language and approach look useful, so now I'm wondering if it
> couldn't be used across Lucene over Hadoop too. If data was indexed in
> Lucene and Pig knew that, then it could make for an interesting
> alternate Lucene query language.
>
> Could this work?
>
> prasenjit mukherjee wrote:
>> This looks like a use case more suited for Pig (over Hadoop).
>>
>> It could be difficult for Lucene to do sort and sum simultaneously, as
>> the sorting itself depends upon the summed value.
>>
>> On Thu, Apr 1, 2010 at 11:47 PM, Michel Nadeau <aka...@gmail.com> wrote:
>>> Well, that's my problem: we have a lot of records of all types
>>> (affiliates, sales), so looping through tons of records each time
>>> isn't possible.
>>>
>>> - Mike
>>> aka...@gmail.com