Pig generally takes CSV-style flat files as input; you then run join/group-by/sum/count operations and the like on the loaded data sets (aka relations).
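The script below filters by date with a conversion UDF (my_udf_convert_in_secs in the example), which is not a Pig built-in and would have to be written by hand. A minimal Java EvalFunc sketch of such a UDF could look like this (the class name, package, and yyyy-MM-dd input format are assumptions, not something from this thread):

package com.example.pig;                        // hypothetical package

import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Converts a yyyy-MM-dd date string into Unix time in seconds.
public class ConvertInSecs extends EvalFunc<Long> {
    // SimpleDateFormat is not thread-safe, so keep one instance per UDF object.
    private final SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");

    @Override
    public Long exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;                        // missing input -> null output
        }
        try {
            return fmt.parse((String) input.get(0)).getTime() / 1000L;
        } catch (ParseException e) {
            throw new IOException("Cannot parse date: " + input.get(0), e);
        }
    }
}

The jar would then be hooked into the script with something like REGISTER myudfs.jar; and DEFINE my_udf_convert_in_secs com.example.pig.ConvertInSecs(); (jar name and package again being placeholders).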
For Michel's example with the following data:

Affiliate / SaleDate / SaleAmount
mike / 2010-03-01 / 10.00
john / 2010-03-01 / 10.00

one can write a Pig script along these lines:

r1 = LOAD 'data.csv' USING PigStorage(',') AS (affiliate:chararray, saledate:chararray, amount:double);
r1 = FILTER r1 BY my_udf_convert_in_secs(saledate) > my_udf_convert_in_secs('2010-03-01')
     AND my_udf_convert_in_secs(saledate) < my_udf_convert_in_secs('2010-03-06');
r2 = GROUP r1 BY affiliate;
r3 = FOREACH r2 GENERATE group AS affiliate, SUM(r1.amount) AS totalrevenue;
r3 = ORDER r3 BY totalrevenue;
STORE r3 INTO 'output.csv';

Remember that it's only because of the additional sorting requirement that we are forced to use Pig; otherwise Lucene can do the job (except the sorting) much faster.

-Prasen

On Thu, Apr 1, 2010 at 9:40 PM, Jason Eacott <ja...@hardlight.com.au> wrote:
> Thanks for the ref - didn't know about Pig before.
> The language and approach look useful, so now I'm wondering if it
> couldn't be used across Lucene over Hadoop too. If data was indexed in
> Lucene and Pig knew that, then it could make for an interesting
> alternate Lucene query language.
>
> Could this work?
>
> prasenjit mukherjee wrote:
>> This looks like a use case more suited for Pig (over Hadoop).
>>
>> It could be difficult for Lucene to do sort and sum simultaneously, as
>> the sorting itself depends upon the summed value.
>>
>> On Thu, Apr 1, 2010 at 11:47 PM, Michel Nadeau <aka...@gmail.com> wrote:
>>> Well, that's my problem: we have a lot of records of all types
>>> (affiliates, sales), so looping through tons of records each time
>>> isn't possible.
>>>
>>> - Mike
>>> aka...@gmail.com