Is your comparison function an equality test, or is there some transformation that could be applied to hdata and skey to turn it into one? If so, you could use a semi join instead of the cross, which should be much more efficient.
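
A rough sketch of what that could look like, assuming the comparison really does reduce to value == skey and that the small relation fits in memory. As far as I know, Pig Latin has no dedicated semi-join keyword, so this emulates one with an inner join against the distinct keys, followed by a projection of the huge side only. The alias small_keys is made up for illustration.

-- Sketch only: assumes F(hdata, skey) is plain equality on value and skey.
huge = load 'mydata' as (key:chararray, value:chararray);
small = load 'smalldata' as (skey:chararray);
-- De-duplicate the small side so each huge row is emitted at most once.
small_keys = distinct small;
-- 'replicated' keeps the last-listed relation in memory (assumes it is small enough).
joined = join huge by value, small_keys by skey using 'replicated';
-- Keep only the columns of huge, which gives semi-join behavior.
filtered = foreach joined generate huge::key as key, huge::value as value;

If the small relation does not fit in memory, dropping the using 'replicated' clause falls back to the default join.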

Alan.

On Apr 14, 2011, at 8:21 PM, Aniket Mokashi wrote:

Hi,

What would be the best way to write this script?
I have two datasets: huge (hkey, hdata) and small (skey). I want to keep
only the rows of the huge dataset for which F(hdata, skey) is true.
Please advise.

For example,
huge = load 'mydata' as (key:chararray, value:chararray);
small = load 'smalldata' as (skey:chararray);
h_s_cross = cross huge, small;
-- keep only the rows where the predicate holds (CONTAINS stands in for F, assumed to be a boolean UDF)
filtered = filter h_s_cross by CONTAINS(value, skey);

Thanks,
Aniket

