Hi Pig users,
Is there an easy/efficient way to sample an inner bag? For example, with input
in a relation like:
(id1,att1,{(a,0.01),(b,0.02),(x,0.999749968742)})
(id1,att2,{(a,0.03),(b,0.04),(x,0.998749217772)})
(id2,att1,{(b,0.05),(c,0.06),(x,0.996945334509)})
I’d like to sample 1/3 of the elements of each bag, and get something like
(ignoring the non-determinism):
(id1,att1,{(x,0.999749968742)})
(id1,att2,{(b,0.04)})
(id2,att1,{(b,0.05)})
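
To pin down the semantics I have in mind, here is a rough sketch in Python (not Pig) of per-bag sampling. The helper name `sample_bag` and the rule that a non-empty bag always keeps at least one element are assumptions made for illustration only; they match the example output above but are not an existing Pig feature.

```python
import random

def sample_bag(bag, fraction, rng=random):
    """Return a random subset holding roughly `fraction` of the bag's elements.

    Assumption (illustrative): a non-empty bag keeps at least one element.
    """
    if not bag:
        return []
    k = max(1, round(len(bag) * fraction))
    return rng.sample(bag, k)

# The three input rows from the example above.
rows = [
    ("id1", "att1", [("a", 0.01), ("b", 0.02), ("x", 0.999749968742)]),
    ("id1", "att2", [("a", 0.03), ("b", 0.04), ("x", 0.998749217772)]),
    ("id2", "att1", [("b", 0.05), ("c", 0.06), ("x", 0.996945334509)]),
]

# Sample each inner bag independently; the outer (id, att) keys are untouched.
sampled = [(id_, att, sample_bag(pairs, 1 / 3)) for id_, att, pairs in rows]
```

In other words, every (id, att) row should survive, and only the contents of its bag should shrink.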
I have a roundabout approach that seems to work, using FLATTEN + GROUP, but it
looks ugly to me:
tfidf1 = load '$tfidf' as (id: chararray,
                           att: chararray,
                           pairs: {pair: (word: chararray, value: double)});
flat_tfidf = foreach tfidf1 generate id, att, FLATTEN(pairs);
sample_flat_tfidf = sample flat_tfidf 0.33;
tfidf2 = group sample_flat_tfidf by (id, att);
tfidf = foreach tfidf2 {
    pairs = foreach sample_flat_tfidf generate pairs::word, pairs::value;
    generate group.id, group.att, pairs;
};
Can someone suggest a better way to do this? Many thanks!
William F Dowling
Senior Technologist
Thomson Reuters