Thanks Gianmarco!!

Here is the final version:

by_clusters = GROUP sample_data by (cluster_id, terms);
by_clusters_terms_count = FOREACH by_clusters GENERATE FLATTEN(group) as (cluster_id, terms), COUNT($1);

cheers

Arian Pasquali
http://about.me/arianpasquali

2014-07-29 13:23 GMT+01:00 Gianmarco De Francisci Morales <[email protected]>:

> Try this:
>
> by_clusters = GROUP sample_data by (cluster_id, terms);
> by_clusters_terms_count = FOREACH by_clusters GENERATE FLATTEN(group),
>     COUNT(sample_data) as count;
>
> Cheers,
>
> --
> Gianmarco
>
>
> On 29 July 2014 11:49, Arian Pasquali <[email protected]> wrote:
>
> > Hi,
> >
> > I'm having trouble with a simple task that I believe someone out there
> > must have already solved.
> >
> > I'm trying to group and count the frequency of terms for each group in
> > Pig Latin, but I can't figure out how to do it.
> >
> > I have a collection of objects with the following schema:
> >
> > {cluster_id: bytearray,terms: chararray}
> >
> > And here are some samples:
> >
> > (10, smerter)
> > (10, graviditeten)
> > (10, smerter)
> > (10, smerter)
> > (10, udemærket)
> > (20, eis feuer)
> > (20, herunterladen schau)
> > (20, download gratis)
> > (20, download gratis)
> > (30, anschauen kinofilm)
> > (30, kauf rechnung)
> > (30, kauf rechnung)
> > (30, versandkostenfreie lieferung)
> > (30, kostenlose)
> > (30, kostenlose)
> > (30, kostenlose)
> >
> > The result I'm trying to get is something like this:
> >
> > (10, smerter, 3)
> > (10, graviditeten, 1)
> > (10, udemærket, 1)
> > (20, download gratis, 2)
> > (20, eis feuer, 1)
> > (20, herunterladen schau, 1)
> > (30, kostenlose, 3)
> > (30, kauf rechnung, 2)
> > (30, anschauen kinofilm, 1)
> > (30, versandkostenfreie lieferung, 1)
> >
> > What would be the best way to do that? The following code groups by id
> > and counts all the terms in each cluster, but I want to count each
> > distinct term within each cluster.
> >
> > by_clusters = GROUP sample_data by cluster_id;
> > by_clusters_terms_count = FOREACH by_clusters GENERATE group as
> >     cluster_id, COUNT($1);
> >
> > If I do the grouping like this, I end up with an object with the
> > following schema:
> >
> > by_clusters: {group: bytearray,sample_data: {(cluster_id:
> > bytearray,terms: chararray)}}
> >
> > Now I need to actually count the terms inside the 'sample_data' bag.
> > I'm thinking about a nested FOREACH, but I still don't see how to apply
> > it in this case. The code would be something like the following:
> >
> > result = FOREACH by_clusters {
> >
> >     --count terms here, I don't know how
> >
> >     -- compiler gives me an error here
> >     c = GROUP $1 BY terms; --
> >     d = FOREACH c GENERATE COUNT(b), group;
> >
> >     GENERATE cluster_id, d;
> > }
> >
> > Error I get:
> >
> > ERROR 1200: Syntax error, unexpected symbol at or near '$1
> >
> > Finally, I think I'm close, but I'm unable to solve it. I don't believe
> > I'll have to write a UDF in this case.
> >
> >
> > Arian
> >
>
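
A note on the nested FOREACH attempt quoted above: as far as I know, Pig accepts only a limited set of operators inside a nested FOREACH block (CROSS, DISTINCT, FILTER, FOREACH, LIMIT and ORDER BY). GROUP is not among them, which would explain the syntax error near '$1'. Grouping on the composite key (cluster_id, terms) avoids the nested block entirely. The sketch below repeats that approach and adds an ORDER step so the output is sorted by cluster and then by descending count, as in the expected result. The input path 'sample_data.tsv', the tab delimiter, and the aliases term_count and ordered are assumptions made for illustration; they do not come from the thread.

```pig
-- Hypothetical input: one record per line, cluster_id <TAB> terms.
-- The path and the delimiter are assumptions, not from the thread.
sample_data = LOAD 'sample_data.tsv' USING PigStorage('\t')
              AS (cluster_id:bytearray, terms:chararray);

-- Group on the composite key, so each group holds one (cluster_id, terms) pair.
by_clusters = GROUP sample_data BY (cluster_id, terms);

-- FLATTEN turns the group tuple back into two top-level fields;
-- COUNT counts the records collected in each group's bag.
by_clusters_terms_count = FOREACH by_clusters GENERATE
    FLATTEN(group) AS (cluster_id, terms),
    COUNT(sample_data) AS term_count;

-- Optional: sort by cluster, then by most frequent term first.
ordered = ORDER by_clusters_terms_count BY cluster_id ASC, term_count DESC;

DUMP ordered;
```

Since cluster_id is loaded as a bytearray here, the ORDER step compares it as raw bytes; casting it to int first may be preferable if numeric ordering matters.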
