Thanks Gianmarco!
Here is the final version:

by_clusters = GROUP sample_data by (cluster_id, terms);
by_clusters_terms_count = FOREACH by_clusters GENERATE FLATTEN(group)
as (cluster_id, terms), COUNT($1);
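
For anyone finding this in the archive, the two statements above can be placed in a complete script along the following lines. This is only a minimal sketch: the input path 'sample_data.tsv', the tab delimiter, and the aliases term_count and ordered are assumptions for illustration and are not from the thread. The final ORDER is optional; it is added only to approximate the "most frequent first within each cluster" layout of the desired output.

```pig
-- Assumed input: a tab-separated file with one (cluster_id, terms) pair per line.
-- The file name and delimiter are hypothetical.
sample_data = LOAD 'sample_data.tsv' USING PigStorage('\t')
              AS (cluster_id:bytearray, terms:chararray);

-- Group on the composite key so each distinct (cluster_id, terms) pair gets its own bag.
by_clusters = GROUP sample_data BY (cluster_id, terms);

-- FLATTEN turns the group tuple back into two columns; COUNT gives the bag size.
by_clusters_terms_count = FOREACH by_clusters GENERATE
    FLATTEN(group) AS (cluster_id, terms),
    COUNT(sample_data) AS term_count;

-- Optional: sort by cluster, then by descending frequency.
ordered = ORDER by_clusters_terms_count BY cluster_id ASC, term_count DESC;

DUMP ordered;
```

Naming the bag explicitly (COUNT(sample_data)) is equivalent to COUNT($1) here and may be easier to read later.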

cheers

Arian Pasquali
http://about.me/arianpasquali


2014-07-29 13:23 GMT+01:00 Gianmarco De Francisci Morales <[email protected]>:

> Try this:
>
> by_clusters = GROUP sample_data by (cluster_id, terms);
> by_clusters_terms_count = FOREACH by_clusters GENERATE FLATTEN(group),
> COUNT(sample_data)
> as count;
>
> Cheers,
>
> --
> Gianmarco
>
>
> On 29 July 2014 11:49, Arian Pasquali <[email protected]> wrote:
>
> > Hi,
> >
> > I'm having trouble with a simple task that I believe someone out there
> > must have already solved.
> >
> > I'm trying to group and count the frequency of terms for each group in
> > Pig Latin, but I'm having trouble figuring out how to do it.
> >
> > I have a collection of objects with the following schema:
> >
> > {cluster_id: bytearray,terms: chararray}
> >
> > And here are some samples:
> >
> > (10, smerter)
> > (10, graviditeten)
> > (10, smerter)
> > (10, smerter)
> > (10, udemærket)
> > (20, eis feuer)
> > (20, herunterladen schau)
> > (20, download gratis)
> > (20, download gratis)
> > (30, anschauen kinofilm)
> > (30, kauf rechnung)
> > (30, kauf rechnung)
> > (30, versandkostenfreie lieferung)
> > (30, kostenlose)
> > (30, kostenlose)
> > (30, kostenlose)
> >
> > The result I'm trying to get is something like this:
> >
> > (10, smerter, 3)
> > (10, graviditeten, 1)
> > (10, udemærket, 1)
> > (20, download gratis, 2)
> > (20, eis feuer, 1)
> > (20, herunterladen schau, 1)
> > (30, kostenlose, 3)
> > (30, kauf rechnung, 2)
> > (30, anschauen kinofilm, 1)
> > (30, versandkostenfreie lieferung, 1)
> >
> > What would be the best way to do that? The following code groups by id
> > and counts all the terms in each cluster, but I want to count the
> > occurrences of each term within each cluster.
> >
> > by_clusters = GROUP sample_data by cluster_id;
> > by_clusters_terms_count = FOREACH by_clusters GENERATE group as
> > cluster_id, COUNT($1);
> >
> > If I do the grouping like this, I end up with an object with the
> > following schema:
> >
> > by_clusters: {group: bytearray,sample_data: {(cluster_id: bytearray,terms: chararray)}}
> >
> > Now I need to actually count the terms inside the 'sample_data' bag.
> > I'm thinking about a nested FOREACH, but I still haven't figured out how
> > to apply it in this case. The code would be something like the
> > following:
> >
> > result = FOREACH by_clusters {
> >
> > --count terms here, I don't know how
> >
> > -- compiler gives me an error here
> > c = GROUP $1 BY terms; --
> > d = FOREACH c GENERATE COUNT(b), group;
> >
> > GENERATE cluster_id, d;
> > }
> >
> > Error I get:
> >
> > ERROR 1200: Syntax error, unexpected symbol at or near '$1'
> >
> > I think I'm close, but I'm unable to solve it. I don't believe I should
> > have to write a UDF in this case.
> >
> >
> > Arian
> >
>
