Re: Working with an unknown number of values

Alan Gates Tue, 10 May 2011 14:34:56 -0700

TOKENIZE takes a string and returns a bag. Its issue is that right now it only allows you to split on whitespace. It would make sense to generalize this to take a delimiter.


Alan.


On May 7, 2011, at 7:55 PM, Jacob Perkins wrote:

Dmitriy,

  I see your point. It would definitely be nice to have a builtin for
returning a bag though. I'd actually be happy if
TOBAG(FLATTEN(STRSPLIT(X,','))) worked.

--jacob
@thedatachef

On Sat, 2011-05-07 at 18:41 -0700, Dmitriy Ryaboy wrote:
FWIW -- the reason STRSPLIT returns a Tuple is that the more common
case is thought to be splitting a string of a known format and trying
to get some part of it.

so, "foreach address_book generate STRSPLIT(phone_number, '-') as
(area_code, top_3, bottom_4);"
RegexExtractAll (whatever it's called these days) should return a bag, iirc.
D
On Fri, May 6, 2011 at 2:59 PM, jacob <jacob.a.perk...@gmail.com> wrote:
On Fri, 2011-05-06 at 15:38 -0600, Christian wrote:
#1) Let's say you are tracking messages and extracting the hashtags from
the message and storing them as one field (#hash1#hash2#hash3). This means
you might have a line that looks something like the following:
     2343    2011-05-06T03:04:00.000Z    username    some+message+goes+here#with+#hash+#tags    #with#hash#tags   some other info
How can I get the # of tweets per hash tag? Also, how can I get the # of
tweets per user per hash tag?
I know I can use the STRSPLIT function to split on '#'. That will give me a
bag of hash tags. How can I then group by these such that each hash tag has
a set of tweets?
You will need to 'FLATTEN' the bag of hashtags then do a 'GROUP BY' on
the hashtag itself.
If each message has an unknown number of hashtags, will a 'FLATTEN' mean an
unknown # of fields? If so, how do I know which field to group by? I don't
want to group by messages that have the exact hash tags. I want all
messages that have one of the hash tags.
Oh, that's right, STRSPLIT (rather uselessly) yields a nested tuple and
NOT a bag. If you could get a bag then you could do the following (I'm
throwing out some fields for now):

A = LOAD 'tweets_and_meta' AS (text:chararray, hashtags:chararray);
B = FOREACH A GENERATE text, FLATTEN(MySplittingUDF(hashtags)) AS hashtag;
C = GROUP B BY hashtag;

Then C will contain a key (the hashtag) and a bag containing all the
tweets with that hashtag. You'll have to write 'MySplittingUDF' yourself
to do the same as STRSPLIT but return a bag instead.

ie.

#foobar tweet text,#foobar
this tweet has #two #hashtags,#two#hashtags
another #foobar tweet,#foobar

will yield:

#foobar,   {(#foobar tweet text, #foobar),(another #foobar tweet, #foobar)}
#two,      {(this tweet has #two #hashtags, #two)}
#hashtags, {(this tweet has #two #hashtags, #hashtags)}
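
The split, FLATTEN and GROUP steps above can be sketched outside Pig to
check the expected result. This is plain Python, not a Pig UDF: the names
split_to_bag and flatten_and_group are made up for illustration, and a real
'MySplittingUDF' would still have to be written against Pig's UDF API,
which is not shown here.

```python
from collections import defaultdict


def split_to_bag(field, delimiter="#"):
    """Hypothetical stand-in for 'MySplittingUDF'.

    Splits '#two#hashtags' into a list of one-field tuples (a "bag"),
    keeping the leading delimiter on each hashtag.
    """
    if field is None:
        return []
    return [(delimiter + part,) for part in field.split(delimiter) if part]


def flatten_and_group(rows):
    """Emulate FLATTEN followed by GROUP BY hashtag on (text, hashtags) rows."""
    groups = defaultdict(list)
    for text, hashtags in rows:
        # FLATTEN: one output row per hashtag in the bag.
        for (hashtag,) in split_to_bag(hashtags):
            # GROUP: collect the flattened rows under their hashtag.
            groups[hashtag].append((text, hashtag))
    return dict(groups)


rows = [
    ("#foobar tweet text", "#foobar"),
    ("this tweet has #two #hashtags", "#two#hashtags"),
    ("another #foobar tweet", "#foobar"),
]
grouped = flatten_and_group(rows)
```

Each message contributes one row per hashtag, so the number of hashtags
per message never shows up as a number of fields; it only changes how many
rows the message produces before the grouping.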
But now I want to end up with something like the following:
2011-05-01 DIRECTIVE1 32423 DIRECTIVE2 3433 DIRECTIVE3 1983
If I knew the directives ahead of time, I know I can do something like the
following:

D = GROUP C BY date;

E = FOREACH D {
    DIRECTIVE1 = FILTER type_count by directive == 'DIRECTIVE1';
    DIRECTIVE2 = FILTER type_count by directive == 'DIRECTIVE2';
    DIRECTIVE3 = FILTER type_count by directive == 'DIRECTIVE3';
    GENERATE group,
        'DIRECTIVE1', COUNT(DIRECTIVE1.date),
        'DIRECTIVE2', COUNT(DIRECTIVE2.date),
        'DIRECTIVE3', COUNT(DIRECTIVE3.date);
}
But how do I do this w/o having to hardcode the filters? Am I thinking
about this all wrong?
It's really a matter of how you structure your data ahead of time.
Imagine the data looking like this instead (call it X):

201101,directive1
201101,directive1
201101,directive2
201101,directive2
201101,directive2
201101,directive3
201102,directive2
201102,directive4
201103,directive1

This is how my data looks (row and column wise)
then, a simple:

Y = GROUP X BY (date,directive);
Z = FOREACH Y GENERATE FLATTEN(group) AS (date,directive), COUNT(X) AS num_occurrences;

would result in:

201101,directive1,2
201101,directive2,3
201101,directive3,1
201102,directive2,1
201102,directive4,1
201103,directive1,1

At least, that's what it _seems_ like you're asking for.
I've gotten that far. I'm actually asking for being able to put those
into columns and not rows.
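
One way to picture the rows-to-columns step without hardcoding the
directives is to keep, per date, a collection of (directive, count) pairs
and only lay them out as columns when writing the output. The sketch below
is plain Python over the sample data X, not Pig; pivot_counts is a made-up
name. In Pig the analogous step would presumably be a second GROUP of Z by
date, which leaves a bag of (directive, count) pairs per date rather than a
variable number of named columns.

```python
from collections import Counter, defaultdict


def pivot_counts(rows):
    """Count (date, directive) pairs, then emit one line per date.

    Each line is 'date directive count directive count ...', with the
    directives discovered from the data instead of being hardcoded.
    """
    counts = Counter(rows)  # same result as GROUP BY (date, directive) + COUNT
    by_date = defaultdict(dict)
    for (date, directive), n in counts.items():
        by_date[date][directive] = n

    lines = []
    for date in sorted(by_date):
        cells = [date]
        for directive in sorted(by_date[date]):
            cells.extend([directive, str(by_date[date][directive])])
        lines.append(" ".join(cells))
    return lines


X = [
    ("201101", "directive1"), ("201101", "directive1"),
    ("201101", "directive2"), ("201101", "directive2"),
    ("201101", "directive2"), ("201101", "directive3"),
    ("201102", "directive2"), ("201102", "directive4"),
    ("201103", "directive1"),
]
for line in pivot_counts(X):
    print(line)
```

Note that the lines have different lengths, because each date only lists
the directives that actually occurred on it.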
--jacob
@thedatachef

Thanks Jacob!
-Christian
Thanks very much for your help,
Christian
