TOKENIZE takes a string and returns a bag. Its limitation is that right now it
only allows you to split on whitespace. It would make sense to
generalize this to take a delimiter.
Alan.
On May 7, 2011, at 7:55 PM, Jacob Perkins wrote:
Dmitriy,
I see your point. It would definitely be nice to have a builtin for
returning a bag though. I'd actually be happy if
TOBAG(FLATTEN(STRSPLIT(X,','))) worked.
--jacob
@thedatachef
On Sat, 2011-05-07 at 18:41 -0700, Dmitriy Ryaboy wrote:
FWIW -- the reason STRSPLIT returns a Tuple is that the more common
case is thought to be splitting a string of a known format and trying
to get some part of it.
so, "foreach address_book generate STRSPLIT(phone_number, '-') as
(area_code, top_3, bottom_4);"
RegexExtractAll (whatever it's called these days) should return a
bag, iirc.
D
On Fri, May 6, 2011 at 2:59 PM, jacob <jacob.a.perk...@gmail.com>
wrote:
On Fri, 2011-05-06 at 15:38 -0600, Christian wrote:
#1) Let's say you are tracking messages and extracting the hash tags from
the message and storing them as one field (#hash1#hash2#hash3). This means
you might have a line that looks something like the following:
2343 2011-05-06T03:04:00.000Z username some+message+goes+here#with+#hash+#tags #with#hash#tags some other info
How can I get the # of tweets per hash tag? Also, how can I get the # of
tweets per user per hash tag?
I know I can use the STRSPLIT function to split on '#'. That will give me a
bag of hash tags. How can I then group by these such that each hash tag has
a set of tweets?
You will need to 'FLATTEN' the bag of hashtags then do a 'GROUP BY' on
the hashtag itself.
If each message has an unknown number of hashtags, will a 'FLATTEN' give me
an unknown # of fields? If so, how do I know which field to group by? I
don't want to group by messages that have the exact same hash tags. I want
all messages that have one of the hash tags.
Oh, that's right, STRSPLIT (rather uselessly) yields a nested tuple and
NOT a bag. If you could get a bag then you could do the following (I'm
throwing out some fields for now):
A = LOAD 'tweets_and_meta' AS (text:chararray, hashtags:chararray);
B = FOREACH A GENERATE text, FLATTEN(MySplittingUDF(hashtags)) AS hashtag;
C = GROUP B BY hashtag;
Then C will contain a key (the hashtag) and a bag containing all the
tweets with that hashtag. You'll have to write 'MySplittingUDF' yourself;
it should do the same as STRSPLIT but return a bag instead.
i.e.
#foobar tweet text,#foobar
this tweet has #two #hashtags,#two#hashtags
another #foobar tweet,#foobar
will yield:
#foobar, {(#foobar tweet text, #foobar),(another #foobar tweet, #foobar)}
#two, {(this tweet has #two #hashtags, #two)}
#hashtags, {(this tweet has #two #hashtags, #hashtags)}
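For what it's worth, here is a rough sketch of what 'MySplittingUDF' could look like as a Python (Jython) UDF, since Pig accepts those as well as Java ones. The function name split_hashtags and the schema string are made up for illustration, and I'm assuming Pig supplies the outputSchema decorator when the script is registered as a Jython UDF; the fallback definition is only there so the file also runs as plain Python.

```python
# Hypothetical stand-in for 'MySplittingUDF'; not an existing Pig builtin.
# Assumption: Pig provides outputSchema for Jython UDF scripts. The fallback
# below is a no-op so the file can also be run outside of Pig.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorate(func):
            return func
        return decorate


@outputSchema("hashtags:bag{t:tuple(hashtag:chararray)}")
def split_hashtags(field):
    """Split a field like '#two#hashtags' into a bag of one-field tuples."""
    if field is None:
        return []
    # A bag is returned as a list of tuples. Empty pieces (such as the one
    # before the leading '#') are dropped, and the '#' is put back so the
    # values match the example output above.
    return [("#" + tag,) for tag in field.split("#") if tag]
```

Assuming the file is saved as split_hashtags.py (a hypothetical name), it would be registered with something like REGISTER 'split_hashtags.py' USING jython AS myudfs; and then called as FLATTEN(myudfs.split_hashtags(hashtags)) in place of MySplittingUDF.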
But now I want to end up with something like the following:
2011-05-01 DIRECTIVE1 32423 DIRECTIVE2 3433 DIRECTIVE3 1983
If I knew the directives ahead of time, I know I can do something like the
following:
D = GROUP C BY date;
E = FOREACH D {
    DIRECTIVE1 = FILTER type_count by directive == 'DIRECTIVE1';
    DIRECTIVE2 = FILTER type_count by directive == 'DIRECTIVE2';
    DIRECTIVE3 = FILTER type_count by directive == 'DIRECTIVE3';
    GENERATE group, 'DIRECTIVE1', COUNT(DIRECTIVE1.date),
        'DIRECTIVE2', COUNT(DIRECTIVE2.date),
        'DIRECTIVE3', COUNT(DIRECTIVE3.date);
}
But how do I do this w/o having to hardcode the filters? Am I thinking
about this all wrong?
It's really a matter of how you structure your data ahead of time.
Imagine the data looking like this instead (call it X):
201101,directive1
201101,directive1
201101,directive2
201101,directive2
201101,directive2
201101,directive3
201102,directive2
201102,directive4
201103,directive1
This is how my data looks (row and column wise)
then, a simple:
Y = GROUP X BY (date,directive);
Z = FOREACH Y GENERATE FLATTEN(group) AS (date,directive), COUNT(X) AS num_occurrences;
would result in:
201101,directive1,2
201101,directive2,3
201101,directive3,1
201102,directive2,1
201102,directive4,1
201103,directive1,1
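As a sanity check on those numbers, here is a small plain-Python emulation of the GROUP BY (date,directive) plus COUNT step, run over the sample rows of X. It only mimics the semantics outside of Pig; the helper name count_by_key is made up for this sketch.

```python
from collections import Counter

# The sample rows of X from above, as (date, directive) pairs.
X = [
    ("201101", "directive1"),
    ("201101", "directive1"),
    ("201101", "directive2"),
    ("201101", "directive2"),
    ("201101", "directive2"),
    ("201101", "directive3"),
    ("201102", "directive2"),
    ("201102", "directive4"),
    ("201103", "directive1"),
]


def count_by_key(rows):
    """Emulate GROUP X BY (date, directive) followed by COUNT(X)."""
    counts = Counter(rows)
    # Sort so the rows come out in a stable, readable order.
    return sorted((date, directive, n) for (date, directive), n in counts.items())


Z = count_by_key(X)
for date, directive, n in Z:
    print("%s,%s,%d" % (date, directive, n))
```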
At least, that's what it _seems_ like you're asking for.
I've gotten that far. What I'm actually asking for is a way to put those
into columns and not rows.
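Nobody in the thread gives a Pig answer to this, so take the following only as an illustration of the reshaping being asked for, in plain Python rather than Pig. It takes the (date, directive, count) rows from the counting step and folds them into one wide row per date without hardcoding the directive names. The function name pivot_counts is hypothetical, and dates that lack a directive simply get a shorter row.

```python
def pivot_counts(rows):
    """Turn (date, directive, count) rows into one wide row per date."""
    by_date = {}
    for date, directive, count in rows:
        by_date.setdefault(date, []).append((directive, count))

    wide = []
    for date in sorted(by_date):
        row = [date]
        # Alternate directive name and count, in directive order.
        for directive, count in sorted(by_date[date]):
            row.extend([directive, count])
        wide.append(tuple(row))
    return wide


rows = [
    ("201101", "directive1", 2),
    ("201101", "directive2", 3),
    ("201101", "directive3", 1),
    ("201102", "directive2", 1),
    ("201102", "directive4", 1),
    ("201103", "directive1", 1),
]

wide = pivot_counts(rows)
for row in wide:
    print(" ".join(str(field) for field in row))
```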
--jacob
@thedatachef
Thanks Jacob!
-Christian
Thanks very much for your help,
Christian