Then I think the array_contains is your best bet. See my example query below. I used a subquery (always handy) also I replaced count(activityID) in your SQL with count(1) The results returned are exactly the same (at least in my data set) and for some reason, my cluster seemed to run faster with count(1) I am not sure why. That and it didn't make sense to me to count the field I am grouping on, it's confusing, are you looking for the count of that grouped field or the count of the number of grouped fields (obviously the first is returned, but from a "reading" point of view a person reading/viewing your query may be confused).
Original select activityId, count(activityId), *find_in_set("CCC", collect_set(msgBody))* from ActivityStream group by activityId; Suggested: select * from ( select activityId, count(1) as activityid_count, collect_set(msgBody) as msg_bodies from ActivityStream group by activityId ) b where array_contains(msg_bodies, "CCC") > 0 On Tue, Jul 17, 2012 at 4:32 AM, Tharindu Mathew <mcclou...@gmail.com>wrote: > I'm basically trying to search for a string (ex: "CCC") for a set of > strings (ex: * *collect_set(msgBody)*)* that comes as a result of a group > by query. > > > On Tue, Jul 17, 2012 at 8:50 AM, John Omernik <j...@omernik.com> wrote: > >> Not sure what you are trying to do, but you may want to check out the >> array_contains function. Also, if you are using Hive 9 you can use the >> concat_ws() function. This is taken from a google search: >> >> select concat_ws(‘.’, array(‘www’,’apache’,’org’)) from src limit 1; >> www.apache.org >> >> >> https://cwiki.apache.org/Hive/presentations.data/WhatsNewInHive090HadoopSummit2012BoF.pdf >> >> On the array_contains: >> >> >> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-CollectionFunctions >> >> boolean array_contains(Array<T>, value)Returns TRUE if the array >> contains value >> >> >> >> On Mon, Jul 16, 2012 at 3:56 PM, Tharindu Mathew <mcclou...@gmail.com>wrote: >> >>> Hi everyone, >>> >>> I'd like to do $subject and was approaching it with the following query: >>> >>> select activityId, count(activityId), *find_in_set("CCC", >>> collect_set(msgBody))* from ActivityStream group by activityId; >>> >>> But find_in_set doesn't seem to accept arrays. Is there a way to cast >>> this string array into a string list or a string so I can conduct a search? >>> Maybe, there's another way to do this. >>> >>> Thanks in advance. >>> >>> -- >>> Regards, >>> >>> Tharindu >>> >>> blog: http://mackiemathew.com/ >>> >>> >> > > > -- > Regards, > > Tharindu > > blog: http://mackiemathew.com/ > >