Maybe you need to do the steps below:

1) Swap key and value
2) Use the sortByKey API
3) Swap key and value back
4) Reduce the result to the top keys

http://stackoverflow.com/questions/29003246/how-to-achieve-sort-by-value-in-spark-java
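A rough, untested sketch of those steps in Java, assuming the same imports as your snippet and a JavaPairRDD<String, Integer> of per-key counts named counts (the name is just for illustration):

    JavaPairRDD<Integer, String> swapped =
        counts.mapToPair(t -> new Tuple2<Integer, String>(t._2, t._1));  // 1) swap key and value
    JavaPairRDD<Integer, String> sorted = swapped.sortByKey(false);      // 2) sort by count, descending
    JavaPairRDD<String, Integer> byCount =
        sorted.mapToPair(t -> new Tuple2<String, Integer>(t._2, t._1));  // 3) swap back
    List<Tuple2<String, Integer>> top = byCount.take(10);                // 4) keep only the top keys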
On Sun, Jul 19, 2015 at 5:48 PM, N B <nb.nos...@gmail.com> wrote:

> Hi Suyog,
>
> That code outputs the following:
>
> key2 val22 : 1
> key1 val1 : 2
> key2 val2 : 2
>
> while the output I want to achieve would have been (with your example):
>
> key1 : 2
> key2 : 2
>
> because there are 2 distinct types of values for each key (regardless of
> their actual duplicate counts, hence the use of the DISTINCT keyword in
> the query equivalent).
>
> Thanks
> Nikunj
>
>
> On Sun, Jul 19, 2015 at 2:37 PM, suyog choudhari <suyogchoudh...@gmail.com> wrote:
>
>> public static void main(String[] args) {
>>
>>     SparkConf sparkConf = new SparkConf().setAppName("CountDistinct");
>>
>>     JavaSparkContext jsc = new JavaSparkContext(sparkConf);
>>
>>     List<Tuple2<String, String>> list = new ArrayList<Tuple2<String, String>>();
>>
>>     list.add(new Tuple2<String, String>("key1", "val1"));
>>     list.add(new Tuple2<String, String>("key1", "val1"));
>>     list.add(new Tuple2<String, String>("key2", "val2"));
>>     list.add(new Tuple2<String, String>("key2", "val2"));
>>     list.add(new Tuple2<String, String>("key2", "val22"));
>>
>>     JavaPairRDD<String, Integer> rdd = jsc.parallelize(list)
>>         .mapToPair(t -> new Tuple2<String, Integer>(t._1 + " " + t._2, 1));
>>
>>     JavaPairRDD<String, Integer> rdd2 = rdd.reduceByKey((c1, c2) -> c1 + c2);
>>
>>     List<Tuple2<String, Integer>> output = rdd2.collect();
>>
>>     for (Tuple2<?, ?> tuple : output) {
>>         System.out.println(tuple._1() + " : " + tuple._2());
>>     }
>> }
>>
>> On Sun, Jul 19, 2015 at 2:28 PM, Jerry Lam <chiling...@gmail.com> wrote:
>>
>>> You mean this does not work?
>>>
>>> SELECT key, count(value) from table group by key
>>>
>>>
>>> On Sun, Jul 19, 2015 at 2:28 PM, N B <nb.nos...@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> How do I go about performing the equivalent of the following SQL clause
>>>> in Spark Streaming? I will be using this on a Windowed DStream.
>>>>
>>>> SELECT key, count(distinct(value)) from table group by key;
>>>>
>>>> so for example, given the following dataset in the table:
>>>>
>>>> key | value
>>>> ----+-------
>>>> k1  | v1
>>>> k1  | v1
>>>> k1  | v2
>>>> k1  | v3
>>>> k1  | v3
>>>> k2  | vv1
>>>> k2  | vv1
>>>> k2  | vv2
>>>> k2  | vv2
>>>> k2  | vv2
>>>> k3  | vvv1
>>>> k3  | vvv1
>>>>
>>>> the result will be:
>>>>
>>>> key | count
>>>> ----+-------
>>>> k1  | 3
>>>> k2  | 2
>>>> k3  | 1
>>>>
>>>> Thanks
>>>> Nikunj
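Coming back to the original count(distinct(value)) question: one possible (untested) approach is to deduplicate the (key, value) pairs first and only then count per key. A minimal sketch, assuming a JavaPairRDD<String, String> of (key, value) pairs named pairs (an illustrative name, not from the thread):

    JavaPairRDD<String, Integer> distinctCounts = pairs
        .distinct()                                             // drop duplicate (key, value) pairs
        .mapToPair(t -> new Tuple2<String, Integer>(t._1, 1))   // one count per distinct value
        .reduceByKey((c1, c2) -> c1 + c2);                      // sum per key -> count(distinct(value))

On a windowed pair DStream, I believe the same chain can be applied per batch via transformToPair, but I have not verified that.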