Maybe you need to do the steps below:

1) Swap key and value
2) Use the sortByKey API
3) Swap key and value back
4) Reduce the result to the top keys

http://stackoverflow.com/questions/29003246/how-to-achieve-sort-by-value-in-spark-java
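A rough, untested sketch of those steps in Java, assuming the same imports as your snippet and a JavaPairRDD<String, Integer> of per-key counts named counts (the name is just for illustration):

    JavaPairRDD<Integer, String> swapped =
        counts.mapToPair(t -> new Tuple2<Integer, String>(t._2, t._1));  // 1) swap key and value
    JavaPairRDD<Integer, String> sorted = swapped.sortByKey(false);      // 2) sort by count, descending
    JavaPairRDD<String, Integer> byCount =
        sorted.mapToPair(t -> new Tuple2<String, Integer>(t._2, t._1));  // 3) swap back
    List<Tuple2<String, Integer>> top = byCount.take(10);                // 4) keep only the top keys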
On Sun, Jul 19, 2015 at 5:48 PM, N B <nb.nos...@gmail.com> wrote:

> Hi Suyog,
>
> That code outputs the following:
>
> key2 val22 : 1
> key1 val1 : 2
> key2 val2 : 2
>
> while the output I want to achieve would have been (with your example):
>
> key1 : 2
> key2 : 2
>
> because there are 2 distinct types of values for each key (regardless of
> their actual duplicate counts, hence the use of the DISTINCT keyword in
> the query equivalent).
>
> Thanks
> Nikunj
>
>
> On Sun, Jul 19, 2015 at 2:37 PM, suyog choudhari <suyogchoudh...@gmail.com> wrote:
>
>> public static void main(String[] args) {
>>
>>     SparkConf sparkConf = new SparkConf().setAppName("CountDistinct");
>>
>>     JavaSparkContext jsc = new JavaSparkContext(sparkConf);
>>
>>     List<Tuple2<String, String>> list = new ArrayList<Tuple2<String, String>>();
>>
>>     list.add(new Tuple2<String, String>("key1", "val1"));
>>     list.add(new Tuple2<String, String>("key1", "val1"));
>>     list.add(new Tuple2<String, String>("key2", "val2"));
>>     list.add(new Tuple2<String, String>("key2", "val2"));
>>     list.add(new Tuple2<String, String>("key2", "val22"));
>>
>>     JavaPairRDD<String, Integer> rdd = jsc.parallelize(list)
>>         .mapToPair(t -> new Tuple2<String, Integer>(t._1 + " " + t._2, 1));
>>
>>     JavaPairRDD<String, Integer> rdd2 = rdd.reduceByKey((c1, c2) -> c1 + c2);
>>
>>     List<Tuple2<String, Integer>> output = rdd2.collect();
>>
>>     for (Tuple2<?, ?> tuple : output) {
>>         System.out.println(tuple._1() + " : " + tuple._2());
>>     }
>> }
>>
>> On Sun, Jul 19, 2015 at 2:28 PM, Jerry Lam <chiling...@gmail.com> wrote:
>>
>>> You mean this does not work?
>>>
>>> SELECT key, count(value) from table group by key
>>>
>>>
>>> On Sun, Jul 19, 2015 at 2:28 PM, N B <nb.nos...@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> How do I go about performing the equivalent of the following SQL clause
>>>> in Spark Streaming? I will be using this on a Windowed DStream.
>>>>
>>>> SELECT key, count(distinct(value)) from table group by key;
>>>>
>>>> so for example, given the following dataset in the table:
>>>>
>>>> key | value
>>>> ----+-------
>>>> k1  | v1
>>>> k1  | v1
>>>> k1  | v2
>>>> k1  | v3
>>>> k1  | v3
>>>> k2  | vv1
>>>> k2  | vv1
>>>> k2  | vv2
>>>> k2  | vv2
>>>> k2  | vv2
>>>> k3  | vvv1
>>>> k3  | vvv1
>>>>
>>>> the result will be:
>>>>
>>>> key | count
>>>> ----+-------
>>>> k1  | 3
>>>> k2  | 2
>>>> k3  | 1
>>>>
>>>> Thanks
>>>> Nikunj
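Coming back to the original count(distinct(value)) question: one possible (untested) approach is to deduplicate the (key, value) pairs first and only then count per key. A minimal sketch, assuming a JavaPairRDD<String, String> of (key, value) pairs named pairs (an illustrative name, not from the thread):

    JavaPairRDD<String, Integer> distinctCounts = pairs
        .distinct()                                             // drop duplicate (key, value) pairs
        .mapToPair(t -> new Tuple2<String, Integer>(t._1, 1))   // one count per distinct value
        .reduceByKey((c1, c2) -> c1 + c2);                      // sum per key -> count(distinct(value))

On a windowed pair DStream, I believe the same chain can be applied per batch via transformToPair, but I have not verified that.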