Hi Nikunj,

Sorry, I totally misread your question. I think you need to first groupByKey (to get all values of the same key together), then follow with mapValues (put the values into a set and take its size, since you want a distinct count).
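Something along these lines should do it. This is a rough, untested sketch using the same sample data as Suyog's example below; the class and variable names are just placeholders:

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class DistinctCountPerKey {

    public static void main(String[] args) {

        SparkConf sparkConf = new SparkConf().setAppName("DistinctCountPerKey");
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);

        // Same sample data as in the example further down the thread.
        List<Tuple2<String, String>> list = Arrays.asList(
                new Tuple2<>("key1", "val1"),
                new Tuple2<>("key1", "val1"),
                new Tuple2<>("key2", "val2"),
                new Tuple2<>("key2", "val2"),
                new Tuple2<>("key2", "val22"));

        JavaPairRDD<String, String> pairs = jsc.parallelizePairs(list);

        // Group all values of the same key together, then collapse each
        // group into the number of distinct values it contains.
        JavaPairRDD<String, Integer> distinctCounts = pairs
                .groupByKey()
                .mapValues(values -> {
                    Set<String> distinct = new HashSet<>();
                    values.forEach(distinct::add);
                    return distinct.size();
                });

        // Prints "key1 : 1" and "key2 : 2" (in some order).
        for (Tuple2<String, Integer> tuple : distinctCounts.collect()) {
            System.out.println(tuple._1() + " : " + tuple._2());
        }

        jsc.stop();
    }
}

Since this operates on a plain RDD, on your windowed DStream you would presumably apply it per batch inside transform()/foreachRDD. One caveat: groupByKey materializes every value for a key, which can get expensive for hot keys. Deduplicating the (key, value) pairs first and then doing a plain count per key should scale better, e.g.:

        JavaPairRDD<String, Integer> alternative = pairs
                .distinct()
                .mapValues(v -> 1)
                .reduceByKey((c1, c2) -> c1 + c2);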
HTH,

Jerry

Sent from my iPhone

> On 19 Jul, 2015, at 8:48 pm, N B <nb.nos...@gmail.com> wrote:
>
> Hi Suyog,
>
> That code outputs the following:
>
> key2 val22 : 1
> key1 val1 : 2
> key2 val2 : 2
>
> while the output I want to achieve would have been (with your example):
>
> key1 : 1
> key2 : 2
>
> because I want the number of distinct values per key, regardless of
> their actual duplicate counts... hence the use of the DISTINCT keyword
> in the query equivalent.
>
> Thanks
> Nikunj
>
>> On Sun, Jul 19, 2015 at 2:37 PM, suyog choudhari <suyogchoudh...@gmail.com> wrote:
>>
>> public static void main(String[] args) {
>>
>>     SparkConf sparkConf = new SparkConf().setAppName("CountDistinct");
>>     JavaSparkContext jsc = new JavaSparkContext(sparkConf);
>>
>>     List<Tuple2<String, String>> list = new ArrayList<Tuple2<String, String>>();
>>     list.add(new Tuple2<String, String>("key1", "val1"));
>>     list.add(new Tuple2<String, String>("key1", "val1"));
>>     list.add(new Tuple2<String, String>("key2", "val2"));
>>     list.add(new Tuple2<String, String>("key2", "val2"));
>>     list.add(new Tuple2<String, String>("key2", "val22"));
>>
>>     JavaPairRDD<String, Integer> rdd = jsc.parallelize(list)
>>         .mapToPair(t -> new Tuple2<String, Integer>(t._1 + " " + t._2, 1));
>>
>>     JavaPairRDD<String, Integer> rdd2 = rdd.reduceByKey((c1, c2) -> c1 + c2);
>>
>>     List<Tuple2<String, Integer>> output = rdd2.collect();
>>
>>     for (Tuple2<?, ?> tuple : output) {
>>         System.out.println(tuple._1() + " : " + tuple._2());
>>     }
>> }
>>
>>> On Sun, Jul 19, 2015 at 2:28 PM, Jerry Lam <chiling...@gmail.com> wrote:
>>>
>>> You mean this does not work?
>>>
>>> SELECT key, count(value) from table group by key
>>>
>>>> On Sun, Jul 19, 2015 at 2:28 PM, N B <nb.nos...@gmail.com> wrote:
>>>>
>>>> Hello,
>>>>
>>>> How do I go about performing the equivalent of the following SQL clause
>>>> in Spark Streaming? I will be using this on a windowed DStream.
>>>>
>>>> SELECT key, count(distinct(value)) from table group by key;
>>>>
>>>> so for example, given the following dataset in the table:
>>>>
>>>>  key | value
>>>> -----+-------
>>>>  k1  | v1
>>>>  k1  | v1
>>>>  k1  | v2
>>>>  k1  | v3
>>>>  k1  | v3
>>>>  k2  | vv1
>>>>  k2  | vv1
>>>>  k2  | vv2
>>>>  k2  | vv2
>>>>  k2  | vv2
>>>>  k3  | vvv1
>>>>  k3  | vvv1
>>>>
>>>> the result will be:
>>>>
>>>>  key | count
>>>> -----+-------
>>>>  k1  | 3
>>>>  k2  | 2
>>>>  k3  | 1
>>>>
>>>> Thanks
>>>> Nikunj