I'm guessing you want something like what I put in this blog post:
http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/
This is a very common use case. If there is a +1, I would love to add it to DataFrames. Let me know.

Ted Malaska

On Tue, Jul 21, 2015 at 7:24 AM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:

> Yop,
> actually the generic part does not work: countByValue on one column gives
> you the count for each value seen in that column. I would like a generic
> (multi-column) countByValue that gives me the same kind of output for each
> column, not one that treats each n-tuple of column values as the key
> (which is what groupBy does by default).
>
> Regards,
>
> Olivier
>
> 2015-07-20 14:18 GMT+02:00 Jonathan Winandy <jonathan.wina...@gmail.com>:
>
>> Ahoy!
>>
>> Maybe you can get countByValue by using sql.GroupedData:
>>
>> // some DF
>> val df: DataFrame = sqlContext.createDataFrame(
>>   sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)),
>>   StructType(List(StructField("n", StringType))))
>>
>> df.groupBy("n").count().show()
>>
>> // generic
>> def countByValueDf(df: DataFrame) = {
>>   val h :: r = df.columns.toList
>>   df.groupBy(h, r: _*).count()
>> }
>>
>> countByValueDf(df).show()
>>
>> Cheers,
>> Jon
>>
>> On 20 July 2015 at 11:28, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
>>
>>> Hi,
>>> Is there any plan to add a countByValue function to the Spark SQL
>>> DataFrame API? Even
>>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78
>>> is using the RDD API right now, but for ML purposes, being able to get
>>> the most frequent categorical value over multiple columns would be very
>>> useful.
>>>
>>> Regards,
>>>
>>> --
>>> *Olivier Girardot* | Associate
>>> o.girar...@lateral-thoughts.com
>>> +33 6 24 09 17 94
>>
>
> --
> *Olivier Girardot* | Associate
> o.girar...@lateral-thoughts.com
> +33 6 24 09 17 94
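
For reference, here is a minimal sketch of the per-column countByValue Olivier describes, written against the Spark 1.x DataFrame API used in the thread (sqlContext, unionAll). The helper name countByValuePerColumn is hypothetical and not part of Spark: it runs one groupBy per column and unions the results, so each column's values are counted independently instead of as n-tuples across all columns.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

// Hypothetical helper, not part of Spark: counts the values in each
// column independently, unlike df.groupBy(allColumns).count(), which
// counts n-tuples of column values.
def countByValuePerColumn(df: DataFrame): DataFrame = {
  df.columns
    .map { c =>
      // One aggregation per column; cast to string so the unioned
      // "value" column has a single, consistent type.
      df.groupBy(df(c).cast("string").as("value"))
        .count()
        .select(lit(c).as("column"), col("value"), col("count"))
    }
    .reduce(_ unionAll _) // unionAll in Spark 1.x; union in Spark 2+
}

// With Jonathan's sample df (column "n" holding A, B, B, A), this yields
// the rows (n, A, 2) and (n, B, 2), in no guaranteed order:
countByValuePerColumn(df).show()

One groupBy per column means one pass over the data per column; for the most-frequent-category use case in StringIndexer this is the straightforward approach, at the cost of scanning the input once per column unless the DataFrame is cached first.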