Is this just frequent items? https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L97
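
For context, a rough sketch of what freqItems already gives today (the column
names, data and support threshold below are illustrative only, and this assumes
a Spark 1.4+ sqlContext):

    import org.apache.spark.sql.DataFrame

    // Toy DataFrame with two categorical columns (made-up names).
    val df: DataFrame = sqlContext
      .createDataFrame(Seq(("red", "S"), ("red", "M"), ("blue", "S"), ("red", "S")))
      .toDF("colour", "size")

    // Approximate frequent items per column, for items with support >= 0.3.
    // Returns a single-row DataFrame with array columns
    // "colour_freqItems" and "size_freqItems".
    df.stat.freqItems(Seq("colour", "size"), 0.3).show()

Note it is approximate (it can return false positives), so it answers "which
values are frequent" rather than giving exact per-value counts.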
On Tue, Jul 21, 2015 at 7:39 AM, Ted Malaska <ted.mala...@cloudera.com> wrote:

> 100%, I would love to do it. Who is a good person to review the design
> with? All I need is a quick chat about the design and approach and I'll
> create the JIRA and push a patch.
>
> Ted Malaska
>
> On Tue, Jul 21, 2015 at 10:19 AM, Olivier Girardot <
> o.girar...@lateral-thoughts.com> wrote:
>
>> Hi Ted,
>> The TopNList would be great to see directly in the DataFrame API, and my
>> wish would be to be able to apply it to multiple columns at the same time
>> and get all these statistics. The .describe() function is close to what
>> we want to achieve; maybe we could try to enrich its output. Anyway, even
>> as a spark-package, if you could package your code for DataFrames, that
>> would be great.
>>
>> Regards,
>>
>> Olivier.
>>
>> 2015-07-21 15:08 GMT+02:00 Jonathan Winandy <jonathan.wina...@gmail.com>:
>>
>>> Ha ok!
>>>
>>> Then the generic part would have this signature:
>>>
>>> def countColsByValue(df: DataFrame): Map[String /* colname */, DataFrame]
>>>
>>> +1 for more work (blog / API) on data quality checks.
>>>
>>> Cheers,
>>> Jonathan
>>>
>>> TopCMSParams and some other monoids from Algebird are really cool for
>>> that:
>>>
>>> https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L590
>>>
>>> On 21 July 2015 at 13:40, Ted Malaska <ted.mala...@cloudera.com> wrote:
>>>
>>>> I'm guessing you want something like what I put in this blog post:
>>>>
>>>> http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/
>>>>
>>>> This is a very common use case. If there is a +1, I would love to add
>>>> it to DataFrames.
>>>>
>>>> Let me know,
>>>> Ted Malaska
>>>>
>>>> On Tue, Jul 21, 2015 at 7:24 AM, Olivier Girardot <
>>>> o.girar...@lateral-thoughts.com> wrote:
>>>>
>>>>> Yop,
>>>>> Actually the generic part does not work: countByValue on one column
>>>>> gives you the count for each value seen in that column. I would like a
>>>>> generic (multi-column) countByValue that gives me the same kind of
>>>>> output for each column, not treating each n-tuple of column values as
>>>>> the key (which is what the groupBy is doing by default).
>>>>>
>>>>> Regards,
>>>>>
>>>>> Olivier
>>>>>
>>>>> 2015-07-20 14:18 GMT+02:00 Jonathan Winandy <
>>>>> jonathan.wina...@gmail.com>:
>>>>>
>>>>>> Ahoy!
>>>>>>
>>>>>> Maybe you can get countByValue by using sql.GroupedData:
>>>>>>
>>>>>> import org.apache.spark.sql.{DataFrame, Row}
>>>>>> import org.apache.spark.sql.types.{StringType, StructField, StructType}
>>>>>>
>>>>>> // some DF
>>>>>> val df: DataFrame = sqlContext.createDataFrame(
>>>>>>   sc.parallelize(List("A", "B", "B", "A")).map(Row(_)),
>>>>>>   StructType(List(StructField("n", StringType))))
>>>>>>
>>>>>> df.groupBy("n").count().show()
>>>>>>
>>>>>> // generic: group by every column at once
>>>>>> def countByValueDf(df: DataFrame) = {
>>>>>>   val (h :: r) = df.columns.toList
>>>>>>   df.groupBy(h, r: _*).count()
>>>>>> }
>>>>>>
>>>>>> countByValueDf(df).show()
>>>>>>
>>>>>> Cheers,
>>>>>> Jon
>>>>>>
>>>>>> On 20 July 2015 at 11:28, Olivier Girardot <
>>>>>> o.girar...@lateral-thoughts.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> Is there any plan to add the countByValue function to the Spark SQL
>>>>>>> DataFrame API? Even
>>>>>>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78
>>>>>>> is using the RDD API right now, but for ML purposes, being able to
>>>>>>> get the most frequent categorical value on multiple columns would be
>>>>>>> very useful.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> --
>>>>>>> *Olivier Girardot* | Associé
>>>>>>> o.girar...@lateral-thoughts.com
>>>>>>> +33 6 24 09 17 94
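
For what it's worth, a minimal sketch of the per-column counting that
Jonathan's proposed signature describes (countColsByValue is a hypothetical
helper here, not an existing DataFrame method):

    import org.apache.spark.sql.DataFrame

    // One value-count DataFrame per column, instead of grouping on the
    // tuple of all columns at once.
    def countColsByValue(df: DataFrame): Map[String, DataFrame] =
      df.columns.map(c => c -> df.groupBy(c).count()).toMap

    // Usage: materializing each entry triggers one aggregation per column.
    // countColsByValue(df).foreach { case (col, counts) => counts.show() }

Each column is aggregated independently, which matches the per-column output
Olivier is asking for rather than the n-tuple keys a multi-column groupBy
produces.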