I added the following JIRA: https://issues.apache.org/jira/browse/SPARK-9237
Please help me get it assigned to myself, thanks.

Ted Malaska

On Tue, Jul 21, 2015 at 7:53 PM, Ted Malaska <ted.mala...@cloudera.com> wrote:

Cool, I will make a JIRA after I check in to my hotel, and try to get a patch out early next week.

On Jul 21, 2015 5:15 PM, "Olivier Girardot" <o.girar...@lateral-thoughts.com> wrote:

Yes, and freqItems does not give you an ordered count (right?). Also, the threshold makes it difficult to calibrate, and we noticed some strange behaviour when testing it on small datasets.

2015-07-21 20:30 GMT+02:00 Ted Malaska <ted.mala...@cloudera.com>:

Look at the implementation of frequent items; it is different from a true count.

On Jul 21, 2015 1:19 PM, "Reynold Xin" <r...@databricks.com> wrote:

Is this just frequent items?
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L97

On Tue, Jul 21, 2015 at 7:39 AM, Ted Malaska <ted.mala...@cloudera.com> wrote:

100%, I would love to do it. Who would be a good person to review the design with? All I need is a quick chat about the design and approach, and I'll create the JIRA and push a patch.

Ted Malaska

On Tue, Jul 21, 2015 at 10:19 AM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:

Hi Ted,
The TopNList would be great to see directly in the DataFrame API, and my wish would be to be able to apply it to multiple columns at the same time and get all these statistics. The .describe() function is close to what we want to achieve; maybe we could try to enrich its output. Anyway, even as a spark-package, if you could package your code for DataFrames, that would be great.

Regards,

Olivier.

2015-07-21 15:08 GMT+02:00 Jonathan Winandy <jonathan.wina...@gmail.com>:

Ha, OK!

Then the generic part would have this signature:

def countColsByValue(df: DataFrame): Map[String /* colname */, DataFrame]

+1 for more work (blog / API) on data quality checks.

Cheers,
Jonathan

TopCMSParams and some other monoids from Algebird are really cool for that:
https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L590

On 21 July 2015 at 13:40, Ted Malaska <ted.mala...@cloudera.com> wrote:

I'm guessing you want something like what I put in this blog post:
http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/

This is a very common use case. If there is a +1, I would love to add it to DataFrames.

Let me know,
Ted Malaska

On Tue, Jul 21, 2015 at 7:24 AM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:

Yop,
Actually the generic part does not work: countByValue on one column gives you the count for each value seen in that column. I would like a generic (multi-column) countByValue that gives the same kind of output for each column separately, rather than treating each n-tuple of column values as the key (which is what the groupBy above does by default).

Regards,

Olivier
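A minimal sketch of the per-column countByValue Olivier is asking for, following the countColsByValue signature Jonathan proposed above; this is an illustration against the Spark 1.4-era DataFrame API, not code from the eventual patch:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.desc

// Group by each column independently and count, ordered by descending
// frequency, so each column gets its own (value, count) DataFrame.
// Note: this launches one Spark job per column.
def countColsByValue(df: DataFrame): Map[String /* colname */, DataFrame] =
  df.columns.map { c =>
    c -> df.groupBy(c).count().orderBy(desc("count"))
  }.toMap

// e.g. countColsByValue(df)("n").show()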
2015-07-20 14:18 GMT+02:00 Jonathan Winandy <jonathan.wina...@gmail.com>:

Ahoy!

Maybe you can get countByValue by using sql.GroupedData:

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// some DF
val df: DataFrame = sqlContext.createDataFrame(
  sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)),
  StructType(List(StructField("n", StringType))))

df.groupBy("n").count().show()

// generic: group by every column at once and count
def countByValueDf(df: DataFrame) = {
  val (h :: r) = df.columns.toList
  df.groupBy(h, r: _*).count()
}

countByValueDf(df).show()

Cheers,
Jon

On 20 July 2015 at 11:28, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:

Hi,
Is there any plan to add the countByValue function to the Spark SQL DataFrame API? Even https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78 is using the RDD API right now, but for ML purposes, being able to get the most frequent categorical value on multiple columns would be very useful.

Regards,

--
Olivier Girardot | Associé
o.girar...@lateral-thoughts.com
+33 6 24 09 17 94
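To round out the freqItems discussion above, a short sketch contrasting the existing approximate df.stat.freqItems call with an exact ordered count; the sample data and the column name "n" are made up for illustration:

import org.apache.spark.sql.functions.desc

val df = sqlContext.createDataFrame(
  sc.parallelize(List("A", "B", "B", "A", "B")).map(Tuple1.apply)
).toDF("n")

// Approximate: a single pass, but it returns an unordered array of
// candidate items with no counts, and false positives are possible.
df.stat.freqItems(Seq("n"), 0.4).show()

// Exact: requires a shuffle, but yields true counts ordered by frequency.
df.groupBy("n").count().orderBy(desc("count")).show()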