Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Ted Malaska
I added the following JIRA: https://issues.apache.org/jira/browse/SPARK-9237 Please help me get it assigned to me. Thanks. Ted Malaska On Tue, Jul 21, 2015 at 7:53 PM, Ted Malaska wrote: > Cool I will make a jira after I check in to my hotel. And try to get a > patch early next week. > On

Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Ted Malaska
Cool I will make a jira after I check in to my hotel. And try to get a patch early next week. On Jul 21, 2015 5:15 PM, "Olivier Girardot" wrote: > yes and freqItems does not give you an ordered count (right ?) + the > threshold makes it difficult to calibrate it + we noticed some strange > behav

Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Olivier Girardot
yes and freqItems does not give you an ordered count (right ?) + the threshold makes it difficult to calibrate it + we noticed some strange behaviour when testing it on small datasets. 2015-07-21 20:30 GMT+02:00 Ted Malaska : > Look at the implementation for frequently items. It is a different f

Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Ted Malaska
Look at the implementation for frequent items. It is different from a true count. On Jul 21, 2015 1:19 PM, "Reynold Xin" wrote: > Is this just frequent items? > > > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L97 > > > >

Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Reynold Xin
Is this just frequent items? https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L97 On Tue, Jul 21, 2015 at 7:39 AM, Ted Malaska wrote: > 100% I would love to do it. Who a good person to review the design with. > All I need i

Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Ted Malaska
100% I would love to do it. Who is a good person to review the design with? All I need is a quick chat about the design and approach, and I'll create the jira and push a patch. Ted Malaska On Tue, Jul 21, 2015 at 10:19 AM, Olivier Girardot < o.girar...@lateral-thoughts.com> wrote: > Hi Ted, > The T

Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Olivier Girardot
Hi Ted, The TopNList would be great to see directly in the DataFrame API, and my wish would be to be able to apply it on multiple columns at the same time and get all these statistics. The .describe() function is close to what we want to achieve, maybe we could try to enrich its output. Anyway, even

Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Jonathan Winandy
Ha ok ! Then the generic part would have this signature: def countColsByValue(df: DataFrame): Map[String /* colname */, DataFrame] +1 for more work (blog / api) for data quality checks. Cheers, Jonathan TopCMSParams and some other monoids from Algebird are really cool for that : https://github.com
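A minimal sketch of a helper with the signature proposed above, assuming Spark SQL is on the classpath. countColsByValue is only the name suggested in this message, not an existing Spark API, and the body is one possible implementation built on groupBy(...).count(), the same call used elsewhere in this thread.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper (not part of Spark): one value-count DataFrame per column.
// Each result has two columns: the original column and a "count" column.
def countColsByValue(df: DataFrame): Map[String, DataFrame] =
  df.columns.map { colName =>
    // groupBy on a single column, so the key is that column's value only,
    // never a tuple of values across columns.
    colName -> df.groupBy(colName).count()
  }.toMap
```

Each returned DataFrame is evaluated lazily and on its own, so a wide table means one aggregation per column. That is simple, but probably not what a built-in multi-column version would want to do.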

Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Ted Malaska
I'm guessing you want something like what I put in this blog post. http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/ This is a very common use case. If there is a +1 I would love to add it to dataframes. Let me know Ted Malaska On Tue, Jul 21, 2

Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Olivier Girardot
Yop, actually the generic part does not work. The countByValue on one column gives you the count for each value seen in the column. I would like a generic (multi-column) countByValue to give me the same kind of output for each column, not considering each n-tuple of each column value as the key (wh
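To make the difference concrete, here is a small sketch in plain Scala collections (no Spark), contrasting counts keyed by the whole row with counts computed per column. The object name, column names and data are made up for illustration.

```scala
object CountByValueSketch {
  // Illustrative two-column table; names and values are assumptions.
  val columns: Seq[String] = Seq("color", "size")
  val rows: Seq[Seq[String]] = Seq(
    Seq("red", "S"),
    Seq("red", "M"),
    Seq("blue", "S")
  )

  // Tuple-keyed counting: the whole row (n-tuple) is the key.
  val byRow: Map[Seq[String], Int] =
    rows.groupBy(identity).map { case (row, group) => row -> group.size }

  // Per-column counting: one value -> count map for each column.
  val byColumn: Map[String, Map[String, Int]] =
    columns.zipWithIndex.map { case (name, i) =>
      name -> rows.groupBy(_(i)).map { case (value, group) => value -> group.size }
    }.toMap
}
```

With this data, byRow has three keys with a count of 1 each, while byColumn("color") reports red twice and blue once, which is the per-column output being asked for.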

Re: countByValue on dataframe with multiple columns

2015-07-20 Thread Jonathan Winandy
Ahoy ! Maybe you can get countByValue by using sql.GroupedData :

    // some DF
    val df: DataFrame = sqlContext.createDataFrame(
      sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)),
      StructType(List(StructField("n", StringType))))
    df.groupBy("n").count().show()

    // generic
    def countByValueDf(df:

countByValue on dataframe with multiple columns

2015-07-20 Thread Olivier Girardot
Hi, Is there any plan to add the countByValue function to the Spark SQL DataFrame? Even https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78 is using the RDD part right now, but for ML purposes, being able to get the most frequent categor