Is this just frequent items? https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L97
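
For context, a rough sketch of what freqItems already gives today (the column
names, data and support threshold below are illustrative only, and this assumes
a Spark 1.4+ sqlContext):

    import org.apache.spark.sql.DataFrame

    // Toy DataFrame with two categorical columns (made-up names).
    val df: DataFrame = sqlContext
      .createDataFrame(Seq(("red", "S"), ("red", "M"), ("blue", "S"), ("red", "S")))
      .toDF("colour", "size")

    // Approximate frequent items per column, for items with support >= 0.3.
    // Returns a single-row DataFrame with array columns
    // "colour_freqItems" and "size_freqItems".
    df.stat.freqItems(Seq("colour", "size"), 0.3).show()

Note it is approximate (it can return false positives), so it answers "which
values are frequent" rather than giving exact per-value counts.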
On Tue, Jul 21, 2015 at 7:39 AM, Ted Malaska <ted.mala...@cloudera.com> wrote:

> 100%, I would love to do it. Who is a good person to review the design
> with? All I need is a quick chat about the design and approach and I'll
> create the JIRA and push a patch.
>
> Ted Malaska
>
> On Tue, Jul 21, 2015 at 10:19 AM, Olivier Girardot <
> o.girar...@lateral-thoughts.com> wrote:
>
>> Hi Ted,
>> The TopNList would be great to see directly in the DataFrame API, and my
>> wish would be to be able to apply it to multiple columns at the same time
>> and get all these statistics. The .describe() function is close to what
>> we want to achieve; maybe we could try to enrich its output. Anyway, even
>> as a spark-package, if you could package your code for DataFrames, that
>> would be great.
>>
>> Regards,
>>
>> Olivier.
>>
>> 2015-07-21 15:08 GMT+02:00 Jonathan Winandy <jonathan.wina...@gmail.com>:
>>
>>> Ha ok!
>>>
>>> Then the generic part would have this signature:
>>>
>>> def countColsByValue(df: DataFrame): Map[String /* colname */, DataFrame]
>>>
>>> +1 for more work (blog / API) on data quality checks.
>>>
>>> Cheers,
>>> Jonathan
>>>
>>> TopCMSParams and some other monoids from Algebird are really cool for
>>> that:
>>>
>>> https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L590
>>>
>>> On 21 July 2015 at 13:40, Ted Malaska <ted.mala...@cloudera.com> wrote:
>>>
>>>> I'm guessing you want something like what I put in this blog post:
>>>>
>>>> http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/
>>>>
>>>> This is a very common use case. If there is a +1, I would love to add
>>>> it to DataFrames.
>>>>
>>>> Let me know,
>>>> Ted Malaska
>>>>
>>>> On Tue, Jul 21, 2015 at 7:24 AM, Olivier Girardot <
>>>> o.girar...@lateral-thoughts.com> wrote:
>>>>
>>>>> Yop,
>>>>> Actually the generic part does not work: countByValue on one column
>>>>> gives you the count for each value seen in that column. I would like a
>>>>> generic (multi-column) countByValue that gives me the same kind of
>>>>> output for each column, not treating each n-tuple of column values as
>>>>> the key (which is what the groupBy is doing by default).
>>>>>
>>>>> Regards,
>>>>>
>>>>> Olivier
>>>>>
>>>>> 2015-07-20 14:18 GMT+02:00 Jonathan Winandy <
>>>>> jonathan.wina...@gmail.com>:
>>>>>
>>>>>> Ahoy!
>>>>>>
>>>>>> Maybe you can get countByValue by using sql.GroupedData:
>>>>>>
>>>>>> import org.apache.spark.sql.{DataFrame, Row}
>>>>>> import org.apache.spark.sql.types.{StringType, StructField, StructType}
>>>>>>
>>>>>> // some DF
>>>>>> val df: DataFrame = sqlContext.createDataFrame(
>>>>>>   sc.parallelize(List("A", "B", "B", "A")).map(Row(_)),
>>>>>>   StructType(List(StructField("n", StringType))))
>>>>>>
>>>>>> df.groupBy("n").count().show()
>>>>>>
>>>>>> // generic: group by every column at once
>>>>>> def countByValueDf(df: DataFrame) = {
>>>>>>   val (h :: r) = df.columns.toList
>>>>>>   df.groupBy(h, r: _*).count()
>>>>>> }
>>>>>>
>>>>>> countByValueDf(df).show()
>>>>>>
>>>>>> Cheers,
>>>>>> Jon
>>>>>>
>>>>>> On 20 July 2015 at 11:28, Olivier Girardot <
>>>>>> o.girar...@lateral-thoughts.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> Is there any plan to add the countByValue function to the Spark SQL
>>>>>>> DataFrame API? Even
>>>>>>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78
>>>>>>> is using the RDD API right now, but for ML purposes, being able to
>>>>>>> get the most frequent categorical value on multiple columns would be
>>>>>>> very useful.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> --
>>>>>>> *Olivier Girardot* | Associé
>>>>>>> o.girar...@lateral-thoughts.com
>>>>>>> +33 6 24 09 17 94
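
For what it's worth, a minimal sketch of the per-column counting that
Jonathan's proposed signature describes (countColsByValue is a hypothetical
helper here, not an existing DataFrame method):

    import org.apache.spark.sql.DataFrame

    // One value-count DataFrame per column, instead of grouping on the
    // tuple of all columns at once.
    def countColsByValue(df: DataFrame): Map[String, DataFrame] =
      df.columns.map(c => c -> df.groupBy(c).count()).toMap

    // Usage: materializing each entry triggers one aggregation per column.
    // countColsByValue(df).foreach { case (col, counts) => counts.show() }

Each column is aggregated independently, which matches the per-column output
Olivier is asking for rather than the n-tuple keys a multi-column groupBy
produces.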