I'm guessing you want something like what I put in this blog post:
http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/
This is a very common use case. If there is a +1, I would love to add it to DataFrames. Let me know.

Ted Malaska

On Tue, Jul 21, 2015 at 7:24 AM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:

> Yop,
> actually the generic part does not work: countByValue on one column gives
> you the count for each value seen in that column. I would like a generic
> (multi-column) countByValue that gives me the same kind of output for each
> column, not one that treats each n-tuple of column values as the key
> (which is what groupBy does by default).
>
> Regards,
>
> Olivier
>
> 2015-07-20 14:18 GMT+02:00 Jonathan Winandy <jonathan.wina...@gmail.com>:
>
>> Ahoy!
>>
>> Maybe you can get countByValue by using sql.GroupedData:
>>
>> // some DF
>> val df: DataFrame = sqlContext.createDataFrame(
>>   sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)),
>>   StructType(List(StructField("n", StringType))))
>>
>> df.groupBy("n").count().show()
>>
>> // generic
>> def countByValueDf(df: DataFrame) = {
>>   val h :: r = df.columns.toList
>>   df.groupBy(h, r: _*).count()
>> }
>>
>> countByValueDf(df).show()
>>
>> Cheers,
>> Jon
>>
>> On 20 July 2015 at 11:28, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
>>
>>> Hi,
>>> Is there any plan to add a countByValue function to the Spark SQL
>>> DataFrame API? Even
>>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78
>>> is using the RDD API right now, but for ML purposes, being able to get
>>> the most frequent categorical value over multiple columns would be very
>>> useful.
>>>
>>> Regards,
>>>
>>> --
>>> *Olivier Girardot* | Associate
>>> o.girar...@lateral-thoughts.com
>>> +33 6 24 09 17 94
>>
>
> --
> *Olivier Girardot* | Associate
> o.girar...@lateral-thoughts.com
> +33 6 24 09 17 94
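
For reference, here is a minimal sketch of the per-column countByValue Olivier describes, written against the Spark 1.x DataFrame API used in the thread (sqlContext, unionAll). The helper name countByValuePerColumn is hypothetical and not part of Spark: it runs one groupBy per column and unions the results, so each column's values are counted independently instead of as n-tuples across all columns.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

// Hypothetical helper, not part of Spark: counts the values in each
// column independently, unlike df.groupBy(allColumns).count(), which
// counts n-tuples of column values.
def countByValuePerColumn(df: DataFrame): DataFrame = {
  df.columns
    .map { c =>
      // One aggregation per column; cast to string so the unioned
      // "value" column has a single, consistent type.
      df.groupBy(df(c).cast("string").as("value"))
        .count()
        .select(lit(c).as("column"), col("value"), col("count"))
    }
    .reduce(_ unionAll _) // unionAll in Spark 1.x; union in Spark 2+
}

// With Jonathan's sample df (column "n" holding A, B, B, A), this yields
// the rows (n, A, 2) and (n, B, 2), in no guaranteed order:
countByValuePerColumn(df).show()

One groupBy per column means one pass over the data per column; for the most-frequent-category use case in StringIndexer this is the straightforward approach, at the cost of scanning the input once per column unless the DataFrame is cached first.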