I added the following JIRA: https://issues.apache.org/jira/browse/SPARK-9237
Please help me get it assigned to myself, thanks.

Ted Malaska

On Tue, Jul 21, 2015 at 7:53 PM, Ted Malaska <ted.mala...@cloudera.com> wrote:

Cool, I will make a JIRA after I check in to my hotel, and try to get a patch out early next week.

On Jul 21, 2015 5:15 PM, "Olivier Girardot" <o.girar...@lateral-thoughts.com> wrote:

Yes, and freqItems does not give you an ordered count (right?). Also, the threshold makes it difficult to calibrate, and we noticed some strange behaviour when testing it on small datasets.

2015-07-21 20:30 GMT+02:00 Ted Malaska <ted.mala...@cloudera.com>:

Look at the implementation of frequent items; it is different from a true count.

On Jul 21, 2015 1:19 PM, "Reynold Xin" <r...@databricks.com> wrote:

Is this just frequent items?
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L97

On Tue, Jul 21, 2015 at 7:39 AM, Ted Malaska <ted.mala...@cloudera.com> wrote:

100%, I would love to do it. Who would be a good person to review the design with? All I need is a quick chat about the design and approach, and I'll create the JIRA and push a patch.

Ted Malaska

On Tue, Jul 21, 2015 at 10:19 AM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:

Hi Ted,
The TopNList would be great to see directly in the DataFrame API, and my wish would be to be able to apply it to multiple columns at the same time and get all these statistics. The .describe() function is close to what we want to achieve; maybe we could try to enrich its output. Anyway, even as a spark-package, if you could package your code for DataFrames, that would be great.

Regards,

Olivier.

2015-07-21 15:08 GMT+02:00 Jonathan Winandy <jonathan.wina...@gmail.com>:

Ha, OK!

Then the generic part would have this signature:

def countColsByValue(df: DataFrame): Map[String /* colname */, DataFrame]

+1 for more work (blog / API) on data quality checks.

Cheers,
Jonathan

TopCMSParams and some other monoids from Algebird are really cool for that:
https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L590

On 21 July 2015 at 13:40, Ted Malaska <ted.mala...@cloudera.com> wrote:

I'm guessing you want something like what I put in this blog post:
http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/

This is a very common use case. If there is a +1, I would love to add it to DataFrames.

Let me know,
Ted Malaska

On Tue, Jul 21, 2015 at 7:24 AM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:

Yop,
Actually the generic part does not work: countByValue on one column gives you the count for each value seen in that column. I would like a generic (multi-column) countByValue that gives the same kind of output for each column separately, rather than treating each n-tuple of column values as the key (which is what the groupBy above does by default).

Regards,

Olivier
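A minimal sketch of the per-column countByValue Olivier is asking for, following the countColsByValue signature Jonathan proposed above; this is an illustration against the Spark 1.4-era DataFrame API, not code from the eventual patch:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.desc

// Group by each column independently and count, ordered by descending
// frequency, so each column gets its own (value, count) DataFrame.
// Note: this launches one Spark job per column.
def countColsByValue(df: DataFrame): Map[String /* colname */, DataFrame] =
  df.columns.map { c =>
    c -> df.groupBy(c).count().orderBy(desc("count"))
  }.toMap

// e.g. countColsByValue(df)("n").show()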
2015-07-20 14:18 GMT+02:00 Jonathan Winandy <jonathan.wina...@gmail.com>:

Ahoy!

Maybe you can get countByValue by using sql.GroupedData:

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// some DF
val df: DataFrame = sqlContext.createDataFrame(
  sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)),
  StructType(List(StructField("n", StringType))))

df.groupBy("n").count().show()

// generic: group by every column at once and count
def countByValueDf(df: DataFrame) = {
  val (h :: r) = df.columns.toList
  df.groupBy(h, r: _*).count()
}

countByValueDf(df).show()

Cheers,
Jon

On 20 July 2015 at 11:28, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:

Hi,
Is there any plan to add the countByValue function to the Spark SQL DataFrame API? Even https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78 is using the RDD API right now, but for ML purposes, being able to get the most frequent categorical value on multiple columns would be very useful.

Regards,

--
Olivier Girardot | Associé
o.girar...@lateral-thoughts.com
+33 6 24 09 17 94
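To round out the freqItems discussion above, a short sketch contrasting the existing approximate df.stat.freqItems call with an exact ordered count; the sample data and the column name "n" are made up for illustration:

import org.apache.spark.sql.functions.desc

val df = sqlContext.createDataFrame(
  sc.parallelize(List("A", "B", "B", "A", "B")).map(Tuple1.apply)
).toDF("n")

// Approximate: a single pass, but it returns an unordered array of
// candidate items with no counts, and false positives are possible.
df.stat.freqItems(Seq("n"), 0.4).show()

// Exact: requires a shuffle, but yields true counts ordered by frequency.
df.groupBy("n").count().orderBy(desc("count")).show()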