Re: finding distinct count using dataframe

Kristina Rogale Plazonic Tue, 05 Jan 2016 06:28:48 -0800

I think it's an expression, rather than a function you'd find in the API
 (as a function you could do   df.select(col).distinct.count)


This will give you the number of distinct rows in both columns:
scala> df.select(countDistinct("name", "age"))
res397: org.apache.spark.sql.DataFrame = [COUNT(DISTINCT name,age): bigint]

Whereas this will give you the number of distinct values in each column:
scala> df.select(countDistinct("name"), countDistinct("age"))
res398: org.apache.spark.sql.DataFrame = [COUNT(DISTINCT name): bigint,
COUNT(DISTINCT age): bigint]

Of course, when you need many columns at once, this expression becomes
tedious, so I find it easiest to construct an sql statement from column
names, like so:

df.registerTempTable("df")
val sqlstatement = "select "+ df.columns.map( col => s"count (distinct
$col) as ${col}_distinct").mkString(", ") + " from df"
sqlContext.sql(sqlstatement)

But this is not efficient - see this Jira ticket
<https://issues.apache.org/jira/browse/SPARK-4243>and the fix.

On Tue, Jan 5, 2016 at 5:55 AM, Arunkumar Pillai <arunkumar1...@gmail.com>
wrote:

> Thanks Yanbo,
>
> Thanks for the help. But I'm not able to find countDistinct ot
> approxCountDistinct. function. These functions are within dataframe or any
> other package
>
> On Tue, Jan 5, 2016 at 3:24 PM, Yanbo Liang <yblia...@gmail.com> wrote:
>
>> Hi Arunkumar,
>>
>> You can use datasetDF.select(countDistinct(col1, col2, col3, ...)) or
>> approxCountDistinct for a approximate result.
>>
>> 2016-01-05 17:11 GMT+08:00 Arunkumar Pillai <arunkumar1...@gmail.com>:
>>
>>> Hi
>>>
>>> Is there any   functions to find distinct count of all the variables in
>>> dataframe.
>>>
>>> val sc = new SparkContext(conf) // spark context
>>> val options = Map("header" -> "true", "delimiter" -> delimiter, 
>>> "inferSchema" -> "true")
>>> val sqlContext = new org.apache.spark.sql.SQLContext(sc) // sql context
>>> val datasetDF = 
>>> sqlContext.read.format("com.databricks.spark.csv").options(options).load(inputFile)
>>>
>>>
>>> we are able to get the schema, variable data type. is there any method to 
>>> get the distinct count ?
>>>
>>>
>>>
>>> --
>>> Thanks and Regards
>>>         Arun
>>>
>>
>>
>
>
> --
> Thanks and Regards
>         Arun
>

Re: finding distinct count using dataframe

Reply via email to