I think it's an expression, rather than a function you'd find in the API (with a plain method chain you could do df.select(col).distinct.count).
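To make the contrast concrete, here is a minimal sketch of that method-chain style (my own example, assuming a DataFrame df with a "name" column). Each .count is a separate action, so repeating it per column launches one job per column:

// distinct values of a single column, counted as rows
val distinctNames: Long = df.select("name").distinct.count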
With the expression form, this will give you the number of distinct rows over both columns:

scala> df.select(countDistinct("name", "age"))
res397: org.apache.spark.sql.DataFrame = [COUNT(DISTINCT name,age): bigint]

Whereas this will give you the number of distinct values in each column:

scala> df.select(countDistinct("name"), countDistinct("age"))
res398: org.apache.spark.sql.DataFrame = [COUNT(DISTINCT name): bigint, COUNT(DISTINCT age): bigint]

Of course, when you need many columns at once this expression becomes tedious, so I find it easiest to construct an SQL statement from the column names, like so:

df.registerTempTable("df")
val sqlStatement = "select " + df.columns.map(col => s"count(distinct $col) as ${col}_distinct").mkString(", ") + " from df"
sqlContext.sql(sqlStatement)

But this is not efficient - see this Jira ticket <https://issues.apache.org/jira/browse/SPARK-4243> and the fix.

On Tue, Jan 5, 2016 at 5:55 AM, Arunkumar Pillai <arunkumar1...@gmail.com> wrote:

> Thanks Yanbo,
>
> Thanks for the help. But I'm not able to find the countDistinct or
> approxCountDistinct functions. Are these within DataFrame or in some
> other package?
>
> On Tue, Jan 5, 2016 at 3:24 PM, Yanbo Liang <yblia...@gmail.com> wrote:
>
>> Hi Arunkumar,
>>
>> You can use datasetDF.select(countDistinct(col1, col2, col3, ...)) or
>> approxCountDistinct for an approximate result.
>>
>> 2016-01-05 17:11 GMT+08:00 Arunkumar Pillai <arunkumar1...@gmail.com>:
>>
>>> Hi
>>>
>>> Is there any function to find the distinct count of all the variables in a
>>> dataframe?
>>>
>>> val sc = new SparkContext(conf) // spark context
>>> val options = Map("header" -> "true", "delimiter" -> delimiter, "inferSchema" -> "true")
>>> val sqlContext = new org.apache.spark.sql.SQLContext(sc) // sql context
>>> val datasetDF = sqlContext.read.format("com.databricks.spark.csv").options(options).load(inputFile)
>>>
>>> We are able to get the schema and the variable data types. Is there any
>>> method to get the distinct count?
>>>
>>> --
>>> Thanks and Regards
>>> Arun
>>
>
> --
> Thanks and Regards
> Arun
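For completeness, the per-column counts can also be assembled programmatically in the DataFrame API instead of building a SQL string. This is only a sketch of that idea (not something from the thread above), assuming a DataFrame df and the countDistinct function from org.apache.spark.sql.functions; since it still evaluates many distinct aggregates in one query, the efficiency caveat from SPARK-4243 presumably applies here as well:

import org.apache.spark.sql.functions.countDistinct

// Build one countDistinct expression per column, aliased as <column>_distinct,
// and evaluate them all in a single select.
val exprs = df.columns.map(c => countDistinct(c).alias(s"${c}_distinct"))
val distinctCounts = df.select(exprs: _*)
distinctCounts.show()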