You can try the dropDuplicates function: https://github.com/spirom/LearningSpark/blob/master/src/main/scala/dataframe/DropDuplicates.scala
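For instance, a minimal sketch of applying that idea to this problem (assuming a DataFrame `df`; `dropConstantCols` is a hypothetical helper, not part of the linked example):

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper (not in the linked example): keep only the
// columns that have at least two distinct values. Checking each
// column with dropDuplicates().limit(2) asks Spark for at most two
// distinct values per column, although whether the scan actually
// stops early depends on the physical plan.
def dropConstantCols(df: DataFrame): DataFrame = {
  val keep = df.columns.filter { c =>
    df.select(c).dropDuplicates().limit(2).count() > 1
  }
  df.select(keep.map(df.col): _*)
}
```

Note that this launches one Spark job per column, so it trades scan depth for job count; it pays off when most columns are constant and can be rejected cheaply.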
On 31 May 2018 at 16:34, <julio.ces...@free.fr> wrote:
> Hi there!
>
> I have a potentially large dataset (in both number of rows and
> columns), and I want to find the fastest way to drop the columns
> that are useless to me, i.e. the columns containing only a single
> distinct value.
>
> What do you think I could do to make this as fast as possible in
> Spark?
>
> I already have a solution using distinct().count() or
> approxCountDistinct(), but these may not be the best choice, since
> they have to go through all the data even when the first two values
> tested in a column already differ (in which case I know I can keep
> the column).
>
> Thanks for your ideas!
>
> Julien
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
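For reference, a sketch of the single-pass approach Julien describes, using one aggregation over all columns (approx_count_distinct is the DataFrame-functions spelling of approxCountDistinct in recent Spark versions; `dropConstantColsApprox` is a hypothetical helper name, and the getAs lookup assumes plain column names without dots):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.approx_count_distinct

// One full pass over the data: estimate the distinct count of every
// column in a single aggregation, then keep the columns whose
// estimate exceeds 1. The estimate is approximate, but very small
// cardinalities (1 vs. 2) are distinguished reliably in practice.
def dropConstantColsApprox(df: DataFrame): DataFrame = {
  val aggs   = df.columns.map(c => approx_count_distinct(c).alias(c))
  val counts = df.agg(aggs.head, aggs.tail: _*).first()
  val keep   = df.columns.filter(c => counts.getAs[Long](c) > 1)
  df.select(keep.map(df.col): _*)
}
```

This variant always scans everything once, so it cannot exit early, but it runs a single job regardless of the number of columns.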