Right now, I am doing it like below:

    import scala.io.Source

    // Read the animal types (one per line) from a local file.
    val animalsFile = "/home/ajay/dataset/animal_types.txt"
    val animalTypes = Source.fromFile(animalsFile).getLines.toArray

    // Run one query per animal type. Each head() call blocks the driver,
    // so the queries execute one after another.
    for ( anmtyp <- animalTypes ) {
      val distinctAnmTypCount = sqlContext.sql("select count(distinct("+anmtyp+")) from TEST1 ")
      println("Calculating Metrics for Animal Type: "+anmtyp)
      if( distinctAnmTypCount.head().getAs[Long](0) <= 10 ){
        println("Animal Type: "+anmtyp+" has <= 10 distinct values")
      } else {
        println("Animal Type: "+anmtyp+" has > 10 distinct values")
      }
    }

But the problem is that it runs sequentially: one query per animal type, one after another.

Any inputs are appreciated. Thank you.

Regards,
Ajay

On Tue, Oct 4, 2016 at 7:44 PM, Ajay Chander <itsche...@gmail.com> wrote:

> Hi Everyone,
>
> I have a use-case where I have two Dataframes like below,
>
> 1) First Dataframe(DF1) contains,
>
> * ANIMALS *
> Mammals
> Birds
> Fish
> Reptiles
> Amphibians
>
> 2) Second Dataframe(DF2) contains,
>
> * ID, Mammals, Birds, Fish, Reptiles, Amphibians *
> 1, Dogs, Eagle, Goldfish, NULL, Frog
> 2, Cats, Peacock, Guppy, Turtle, Salamander
> 3, Dolphins, Eagle, Zander, NULL, Frog
> 4, Whales, Parrot, Guppy, Snake, Frog
> 5, Horses, Owl, Guppy, Snake, Frog
> 6, Dolphins, Kingfisher, Zander, Turtle, Frog
> 7, Dogs, Sparrow, Goldfish, NULL, Salamander
>
> Now I want to take each row from DF1 and find out its distinct count in
> DF2. Example, pick Mammals from DF1 then find out count(distinct(Mammals))
> from DF2 i.e. 5
>
> DF1 has 70 distinct rows/Animal types
> DF2 has some million rows
>
> What's the best way to achieve this efficiently using parallelism?
>
> Any inputs are helpful. Thank you.
>
> Regards,
> Ajay
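
One possible direction for the question above, offered as a sketch rather than a definitive answer: instead of issuing one query per animal type, all the distinct counts can be requested in a single aggregation, so Spark scans the table once and parallelises the work across partitions. The example assumes the Spark 2.x DataFrame API (`SparkSession`, `countDistinct`); the object name `DistinctCountsSketch`, the helpers `distinctCounts` and `demo`, and the local session are hypothetical and exist only for illustration. The sample rows are the seven rows of DF2 from the quoted message.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, countDistinct}

// Hypothetical helper object, not part of the original post.
object DistinctCountsSketch {

  // Builds one aggregation with a countDistinct per column, so all counts
  // come back from a single Spark job instead of one job per column.
  def distinctCounts(df: DataFrame, columns: Seq[String]): Map[String, Long] = {
    require(columns.nonEmpty, "at least one column is required")
    val aggs = columns.map(c => countDistinct(col(c)).alias(c))
    val row = df.agg(aggs.head, aggs.tail: _*).head()
    columns.map(c => c -> row.getAs[Long](c)).toMap
  }

  // Runs the helper on the sample DF2 rows from the quoted message,
  // using a local session (an assumption made for this sketch).
  def demo(): Map[String, Long] = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("distinct-counts-sketch")
      .getOrCreate()
    import spark.implicits._

    // NULL in the Reptiles column is modelled as None.
    val rows: Seq[(Int, String, String, String, Option[String], String)] = Seq(
      (1, "Dogs", "Eagle", "Goldfish", None, "Frog"),
      (2, "Cats", "Peacock", "Guppy", Some("Turtle"), "Salamander"),
      (3, "Dolphins", "Eagle", "Zander", None, "Frog"),
      (4, "Whales", "Parrot", "Guppy", Some("Snake"), "Frog"),
      (5, "Horses", "Owl", "Guppy", Some("Snake"), "Frog"),
      (6, "Dolphins", "Kingfisher", "Zander", Some("Turtle"), "Frog"),
      (7, "Dogs", "Sparrow", "Goldfish", None, "Salamander")
    )
    val df2 = rows.toDF("ID", "Mammals", "Birds", "Fish", "Reptiles", "Amphibians")

    val animalTypes = Seq("Mammals", "Birds", "Fish", "Reptiles", "Amphibians")
    val counts = distinctCounts(df2, animalTypes)
    spark.stop()
    counts
  }

  def main(args: Array[String]): Unit =
    demo().foreach { case (anmtyp, n) =>
      val relation = if (n <= 10) "<=" else ">"
      println(s"Animal Type: $anmtyp has $relation 10 distinct values ($n)")
    }
}
```

With the real data, the same helper could presumably be called on the TEST1 table with the 70 names read from the file. If separate queries are still preferred, Spark does accept jobs submitted concurrently from multiple driver threads, which is another option to explore; for very large inputs an approximate distinct count may also be worth considering.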