Right now, I am doing it as shown below:

import scala.io.Source

val animalsFile = "/home/ajay/dataset/animal_types.txt"
val animalTypes = Source.fromFile(animalsFile).getLines.toArray

// Runs one Spark SQL query per animal type, one after another.
for (anmtyp <- animalTypes) {
  val distinctAnmTypCount =
    sqlContext.sql("select count(distinct(" + anmtyp + ")) from TEST1")
  println("Calculating Metrics for Animal Type: " + anmtyp)
  if (distinctAnmTypCount.head().getAs[Long](0) <= 10) {
    println("Animal Type: " + anmtyp + " has <= 10 distinct values")
  } else {
    println("Animal Type: " + anmtyp + " has > 10 distinct values")
  }
}

But the problem is that the queries run sequentially, one animal type at a time.
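
For comparison, a minimal untested sketch of a single-pass variant: instead of one query per animal type, it builds one countDistinct expression per column and evaluates them all in a single aggregation over TEST1. It assumes a spark-shell session where sqlContext is already defined, that TEST1 is a registered table, and that every line of the file is a valid column name. The names aggs and counts are only illustrative.

```scala
import scala.io.Source
import org.apache.spark.sql.functions.{col, countDistinct}

// Assumption: same file as above, one column name per line.
val animalTypes = Source.fromFile("/home/ajay/dataset/animal_types.txt")
  .getLines.map(_.trim).filter(_.nonEmpty).toArray

// One count(distinct ...) expression per animal type column.
val aggs = animalTypes.map(c => countDistinct(col(c)).alias(c))

// A single global aggregation; the result is one row with one count per column.
val counts = sqlContext.table("TEST1").agg(aggs.head, aggs.tail: _*).head()

for (anmtyp <- animalTypes) {
  val n = counts.getAs[Long](anmtyp)
  val cmp = if (n <= 10) "<=" else ">"
  println("Animal Type: " + anmtyp + " has " + cmp + " 10 distinct values")
}
```

With this shape Spark scans the table once and distributes the work across executors, rather than the driver launching 70 jobs back to back. Whether it is actually faster on the real data would need to be measured.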

Any inputs are appreciated. Thank you.


Regards,
Ajay


On Tue, Oct 4, 2016 at 7:44 PM, Ajay Chander <itsche...@gmail.com> wrote:

> Hi Everyone,
>
> I have a use-case where I have two Dataframes like below,
>
> 1) First Dataframe(DF1) contains,
>
> *    ANIMALS    *
> Mammals
> Birds
> Fish
> Reptiles
> Amphibians
>
> 2) Second Dataframe(DF2) contains,
>
> *    ID, Mammals, Birds, Fish, Reptiles, Amphibians    *
> 1,      Dogs,      Eagle,      Goldfish,      NULL,      Frog
> 2,      Cats,      Peacock,      Guppy,     Turtle,      Salamander
> 3,      Dolphins,      Eagle,      Zander,      NULL,      Frog
> 4,      Whales,      Parrot,      Guppy,      Snake,      Frog
> 5,      Horses,      Owl,      Guppy,      Snake,      Frog
> 6,      Dolphins,      Kingfisher,      Zander,      Turtle,      Frog
> 7,      Dogs,      Sparrow,      Goldfish,      NULL,      Salamander
>
> Now I want to take each row from DF1 and find its distinct count in
> DF2. For example, pick Mammals from DF1, then compute
> count(distinct(Mammals)) on DF2, which is 5.
>
> DF1 has 70 distinct rows/Animal types
> DF2 has some million rows
>
> What's the best way to achieve this efficiently using parallelism?
>
> Any inputs are helpful. Thank you.
>
> Regards,
> Ajay
>
>
