Hi Daniel, I will try this one out and let you know. Thank you.
On Wed, Oct 5, 2016 at 9:50 AM, Daniel Siegmann <dsiegm...@securityscorecard.io> wrote:

> I think it's fine to read the animal types locally, because there are only
> 70 of them. It's just that you want to execute the Spark actions in
> parallel, and the easiest way to do that is to have only a single action.
>
> Instead of grabbing each result right away, I would add a column for the
> animal type and union the datasets across all the animal types. Something
> like this (not sure if the syntax is correct):
>
> val animalCounts: DataFrame = animalTypes.map { anmtyp =>
>   sqlContext.sql("select '" + anmtyp + "' as animal_type, " +
>     "count(distinct(" + anmtyp + ")) from TEST1")
> }.reduce(_.union(_))
>
> animalCounts.collect().foreach(println)  // print the output
>
> On Wed, Oct 5, 2016 at 12:42 AM, Daniel <daniel.ti...@gmail.com> wrote:
>
>> First of all, if you want to read a text file in Spark you should use
>> sc.textFile. You are using Source.fromFile, i.e. the Scala standard
>> library, so the file is read sequentially on the driver.
>>
>> Furthermore, you are going to need to create a schema if you want to use
>> DataFrames.
>>
>> On 5/10/2016 1:53, "Ajay Chander" <itsche...@gmail.com> wrote:
>>
>>> Right now, I am doing it like below:
>>>
>>> import scala.io.Source
>>>
>>> val animalsFile = "/home/ajay/dataset/animal_types.txt"
>>> val animalTypes = Source.fromFile(animalsFile).getLines.toArray
>>>
>>> for (anmtyp <- animalTypes) {
>>>   val distinctAnmTypCount = sqlContext.sql(
>>>     "select count(distinct(" + anmtyp + ")) from TEST1")
>>>   println("Calculating Metrics for Animal Type: " + anmtyp)
>>>   if (distinctAnmTypCount.head().getAs[Long](0) <= 10) {
>>>     println("Animal Type: " + anmtyp + " has <= 10 distinct values")
>>>   } else {
>>>     println("Animal Type: " + anmtyp + " has > 10 distinct values")
>>>   }
>>> }
>>>
>>> But the problem is that it runs sequentially.
>>>
>>> Any inputs are appreciated. Thank you.
>>>
>>> Regards,
>>> Ajay
>>>
>>> On Tue, Oct 4, 2016 at 7:44 PM, Ajay Chander <itsche...@gmail.com> wrote:
>>>
>>>> Hi Everyone,
>>>>
>>>> I have a use case with two DataFrames:
>>>>
>>>> 1) The first DataFrame (DF1) contains:
>>>>
>>>> ANIMALS
>>>> Mammals
>>>> Birds
>>>> Fish
>>>> Reptiles
>>>> Amphibians
>>>>
>>>> 2) The second DataFrame (DF2) contains:
>>>>
>>>> ID, Mammals, Birds, Fish, Reptiles, Amphibians
>>>> 1, Dogs, Eagle, Goldfish, NULL, Frog
>>>> 2, Cats, Peacock, Guppy, Turtle, Salamander
>>>> 3, Dolphins, Eagle, Zander, NULL, Frog
>>>> 4, Whales, Parrot, Guppy, Snake, Frog
>>>> 5, Horses, Owl, Guppy, Snake, Frog
>>>> 6, Dolphins, Kingfisher, Zander, Turtle, Frog
>>>> 7, Dogs, Sparrow, Goldfish, NULL, Salamander
>>>>
>>>> Now I want to take each row from DF1 and find its distinct count in
>>>> DF2. For example, pick Mammals from DF1, then compute
>>>> count(distinct(Mammals)) over DF2, i.e. 5.
>>>>
>>>> DF1 has 70 distinct rows/animal types.
>>>> DF2 has a few million rows.
>>>>
>>>> What's the best way to achieve this efficiently using parallelism?
>>>>
>>>> Any inputs are helpful. Thank you.
>>>>
>>>> Regards,
>>>> Ajay
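
For reference, a minimal self-contained sketch of the single-action idea
above, written against the DataFrame API instead of concatenated SQL
strings. It assumes the same sqlContext and registered TEST1 table as in
the thread, and that every line of animal_types.txt (the path from Ajay's
snippet) names a column of that table:

    import org.apache.spark.sql.functions.{col, countDistinct}

    // Read the ~70 animal types with sc.textFile, as suggested above;
    // at this size it is fine to collect them to the driver.
    val animalTypes = sc.textFile("/home/ajay/dataset/animal_types.txt").collect()

    // One select computes every distinct count in a single Spark job,
    // so the millions of rows in TEST1 are scanned only once.
    val counts = sqlContext.table("TEST1")
      .select(animalTypes.map(c => countDistinct(col(c)).alias(c)): _*)
      .head()

    animalTypes.foreach { anmtyp =>
      if (counts.getAs[Long](anmtyp) <= 10)
        println("Animal Type: " + anmtyp + " has <= 10 distinct values")
      else
        println("Animal Type: " + anmtyp + " has > 10 distinct values")
    }

Compared with the union-of-queries version, this builds one wide aggregate
over all 70 columns rather than 70 separate query plans.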
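If separate actions per animal type were genuinely needed, Spark also
permits submitting jobs concurrently from multiple driver threads. A
hedged sketch using Scala futures, not something proposed in the thread,
again assuming sqlContext, TEST1, and the animalTypes array from above:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    // Submit one counting job per animal type; the Spark scheduler can
    // run jobs coming from different threads concurrently.
    val futureCounts = animalTypes.toSeq.map { anmtyp =>
      Future {
        anmtyp -> sqlContext.sql(
          "select count(distinct(" + anmtyp + ")) from TEST1").head().getLong(0)
      }
    }

    Await.result(Future.sequence(futureCounts), Duration.Inf).foreach {
      case (anmtyp, n) =>
        println("Animal Type: " + anmtyp + " has " + n + " distinct values")
    }

This keeps each query simple but still scans TEST1 once per type, which is
why the single-action approaches above are the better fit here.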