Hi,

You can "generate" the SQL through a program. Python example:
>>> schema = ['id', 'Mammals', 'Birds', 'Fish', 'Reptiles', 'Amphibians']
>>>
>>> count_stmt = ["count(distinct <tag>) as <tag>".replace("<tag>", x) for x in schema]
>>> count_stmt
['count(distinct id) as id', 'count(distinct Mammals) as Mammals', 'count(distinct Birds) as Birds', 'count(distinct Fish) as Fish', 'count(distinct Reptiles) as Reptiles', 'count(distinct Amphibians) as Amphibians']
>>> sql = "select "
>>> sql = sql + ", ".join(count_stmt) + " from t "
>>> sql
'select count(distinct id) as id, count(distinct Mammals) as Mammals, count(distinct Birds) as Birds, count(distinct Fish) as Fish, count(distinct Reptiles) as Reptiles, count(distinct Amphibians) as Amphibians from t '
>>>
>>> df.show()
+---+--------+----------+--------+--------+----------+
| id| Mammals|     Birds|    Fish|Reptiles|Amphibians|
+---+--------+----------+--------+--------+----------+
|  1|    Dogs|     Eagle|Goldfish|    NULL|      Frog|
|  2|    Cats|   Peacock|   Guppy|  Turtle|Salamander|
|  3|Dolphins|     Eagle|  Zander|    NULL|      Frog|
|  4|  Whales|    Parrot|   Guppy|   Snake|      Frog|
|  5|  Horses|       Owl|   Guppy|   Snake|      Frog|
|  6|Dolphins|Kingfisher|  Zander|  Turtle|      Frog|
|  7|    Dogs|   Sparrow|Goldfish|    NULL|Salamander|
+---+--------+----------+--------+--------+----------+
>>> cnr = sqlContext.sql(sql)
>>> cnr.show()
+---+-------+-----+----+--------+----------+
| id|Mammals|Birds|Fish|Reptiles|Amphibians|
+---+-------+-----+----+--------+----------+
|  7|      5|    6|   3|       3|         2|
+---+-------+-----+----+--------+----------+

Does that help?

On Thu, Oct 6, 2016 at 1:44 AM, Ajay Chander <itsche...@gmail.com> wrote:

> Hi Ayan,
>
> My schema for DF2 is fixed, but it has around 420 columns (70 animal-type
> columns and 350 other columns).
>
> Thanks,
> Ajay
>
> On Wed, Oct 5, 2016 at 10:37 AM, ayan guha <guha.a...@gmail.com> wrote:
>
>> Is your schema for df2 fixed? I.e., do you have 70 category columns?
>>
>> On Thu, Oct 6, 2016 at 12:50 AM, Daniel Siegmann <
>> dsiegm...@securityscorecard.io> wrote:
>>
>>> I think it's fine to read the animal types locally because there are
>>> only 70 of them. It's just that you want to execute the Spark actions in
>>> parallel. The easiest way to do that is to have only a single action.
>>>
>>> Instead of grabbing the result right away, I would just add a column
>>> for the animal type and union the datasets for the animal types.
>>> Something like this (not sure if the syntax is correct):
>>>
>>> val animalCounts: DataFrame = animalTypes.map { anmtyp =>
>>>   sqlContext.sql("select '" + anmtyp + "' as animal_type, " +
>>>     "count(distinct " + anmtyp + ") from TEST1")
>>> }.reduce(_.union(_))
>>>
>>> animalCounts.foreach( /* print the output */ )
>>>
>>> On Wed, Oct 5, 2016 at 12:42 AM, Daniel <daniel.ti...@gmail.com> wrote:
>>>
>>>> First of all, if you want to read a text file in Spark, you should use
>>>> sc.textFile. You are using "Source.fromFile", i.e. the Scala standard
>>>> library, so the file is read sequentially on the driver.
>>>>
>>>> Furthermore, you are going to need to create a schema if you want to
>>>> use DataFrames.
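>>>>
>>>> Something along these lines (just a sketch; the "animal_type" column
>>>> name is made up for illustration):
>>>>
>>>> import org.apache.spark.sql.Row
>>>> import org.apache.spark.sql.types.{StringType, StructField, StructType}
>>>>
>>>> // read the file through Spark instead of scala.io.Source,
>>>> // so the read itself is distributed
>>>> val animalTypesRDD = sc.textFile("/home/ajay/dataset/animal_types.txt")
>>>>
>>>> // attach an explicit one-column schema to get a DataFrame
>>>> val schema = StructType(Seq(StructField("animal_type", StringType)))
>>>> val animalTypesDF =
>>>>   sqlContext.createDataFrame(animalTypesRDD.map(Row(_)), schema)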
>>>>
>>>> On 5/10/2016 1:53, "Ajay Chander" <itsche...@gmail.com> wrote:
>>>>
>>>>> Right now, I am doing it like below:
>>>>>
>>>>> import scala.io.Source
>>>>>
>>>>> val animalsFile = "/home/ajay/dataset/animal_types.txt"
>>>>> val animalTypes = Source.fromFile(animalsFile).getLines.toArray
>>>>>
>>>>> for ( anmtyp <- animalTypes ) {
>>>>>   val distinctAnmTypCount = sqlContext.sql(
>>>>>     "select count(distinct(" + anmtyp + ")) from TEST1 ")
>>>>>   println("Calculating Metrics for Animal Type: " + anmtyp)
>>>>>   if ( distinctAnmTypCount.head().getAs[Long](0) <= 10 ) {
>>>>>     println("Animal Type: " + anmtyp + " has <= 10 distinct values")
>>>>>   } else {
>>>>>     println("Animal Type: " + anmtyp + " has > 10 distinct values")
>>>>>   }
>>>>> }
>>>>>
>>>>> But the problem is that it runs sequentially.
>>>>>
>>>>> Any inputs are appreciated. Thank you.
>>>>>
>>>>> Regards,
>>>>> Ajay
>>>>>
>>>>> On Tue, Oct 4, 2016 at 7:44 PM, Ajay Chander <itsche...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Everyone,
>>>>>>
>>>>>> I have a use case where I have two DataFrames, like below.
>>>>>>
>>>>>> 1) The first DataFrame (DF1) contains:
>>>>>>
>>>>>> *ANIMALS*
>>>>>> Mammals
>>>>>> Birds
>>>>>> Fish
>>>>>> Reptiles
>>>>>> Amphibians
>>>>>>
>>>>>> 2) The second DataFrame (DF2) contains:
>>>>>>
>>>>>> *ID, Mammals, Birds, Fish, Reptiles, Amphibians*
>>>>>> 1, Dogs, Eagle, Goldfish, NULL, Frog
>>>>>> 2, Cats, Peacock, Guppy, Turtle, Salamander
>>>>>> 3, Dolphins, Eagle, Zander, NULL, Frog
>>>>>> 4, Whales, Parrot, Guppy, Snake, Frog
>>>>>> 5, Horses, Owl, Guppy, Snake, Frog
>>>>>> 6, Dolphins, Kingfisher, Zander, Turtle, Frog
>>>>>> 7, Dogs, Sparrow, Goldfish, NULL, Salamander
>>>>>>
>>>>>> Now I want to take each row from DF1 and find its distinct count in
>>>>>> DF2. For example, pick Mammals from DF1, then find
>>>>>> count(distinct(Mammals)) in DF2, i.e. 5.
>>>>>>
>>>>>> DF1 has 70 distinct rows/animal types.
>>>>>> DF2 has a few million rows.
>>>>>>
>>>>>> What's the best way to achieve this efficiently using parallelism?
>>>>>>
>>>>>> Any inputs are helpful. Thank you.
>>>>>>
>>>>>> Regards,
>>>>>> Ajay
>>
>> --
>> Best Regards,
>> Ayan Guha

--
Best Regards,
Ayan Guha