Hi Ayan,

My schema for DF2 is fixed, but it has around 420 columns (70 animal-type
columns and 350 other columns).

Thanks,
Ajay

On Wed, Oct 5, 2016 at 10:37 AM, ayan guha <guha.a...@gmail.com> wrote:

> Is your schema for df2 fixed? i.e., do you have 70 category columns?
>
> On Thu, Oct 6, 2016 at 12:50 AM, Daniel Siegmann <
> dsiegm...@securityscorecard.io> wrote:
>
>> I think it's fine to read animal types locally because there are only 70
>> of them. It's just that you want to execute the Spark actions in parallel.
>> The easiest way to do that is to have only a single action.
>>
>> Instead of grabbing the result right away, I would just add a column for
>> the animal type and union the per-type results. Something like this
>> (untested):
>>
>> val animalCounts: DataFrame = animalTypes.map { anmtyp =>
>>   sqlContext.sql(s"select '$anmtyp' as animal_type, " +
>>     s"count(distinct $anmtyp) as distinct_count from TEST1")
>> }.reduce(_.union(_))
>>
>> // collect() is the only action, so everything runs as one job
>> animalCounts.collect().foreach(println)
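>>
>> Another option that skips the union entirely is to compute every distinct
>> count as one aggregation over a single scan of the table. This is just a
>> sketch, assuming DF2 is registered as the temp table TEST1 and animalTypes
>> is the local Array[String] from your snippet:
>>
>> import org.apache.spark.sql.functions.countDistinct
>>
>> // One Spark job, one pass over TEST1; the result is a single row
>> // with one distinct-count column per animal type.
>> val counts = sqlContext.table("TEST1")
>>   .select(animalTypes.map(c => countDistinct(c).as(c)): _*)
>>
>> counts.show()
>>
>> The trade-off: you get one wide row instead of one row per animal type,
>> but the whole thing is a single scan of the data.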
>>
>> On Wed, Oct 5, 2016 at 12:42 AM, Daniel <daniel.ti...@gmail.com> wrote:
>>
>>> First of all, if you want to read a text file in Spark you should use
>>> sc.textFile. You are currently using "Source.fromFile", which reads the
>>> file through the Scala standard API on the driver, so it is read
>>> sequentially.
>>>
>>> Furthermore, you will need to create a schema if you want to use
>>> DataFrames.
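>>>
>>> A minimal sketch of that read, assuming the same path as in your snippet
>>> (sc is the SparkContext):
>>>
>>> // Distributed read; collect() is fine here since there are only ~70 lines.
>>> val animalTypes = sc.textFile("/home/ajay/dataset/animal_types.txt")
>>>   .map(_.trim)
>>>   .filter(_.nonEmpty)
>>>   .collect()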
>>>
>>> On 5/10/2016 1:53, "Ajay Chander" <itsche...@gmail.com> wrote:
>>>
>>>> Right now, I am doing it like below,
>>>>
>>>> import scala.io.Source
>>>>
>>>> val animalsFile = "/home/ajay/dataset/animal_types.txt"
>>>> val animalTypes = Source.fromFile(animalsFile).getLines.toArray
>>>>
>>>> for (anmtyp <- animalTypes) {
>>>>   // sql() is lazy; head() below is the action that actually runs a job
>>>>   val distinctAnmTypCount = sqlContext.sql(
>>>>     "select count(distinct " + anmtyp + ") from TEST1")
>>>>   println("Calculating Metrics for Animal Type: " + anmtyp)
>>>>   if (distinctAnmTypCount.head().getAs[Long](0) <= 10) {
>>>>     println("Animal Type: " + anmtyp + " has <= 10 distinct values")
>>>>   } else {
>>>>     println("Animal Type: " + anmtyp + " has > 10 distinct values")
>>>>   }
>>>> }
>>>>
>>>> But the problem is that it runs sequentially, one Spark job per column.
>>>>
>>>> Any inputs are appreciated. Thank you.
>>>>
>>>>
>>>> Regards,
>>>> Ajay
>>>>
>>>>
>>>> On Tue, Oct 4, 2016 at 7:44 PM, Ajay Chander <itsche...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Everyone,
>>>>>
>>>>> I have a use-case where I have two Dataframes like below,
>>>>>
>>>>> 1) First Dataframe(DF1) contains,
>>>>>
>>>>> *    ANIMALS    *
>>>>> Mammals
>>>>> Birds
>>>>> Fish
>>>>> Reptiles
>>>>> Amphibians
>>>>>
>>>>> 2) Second Dataframe(DF2) contains,
>>>>>
>>>>> *    ID, Mammals, Birds, Fish, Reptiles, Amphibians    *
>>>>> 1, Dogs,     Eagle,      Goldfish, NULL,   Frog
>>>>> 2, Cats,     Peacock,    Guppy,    Turtle, Salamander
>>>>> 3, Dolphins, Eagle,      Zander,   NULL,   Frog
>>>>> 4, Whales,   Parrot,     Guppy,    Snake,  Frog
>>>>> 5, Horses,   Owl,        Guppy,    Snake,  Frog
>>>>> 6, Dolphins, Kingfisher, Zander,   Turtle, Frog
>>>>> 7, Dogs,     Sparrow,    Goldfish, NULL,   Salamander
>>>>>
>>>>> Now I want to take each row from DF1 and find the distinct count of
>>>>> the matching column in DF2. For example, pick Mammals from DF1, then
>>>>> compute count(distinct(Mammals)) over DF2, which gives 5.
>>>>>
>>>>> DF1 has 70 distinct rows/animal types.
>>>>> DF2 has a few million rows.
>>>>>
>>>>> What's the best way to achieve this efficiently, using parallelism?
>>>>>
>>>>> Any inputs are helpful. Thank you.
>>>>>
>>>>> Regards,
>>>>> Ajay
>>>>>
>>>>>
>>>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>
