This link may help:
https://forums.databricks.com/questions/6747/how-do-i-get-a-cartesian-product-of-a-huge-dataset.html

Spark 1.6 improved the CartesianProduct, so you should turn off auto
broadcast and go with CartesianProduct in 1.6.
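
A rough PySpark sketch of that setup (untested, and just a sketch: the
second input and any names not taken from the plan below are
placeholders):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

# Disable automatic broadcast joins so the planner does not try to
# broadcast a side of this non-equi join.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")

# The dimension table path comes from the plan below; the second input
# is a placeholder for however that dataframe is actually created.
dim = sqlContext.read.parquet("hdfs:///databases/dimensions/customer_dimension")
incoming = sqlContext.table("new_customers")  # placeholder name

# Null-safe equality on both keys, matching the join condition in the plan.
cond = (((dim["customer_id"] == incoming["customer_id"]) |
         (dim["customer_id"].isNull() & incoming["customer_id"].isNull())) &
        ((dim["country"] == incoming["country"]) |
         (dim["country"].isNull() & incoming["country"].isNull())))

joined = dim.join(incoming, cond, "left_outer")
joined.explain()  # check which physical join is picked after the change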

On Mon, Feb 22, 2016 at 1:45 AM, Mohannad Ali <man...@gmail.com> wrote:
> Hello everyone,
>
> I'm working with Tamara and I wanted to give you guys an update on the
> issue:
>
> 1. Here is the output of .explain():
>>
>> Project
>> [sk_customer#0L,customer_id#1L,country#2,email#3,birthdate#4,gender#5,fk_created_at_date#6,age_range#7,first_name#8,last_name#9,inserted_at#10L,updated_at#11L,customer_id#25L
>> AS new_customer_id#38L,country#24 AS new_country#39,email#26 AS
>> new_email#40,birthdate#29 AS new_birthdate#41,gender#31 AS
>> new_gender#42,fk_created_at_date#32 AS
>> new_fk_created_at_date#43,age_range#30 AS new_age_range#44,first_name#27 AS
>> new_first_name#45,last_name#28 AS new_last_name#46]
>>  BroadcastNestedLoopJoin BuildLeft, LeftOuter, Some((((customer_id#1L =
>> customer_id#25L) || (isnull(customer_id#1L) && isnull(customer_id#25L))) &&
>> ((country#2 = country#24) || (isnull(country#2) && isnull(country#24)))))
>>   Scan
>> PhysicalRDD[country#24,customer_id#25L,email#26,first_name#27,last_name#28,birthdate#29,age_range#30,gender#31,fk_created_at_date#32]
>>   Scan
>> ParquetRelation[hdfs:///databases/dimensions/customer_dimension][sk_customer#0L,customer_id#1L,country#2,email#3,birthdate#4,gender#5,fk_created_at_date#6,age_range#7,first_name#8,last_name#9,inserted_at#10L,updated_at#11L]
>
>
> 2. Setting spark.sql.autoBroadcastJoinThreshold=-1 didn't make a difference.
> It still hangs indefinitely.
> 3. We are using Spark 1.5.2
> 4. We tried running this with 4 executors, 9 executors, and even in local
> mode with master set to "local[4]". The issue still persists in all cases.
> 5. Even without trying to cache any of the dataframes this issue still
> happens.
> 6. We have about 200 partitions.
>
> Any help would be appreciated!
>
> Best Regards,
> Mo
>
> On Sun, Feb 21, 2016 at 8:39 PM, Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
>>
>> Sorry,
>>
>> please add the following questions to the list above:
>>
>> the SPARK version?
>> whether you are using RDD or DataFrames?
>> is the code run locally or in SPARK Cluster mode or in AWS EMR?
>>
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Sun, Feb 21, 2016 at 7:37 PM, Gourav Sengupta
>> <gourav.sengu...@gmail.com> wrote:
>>>
>>> Hi Tamara,
>>>
>>> A few basic questions first.
>>>
>>> How many executors are you using?
>>> Is the data getting all cached into the same executor?
>>> How many partitions do you have of the data?
>>> How many fields are you trying to use in the join?
>>>
>>> If you need any help in finding answers to these questions, please let me
>>> know. From what I reckon, joins like yours should not take more than a few
>>> milliseconds.
>>>
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>> On Fri, Feb 19, 2016 at 5:31 PM, Tamara Mendt <t...@hellofresh.com> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I am running a Spark job that gets stuck attempting to join two
>>>> dataframes. The dataframes are not very large, one is about 2 M rows, and
>>>> the other a couple of thousand rows and the resulting joined dataframe
>>>> should be about the same size as the smaller dataframe. I have tried
>>>> triggering execution of the join using the 'first' operator, which as far
>>>> as I understand would not require processing the entire resulting dataframe
>>>> (maybe I am mistaken though). The Spark UI is not telling me anything, just
>>>> showing the task to be stuck.
>>>>
>>>> When I run the exact same job on a slightly smaller dataset it works
>>>> without hanging.
>>>>
>>>> I have used the same environment to run joins on much larger dataframes,
>>>> so I am confused as to why in this particular case my Spark job is just
>>>> hanging. I have also tried running the same join operation using pyspark on
>>>> two 2-million-row dataframes (exactly like the one I am trying to join in
>>>> the job that gets stuck) and it runs successfully.
>>>>
>>>> I have tried caching the joined dataframe to see how much memory it is
>>>> requiring, but the job gets stuck on this action too. I have also tried
>>>> using persist to memory and disk on the join, and the job seems to be stuck
>>>> all the same.
>>>>
>>>> Any help as to where to look for the source of the problem would be much
>>>> appreciated.
>>>>
>>>> Cheers,
>>>>
>>>> Tamara
>>>>
>>>
>>
>
