I think the only version of Spark that works OK with Hive (as the Hive on
Spark engine) is version 1.3.1. I also get OOM from time to time and have to
revert to using MR.

Dr Mich Talebzadeh

LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

On 14 March 2016 at 08:06, Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:

> Which version of Spark are you using? The configuration varies by version.
>
> Regards
> Sab
>
> On Mon, Mar 14, 2016 at 10:53 AM, Prabhu Joseph <
> prabhujose.ga...@gmail.com> wrote:
>
>> Hi All,
>>
>> A Hive join query which runs fine and fast with MapReduce takes a lot of
>> time with Spark and finally fails with OOM.
>>
>> *Query:  hivejoin.py*
>>
>> from pyspark import SparkContext, SparkConf
>> from pyspark.sql import HiveContext
>>
>> conf = SparkConf().setAppName("Hive_Join")
>> sc = SparkContext(conf=conf)
>> hiveCtx = HiveContext(sc)
>>
>> # join A with B and C and write the result into table D
>> hiveCtx.sql("INSERT OVERWRITE TABLE D SELECT <80 columns> FROM A a "
>>             "INNER JOIN B b ON a.item_id = b.item_id "
>>             "LEFT JOIN C c ON c.instance_id = a.instance_id")
>>
>> results = hiveCtx.sql("SELECT COUNT(1) FROM D").collect()
>> print(results)
>>
>>
>> *Data Study:*
>>
>> Number of Rows:
>>
>> A table has 1002093508
>> B table has    5371668
>> C table has          1000
>>
>> No Data Skewness:
>>
>> item_id in B is unique and A has multiple rows with the same item_id, so
>> after the first INNER JOIN the result set is still the same 1002093508 rows.
>>
>> instance_id in C is unique and A has multiple rows with the same
>> instance_id (the maximum number of rows sharing one instance_id is 250).
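>>
>> (The counts above were checked with queries roughly like the following;
>> this is only a sketch, not the exact query that was run:)
>>
>> skew = hiveCtx.sql(
>>     "SELECT instance_id, COUNT(*) AS cnt FROM A "
>>     "GROUP BY instance_id ORDER BY cnt DESC LIMIT 10").collect()
>> print(skew)   # the largest cnt observed is 250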
>>
>> The Spark job runs with 90 executors, each with 2 cores and 6GB memory.
>> YARN allocated all the requested resources immediately, and no other job is
>> running on the cluster.
>>
>> spark.storage.memoryFraction     0.6
>> spark.shuffle.memoryFraction     0.2
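>>
>> (For reference, these resources amount to roughly the following SparkConf
>> settings; a sketch only, the exact submission mechanism may differ:)
>>
>> conf = (SparkConf()
>>         .setAppName("Hive_Join")
>>         .set("spark.executor.instances", "90")   # 90 executors on YARN
>>         .set("spark.executor.cores", "2")        # 2 cores each
>>         .set("spark.executor.memory", "6g")      # 6GB heap each
>>         .set("spark.storage.memoryFraction", "0.6")
>>         .set("spark.shuffle.memoryFraction", "0.2"))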
>>
>> Stage 2 - reads data from Hadoop; tasks have NODE_LOCAL locality and
>> shuffle-write 500GB of intermediate data.
>>
>> Stage 3 - does a shuffle read of the 500GB; tasks have PROCESS_LOCAL
>> locality and 400GB of output is shuffled again.
>>
>> Stage 4 - tasks fail with OOM while reading the shuffled output, after
>> having read only about 40GB of it.
>>
>> First of all, what kinds of Hive queries get better performance when run
>> on Spark than on MapReduce, and which Hive queries won't perform well on
>> Spark?
>>
>> How do I calculate the optimal executor heap size and the number of
>> executors for a given input data size? We don't ask the Spark executors to
>> cache any data, so how come the Stage 3 tasks report PROCESS_LOCAL? And why
>> does Stage 4 fail so soon, when it has read just 40GB of data? Is it
>> caching data in memory?
>>
>> Also, in a Spark job some stages need a lot of memory for shuffle and
>> others need a lot of memory for cache. So when an executor has a lot of
>> memory available for cache, does not use it, and then needs to do a lot of
>> shuffling, will it only use the fraction configured for shuffle, or will it
>> also use the free memory that was set aside for cache?
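>>
>> (For context, my rough understanding of how each 6GB heap is split under
>> the fractions above; this arithmetic may well be wrong:)
>>
>> executor_heap_gb = 6.0
>> storage_gb = 0.6 * executor_heap_gb   # ~3.6GB set aside for cached data
>> shuffle_gb = 0.2 * executor_heap_gb   # ~1.2GB set aside for shuffle buffers
>> other_gb = executor_heap_gb - storage_gb - shuffle_gb   # ~1.2GB for the rest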
>>
>>
>> Thanks,
>> Prabhu Joseph
>>
>
>
> --
>
> Architect - Big Data
> Ph: +91 99805 99458
>
> Manthan Systems | Company of the year - Analytics (2014 Frost and
> Sullivan India ICT)
>
