I am not sure about this. At least Hortonworks provides its distribution with 
Hive and Spark 1.6.

> On 14 Mar 2016, at 09:25, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> I think the only version of Spark that works OK with Hive (Hive on Spark 
> engine) is version 1.3.1. I also get OOM from time to time and have to revert 
> to using MR
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 14 March 2016 at 08:06, Sabarish Sasidharan 
>> <sabarish.sasidha...@manthan.com> wrote:
>> Which version of Spark are you using? The configuration varies by version.
>> 
>> Regards
>> Sab
>> 
>>> On Mon, Mar 14, 2016 at 10:53 AM, Prabhu Joseph 
>>> <prabhujose.ga...@gmail.com> wrote:
>>> Hi All,
>>> 
>>> A Hive join query that runs fine and fast with MapReduce takes a lot of time 
>>> with Spark and finally fails with an OOM.
>>> 
>>> Query:  hivejoin.py
>>> 
>>> from pyspark import SparkContext, SparkConf
>>> from pyspark.sql import HiveContext
>>> 
>>> conf = SparkConf().setAppName("Hive_Join")
>>> sc = SparkContext(conf=conf)
>>> hiveCtx = HiveContext(sc)
>>> 
>>> # Join A with B and C and materialize the result into D
>>> # (sql() supersedes the deprecated hql())
>>> hiveCtx.sql("INSERT OVERWRITE TABLE D SELECT <80 columns> FROM A a "
>>>             "INNER JOIN B b ON a.item_id = b.item_id "
>>>             "LEFT JOIN C c ON c.instance_id = a.instance_id")
>>> 
>>> # Sanity check: row count of the joined output
>>> results = hiveCtx.sql("SELECT COUNT(1) FROM D").collect()
>>> print(results)
>>> 
>>> 
>>> Data Study:
>>> 
>>> Number of rows:
>>> 
>>> A table has 1,002,093,508
>>> B table has     5,371,668
>>> C table has         1,000
>>> 
>>> No data skew:
>>> 
>>> item_id is unique in B, and A has multiple rows with the same item_id, so after 
>>> the first INNER JOIN the result set is still 1,002,093,508 rows.
>>> 
>>> instance_id is unique in C, and A has multiple rows with the same instance_id 
>>> (at most 250 rows share the same instance_id).
>>>  
>>> The Spark job runs with 90 executors, each with 2 cores and 6GB memory. YARN 
>>> allotted all of the requested resources immediately, and no other job is running 
>>> on the cluster.
>>> 
>>> spark.storage.memoryFraction     0.6
>>> spark.shuffle.memoryFraction     0.2
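>>> 
>>> (A minimal sketch of how these settings are passed in, assuming they are set 
>>> programmatically via SparkConf with the standard Spark 1.x property names; the 
>>> same values can equally be given as spark-submit options:)
>>> 
>>> conf = (SparkConf()
>>>         .setAppName("Hive_Join")
>>>         .set("spark.executor.instances", "90")
>>>         .set("spark.executor.cores", "2")
>>>         .set("spark.executor.memory", "6g")
>>>         .set("spark.storage.memoryFraction", "0.6")
>>>         .set("spark.shuffle.memoryFraction", "0.2"))
>>> sc = SparkContext(conf=conf)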
>>> 
>>> Stage 2 - reads data from Hadoop; tasks run with NODE_LOCAL locality and 
>>> shuffle-write 500GB of intermediate data
>>> 
>>> Stage 3 - shuffle-reads that 500GB; tasks show PROCESS_LOCAL locality and 
>>> shuffle-write 400GB of output
>>> 
>>> Stage 4 - tasks fail with OOM while reading the shuffled output, after only 
>>> about 40GB has been read
>>> 
>>> First of all, what kinds of Hive queries get better performance on Spark than 
>>> on MapReduce, and which Hive queries won't perform well on Spark?
>>>  
>>> How do we calculate the optimal executor heap size and the number of executors 
>>> for a given input data size? We don't ask the Spark executors to cache any data, 
>>> so how come the Stage 3 tasks show PROCESS_LOCAL? And why does Stage 4 fail 
>>> immediately when it has read only 40GB of data - is it caching data in memory?
>>> 
>>> Also, in a Spark job some stages need a lot of memory for shuffle and some need 
>>> a lot of memory for cache. When an executor has plenty of memory set aside for 
>>> cache that it is not using, but a stage needs a lot of memory for shuffle, will 
>>> the executors use only the shuffle fraction that is configured for shuffling, or 
>>> will they also use the free memory that was reserved for cache?
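>>> 
>>> (A rough back-of-the-envelope sketch of how I understand the per-task shuffle 
>>> budget, assuming the pre-1.6 legacy memory model and the default 
>>> spark.shuffle.safetyFraction of 0.8; please correct me if this is wrong:)
>>> 
>>> executor_mem_gb = 6.0
>>> cores_per_executor = 2           # upper bound on concurrently running tasks
>>> shuffle_fraction = 0.2           # spark.shuffle.memoryFraction
>>> shuffle_safety = 0.8             # spark.shuffle.safetyFraction (default)
>>> 
>>> # Approximate in-memory shuffle budget per task when all task slots are busy
>>> shuffle_gb_per_task = (executor_mem_gb * shuffle_fraction * shuffle_safety
>>>                        / cores_per_executor)
>>> print(shuffle_gb_per_task)       # ~0.48 GB per task before spilling to disk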
>>> 
>>> 
>>> Thanks,
>>> Prabhu Joseph
>> 
>> 
>> 
>> -- 
>> 
>> Architect - Big Data
>> Ph: +91 99805 99458
>> 
>> Manthan Systems | Company of the year - Analytics (2014 Frost and Sullivan 
>> India ICT)
>> +++
> 
