[jira] [Commented] (HIVE-7292) Hive on Spark

Kiran Lonikar (JIRA) Wed, 12 Nov 2014 21:39:56 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-7292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209307#comment-14209307
 ]


Kiran Lonikar commented on HIVE-7292:
-------------------------------------

Sorry, I have not looked at the code, but I would like to know how the RDD is 
structured. Is it columnar? I am specifically interested in how you preserve 
the columnar structure of ORC, RC, and Parquet files. An RDD is by nature 
row-wise, and SchemaRDD more specifically so.

The Spark SQL component uses SchemaRDD, which is row-wise. Just to be clear, I 
am not reporting any problem with this JIRA. I am only interested in the 
implementation.

I think a columnar structure has its advantages, and that's what Hive 
vectorization did (https://issues.apache.org/jira/browse/HIVE-4160). The 
earlier SQL implementation, Shark, also had some kind of columnar structure. 
I am not sure that Hive on Spark preserves it.
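
To make the question concrete, here is a toy sketch of what I mean by row-wise 
versus columnar. It is plain Python, not Hive or Spark code; the names rows, 
to_columns and columns are made up for illustration and do not come from 
either codebase.

```python
# Row-wise layout: one tuple per record (illustrative data only).
rows = [
    (1, "a", 10.0),
    (2, "b", 20.0),
    (3, "c", 30.0),
]


def to_columns(rows, names):
    """Transpose a list of row tuples into a dict of per-column lists."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


# Columnar layout: one list per column, similar in spirit to a column batch.
columns = to_columns(rows, ["id", "key", "value"])

# Aggregating a single column:
# the row-wise version walks every record and picks out one field,
# the columnar version scans one list and never touches the other columns.
row_sum = sum(row[2] for row in rows)
col_sum = sum(columns["value"])
```

Both sums give the same answer. The difference I am asking about is only in 
how the data is laid out and scanned.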


> Hive on Spark
> -------------
>
>                 Key: HIVE-7292
>                 URL: https://issues.apache.org/jira/browse/HIVE-7292
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>              Labels: Spark-M1, Spark-M2, Spark-M3, Spark-M4, Spark-M5
>         Attachments: Hive-on-Spark.pdf
>
>
> Spark as an open-source data analytics cluster computing framework has gained 
> significant momentum recently. Many Hive users already have Spark installed 
> as their computing backbone. To take advantage of Hive, they still need to 
> have either MapReduce or Tez on their cluster. This initiative will provide 
> users a new alternative so that they can consolidate their backend. 
> Secondly, providing such an alternative further increases Hive's adoption as 
> it exposes Spark users to a viable, feature-rich, de facto standard SQL tool 
> on Hadoop.
> Finally, allowing Hive to run on Spark also has performance benefits. Hive 
> queries, especially those involving multiple reducer stages, will run faster, 
> thus improving user experience as Tez does.
> This is an umbrella JIRA which will cover many coming subtasks. Design doc 
> will be attached here shortly, and will be on the wiki as well. Feedback from 
> the community is greatly appreciated!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
