Hi Tom,

Not sure how much this helps, but are you aware that you can build Spark with the -Phadoop-provided profile to avoid packaging Hadoop dependencies in the assembly jar?
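Something along these lines should do it (untested sketch; the exact Hadoop profile and version flags are just an example and depend on what you build against):

    # Build Spark without bundling Hadoop classes in the assembly
    # (hadoop profile/version shown here are only illustrative)
    build/mvn -Phadoop-provided -Phadoop-2.6 -Dhadoop.version=2.6.0 \
      -DskipTests clean package

You then supply your own Hadoop client jars on the classpath at runtime.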
-Sandy

On Fri, Aug 14, 2015 at 6:08 AM, Thomas Dudziak <tom...@gmail.com> wrote:

> Unfortunately it doesn't, because our version of Hive has different syntax
> elements and thus I need to patch them in (and a few other minor things).
> It would be great if there were a developer API at a somewhat higher
> level.
>
> On Thu, Aug 13, 2015 at 2:19 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> I believe for Hive there is already a client interface that can be used
>> to build clients for different Hive metastores. That should also work for
>> your heavily forked one.
>>
>> For Hadoop, it is definitely a bigger project to refactor. A good way to
>> start evaluating this is to list what needs to be changed. Maybe you can
>> start by telling us what you need to change for every upgrade? Feel free to
>> email me in private if this is sensitive and you don't want to share it on a
>> public list.
>>
>> On Thu, Aug 13, 2015 at 2:01 PM, Thomas Dudziak <tom...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have asked this before but didn't receive any comments, so with the
>>> impending release of 1.5 I wanted to bring it up again.
>>> Right now, Spark is very tightly coupled with OSS Hive & Hadoop, which
>>> causes me a lot of work every time there is a new version, because I don't
>>> run OSS Hive/Hadoop versions (and before you ask, I can't).
>>>
>>> My question is: does Spark need to be so tightly coupled with these two?
>>> Or, put differently, would it be possible to introduce a developer API
>>> between Spark (up to and including e.g. SQLContext) and Hadoop (for the HDFS
>>> bits) and Hive (e.g. HiveContext and beyond), and move the actual Hadoop &
>>> Hive dependencies into plugins (e.g. separate Maven modules)?
>>> This would allow me to easily maintain my own Hive/Hadoop-ish
>>> integration with our internal systems without ever having to touch Spark
>>> code.
>>> I expect this could also allow, for instance, Hadoop vendors to provide
>>> their own, more optimized implementations without Spark having to know
>>> about them.
>>>
>>> cheers,
>>> Tom
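To make the proposal concrete, here is a rough sketch of what such a developer API could look like (all names below are hypothetical; nothing like this exists in Spark today), with the concrete Hadoop/Hive bindings living in separate Maven modules:

    import org.apache.spark.sql.types.StructType

    // Hypothetical sketch only: Spark would program against narrow traits like
    // these, and a vendor or internal fork would ship its own implementation
    // in a separate module instead of Spark depending on Hive/Hadoop directly.

    /** Metastore/catalog access as Spark would see it. */
    trait MetastoreCatalogPlugin {
      def tableExists(db: String, table: String): Boolean
      def tableSchema(db: String, table: String): StructType
      def createTable(db: String, table: String, schema: StructType): Unit
      def dropTable(db: String, table: String): Unit
    }

    /** Distributed filesystem access as Spark would see it. */
    trait DfsPlugin {
      def open(path: String): java.io.InputStream
      def create(path: String): java.io.OutputStream
      def list(path: String): Seq[String]
    }

A forked Hive or an internal storage system would then only need to implement these traits in its own module, and Spark core would compile against the traits rather than against the Hive and Hadoop jars themselves.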