I believe for Hive, there is already a client interface that can be used to
build clients for different Hive metastores. That should also work for your
heavily forked one.
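
To be concrete, the usual way to exercise that interface from the user side is
through the documented spark.sql.hive.metastore.version and
spark.sql.hive.metastore.jars settings, which load the metastore client from a
separate classpath in an isolated classloader. A minimal sketch (the version
string and jar path below are placeholders, not a tested setup for your fork):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object CustomMetastoreExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("custom-metastore-example")
      // Version of the Hive metastore the client should talk to.
      .set("spark.sql.hive.metastore.version", "0.13.1")
      // Classpath with the (forked) Hive client jars; Spark loads these in
      // an isolated classloader so they don't clash with its built-in Hive.
      .set("spark.sql.hive.metastore.jars", "/path/to/forked/hive/lib/*")

    val sc = new SparkContext(conf)
    val hive = new HiveContext(sc)

    // Any metastore-backed operation goes through the pluggable client.
    hive.sql("SHOW TABLES").show()

    sc.stop()
  }
}

Whether the isolated client loader copes with a heavily forked metastore is
exactly the open question, but those are the knobs to start with.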

For Hadoop, it is definitely a bigger project to refactor. A good way to
start evaluating this is to list what needs to be changed. Maybe you can
start by telling us what you need to change for every upgrade? Feel free to
email me privately if this is sensitive and you don't want to share it on a
public list.

On Thu, Aug 13, 2015 at 2:01 PM, Thomas Dudziak <tom...@gmail.com> wrote:

> Hi,
>
> I have asked this before but didn't receive any comments, but with the
> impending release of 1.5 I wanted to bring this up again.
> Right now, Spark is very tightly coupled with OSS Hive & Hadoop which
> causes me a lot of work every time there is a new version because I don't
> run OSS Hive/Hadoop versions (and before you ask, I can't).
>
> My question is, does Spark need to be so tightly coupled with these two?
> Or put differently, would it be possible to introduce a developer API
> between Spark (up to and including e.g. SqlContext) and Hadoop (for HDFS
> bits) and Hive (e.g. HiveContext and beyond), and move the actual Hadoop &
> Hive dependencies into plugins (e.g. separate maven modules)?
> This would allow me to easily maintain my own Hive/Hadoop-ish integration
> with our internal systems without ever having to touch Spark code.
> I expect this could also allow, for instance, Hadoop vendors to provide
> their own, more optimized implementations without Spark having to know
> about them.
>
> cheers,
> Tom
>
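
To make the quoted proposal a bit more concrete, here is a purely hypothetical
sketch of the kind of plugin API it describes. None of these names exist in
Spark today; they only illustrate the shape of the idea (narrow traits that
Spark would code against, with the concrete Hadoop/Hive bindings shipped as
separate maven modules):

import java.io.{InputStream, OutputStream}

// Hypothetical storage abstraction covering the "HDFS bits".
trait StorageClient {
  def open(path: String): InputStream
  def create(path: String): OutputStream
  def listFiles(path: String): Seq[String]
}

// Hypothetical catalog abstraction covering the metastore side of HiveContext.
trait CatalogClient {
  def listTables(database: String): Seq[String]
  def getTableLocation(database: String, table: String): String
}

// A vendor (or an internal fork) would ship implementations of these traits
// in its own maven module; Spark would load them reflectively by class name.
object PluginLoader {
  def loadStorage(className: String): StorageClient =
    Class.forName(className).newInstance().asInstanceOf[StorageClient]

  def loadCatalog(className: String): CatalogClient =
    Class.forName(className).newInstance().asInstanceOf[CatalogClient]
}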
