Hi Tom,

Not sure how much this helps, but are you aware that you can build Spark with the -Phadoop-provided profile to avoid packaging Hadoop dependencies in the assembly jar?
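Something along these lines should do it (untested sketch; the exact Hadoop profile and version flags are just an example and depend on what you build against):

    # Build Spark without bundling Hadoop classes in the assembly
    # (hadoop profile/version shown here are only illustrative)
    build/mvn -Phadoop-provided -Phadoop-2.6 -Dhadoop.version=2.6.0 \
      -DskipTests clean package

You then supply your own Hadoop client jars on the classpath at runtime.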
-Sandy

On Fri, Aug 14, 2015 at 6:08 AM, Thomas Dudziak <tom...@gmail.com> wrote:

> Unfortunately it doesn't, because our version of Hive has different syntax
> elements and thus I need to patch them in (and a few other minor things).
> It would be great if there were a developer API at a somewhat higher
> level.
>
> On Thu, Aug 13, 2015 at 2:19 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> I believe for Hive there is already a client interface that can be used
>> to build clients for different Hive metastores. That should also work for
>> your heavily forked one.
>>
>> For Hadoop, it is definitely a bigger project to refactor. A good way to
>> start evaluating this is to list what needs to be changed. Maybe you can
>> start by telling us what you need to change for every upgrade? Feel free to
>> email me in private if this is sensitive and you don't want to share it on a
>> public list.
>>
>> On Thu, Aug 13, 2015 at 2:01 PM, Thomas Dudziak <tom...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have asked this before but didn't receive any comments, so with the
>>> impending release of 1.5 I wanted to bring it up again.
>>> Right now, Spark is very tightly coupled with OSS Hive & Hadoop, which
>>> causes me a lot of work every time there is a new version, because I don't
>>> run OSS Hive/Hadoop versions (and before you ask, I can't).
>>>
>>> My question is: does Spark need to be so tightly coupled with these two?
>>> Or, put differently, would it be possible to introduce a developer API
>>> between Spark (up to and including e.g. SQLContext) and Hadoop (for the HDFS
>>> bits) and Hive (e.g. HiveContext and beyond), and move the actual Hadoop &
>>> Hive dependencies into plugins (e.g. separate Maven modules)?
>>> This would allow me to easily maintain my own Hive/Hadoop-ish
>>> integration with our internal systems without ever having to touch Spark
>>> code.
>>> I expect this could also allow, for instance, Hadoop vendors to provide
>>> their own, more optimized implementations without Spark having to know
>>> about them.
>>>
>>> cheers,
>>> Tom
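To make the proposal concrete, here is a rough sketch of what such a developer API could look like (all names below are hypothetical; nothing like this exists in Spark today), with the concrete Hadoop/Hive bindings living in separate Maven modules:

    import org.apache.spark.sql.types.StructType

    // Hypothetical sketch only: Spark would program against narrow traits like
    // these, and a vendor or internal fork would ship its own implementation
    // in a separate module instead of Spark depending on Hive/Hadoop directly.

    /** Metastore/catalog access as Spark would see it. */
    trait MetastoreCatalogPlugin {
      def tableExists(db: String, table: String): Boolean
      def tableSchema(db: String, table: String): StructType
      def createTable(db: String, table: String, schema: StructType): Unit
      def dropTable(db: String, table: String): Unit
    }

    /** Distributed filesystem access as Spark would see it. */
    trait DfsPlugin {
      def open(path: String): java.io.InputStream
      def create(path: String): java.io.OutputStream
      def list(path: String): Seq[String]
    }

A forked Hive or an internal storage system would then only need to implement these traits in its own module, and Spark core would compile against the traits rather than against the Hive and Hadoop jars themselves.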