I think it's important to separate the goals from the implementation.
I agree with Matei on the goal - it needs to be that people can
download Apache Spark and use it with CDH, HDP, MapR, whatever...
This is the whole reason why HDFS and YARN have stable APIs, so that
other projects can build on them in a way that works across multiple
versions. I wouldn't want to force users to upgrade only according to
some vendor's timetable; from the ASF perspective that doesn't seem
like a good thing for the project. If users want to get packages from
Bigtop, or from the vendors, that's totally fine too.

My point earlier was that I'm not sure we are actually accomplishing
that goal now: I've heard that in some cases our "Hadoop 2.X"
packages don't work on certain distributions, even ones based on that
same Hadoop version. So one solution is to move towards "bring your
own Hadoop" binaries: have users just set HADOOP_HOME, and document
any vendor-specific configs that need to be set. That also happens to
solve the "too many binaries" problem, but only incidentally.
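
To make that concrete, here is a rough sketch of what the user-facing
setup could look like with a "Hadoop provided" build - the paths and
the SPARK_DIST_CLASSPATH bit are illustrative only, not something
we've settled on:

    # conf/spark-env.sh -- sketch only; adjust paths for your distro
    export HADOOP_HOME=/opt/hadoop                      # wherever the vendor installed Hadoop
    export HADOOP_CONF_DIR="${HADOOP_HOME}/etc/hadoop"  # cluster configuration
    # Put the distro's own Hadoop jars on Spark's classpath
    export SPARK_DIST_CLASSPATH="$("${HADOOP_HOME}/bin/hadoop" classpath)"

Any distro-specific quirks (I believe HDP needs an hdp.version
property set for YARN, for example) would then live in per-distro
instructions rather than in our binaries.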

- Patrick

On Sun, Mar 8, 2015 at 4:07 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> Our goal is to let people use the latest Apache release even if vendors fall 
> behind or don't want to package everything, so that's why we put out releases 
> for vendors' versions. It's fairly low overhead.
>
> Matei
>
>> On Mar 8, 2015, at 5:56 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> Ah. I misunderstood that Matei was referring to the Scala 2.11 tarball
>> at http://people.apache.org/~pwendell/spark-1.3.0-rc3/ and not the
>> Maven artifacts.
>>
>> Patrick I see you just commented on SPARK-5134 and will follow up
>> there. Sounds like this may accidentally not be a problem.
>>
>> On binary tarball releases, I wonder if anyone else has a view on my
>> opinion that these shouldn't be distributed for specific Hadoop
>> *distributions* to begin with. (Won't repeat the argument here yet.)
>> That resolves this n x m explosion too.
>>
>> Vendors already provide their own distribution, yes, that's their job.
>>
>>
>> On Sun, Mar 8, 2015 at 9:42 PM, Krishna Sankar <ksanka...@gmail.com> wrote:
>>> Yep, otherwise this will become an N^2 problem - Scala versions X Hadoop
>>> Distributions X ...
>>>
>>> Maybe one option is to have a minimum basic set (which I know is what we
>>> are discussing) and move the rest to spark-packages.org. There the vendors
>>> can add the latest downloads - for example, when 1.4 is released, HDP can
>>> build an HDP Spark 1.4 bundle.
>>>
>>> Cheers
>>> <k/>
>>>
>>> On Sun, Mar 8, 2015 at 2:11 PM, Patrick Wendell <pwend...@gmail.com> wrote:
>>>>
>>>> We probably want to revisit the way we do binaries in general for
>>>> 1.4+. IMO, something worth forking a separate thread for.
>>>>
>>>> I've been hesitating to add new binaries because people
>>>> (understandably) complain if you ever stop packaging older ones, but
>>>> on the other hand the ASF has complained that we have too many
>>>> binaries already and that we need to pare them down because of the large
>>>> volume of files. Doubling the number of binaries we produce for Scala
>>>> 2.11 seemed like it would be too much.
>>>>
>>>> One potential solution is to package "Hadoop provided" binaries and
>>>> encourage users to use these by simply setting HADOOP_HOME, or to
>>>> provide instructions for specific distros. I've heard that our
>>>> existing packages don't work well on HDP, for instance, since there
>>>> are some configuration quirks that differ from the upstream Hadoop.
>>>>
>>>> If we cut down on the cross building for Hadoop versions, then it is
>>>> more tenable to cross build for Scala versions without exploding the
>>>> number of binaries.
>>>>
>>>> - Patrick
>>>>
>>>> On Sun, Mar 8, 2015 at 12:46 PM, Sean Owen <so...@cloudera.com> wrote:
>>>>> Yeah, interesting question of what is the better default for the
>>>>> single set of artifacts published to Maven. I think there's an
>>>>> argument for Hadoop 2 and perhaps Hive for the 2.10 build too. Pros
>>>>> and cons discussed more at
>>>>>
>>>>> https://issues.apache.org/jira/browse/SPARK-5134
>>>>> https://github.com/apache/spark/pull/3917
>>>>>
>>>>> On Sun, Mar 8, 2015 at 7:42 PM, Matei Zaharia <matei.zaha...@gmail.com>
>>>>> wrote:
>>>>>> +1
>>>>>>
>>>>>> Tested it on Mac OS X.
>>>>>>
>>>>>> One small issue I noticed is that the Scala 2.11 build is using Hadoop
>>>>>> 1 without Hive, which is kind of weird because people will more likely
>>>>>> want Hadoop 2 with Hive. So it would be good to publish a build for
>>>>>> that configuration instead. We can do it if we do a new RC, or it may
>>>>>> be that binary builds don't need to be voted on (I forget the details
>>>>>> there).
>>>>>>
>>>>>> Matei
>>>>
>>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
