Right, the scenario is, for example, that a class is added in release
2.5.0 but has been back-ported to a 2.4.1-based release. That release
isn't missing anything from stock 2.4.1, but a reported version of
"2.4.1" doesn't reliably tell you whether or not the class is there.

By the way, I just found there is already such a class,
org.apache.hadoop.util.VersionInfo:

https://github.com/apache/hadoop-common/blob/release-2.4.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/VersionInfo.java

It appears to have been around for a long time. Theoretical problems
aside, there may be cases where querying the version is a fine and
reliable solution.
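A reflective probe along these lines compiles without Hadoop on the
classpath and degrades gracefully when the class is absent (a sketch,
not Spark's actual code; VersionInfo.getVersion() is the only real API
assumed here):

```java
import java.lang.reflect.Method;

public class HadoopVersionProbe {
    // Reflectively look up org.apache.hadoop.util.VersionInfo.getVersion().
    // Returns null when Hadoop is not on the classpath, so callers can
    // fall back to feature detection instead of version parsing.
    public static String hadoopVersionOrNull() {
        try {
            Class<?> info = Class.forName("org.apache.hadoop.util.VersionInfo");
            Method getVersion = info.getMethod("getVersion");
            return (String) getVersion.invoke(null);
        } catch (ReflectiveOperationException e) {
            return null;
        }
    }
}
```

The null return is the point: code that needs a specific Hadoop feature
should treat "version unknown" the same as "feature possibly absent".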

On Jul 28, 2014 12:54 AM, "Matei Zaharia" <matei.zaha...@gmail.com> wrote:
>
> We could also do this, though it would be great if the Hadoop project 
> provided this version number as at least a baseline. It's up to distributors 
> to decide which version they report but I imagine they won't remove stuff 
> that's in the reported version number.
>
> Matei
>
> On Jul 27, 2014, at 1:57 PM, Sean Owen <so...@cloudera.com> wrote:
>
> > Good idea, although it gets difficult in the context of multiple
> > distributions. Say change X is not present in version A, but present
> > in version B. If you depend on X, what version can you look for to
> > detect it? The distribution will return "A" or "A+X" or somesuch, but
> > testing for "A" will give an incorrect answer, and the code can't be
> > expected to look for everyone's "A+X" versions. Actually inspecting
> > the code is more robust if a bit messier.
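Inspecting the code rather than the version string amounts to feature
detection by class presence; a minimal sketch (the helper name is
illustrative, not anything Spark actually ships):

```java
public class FeatureProbe {
    // Detect a capability by checking for the class that provides it,
    // rather than parsing a distribution's version string. The "false"
    // argument skips class initialization, so the probe has no side effects.
    public static boolean classPresent(String className) {
        try {
            Class.forName(className, false, FeatureProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }
}
```

This answers "is change X here?" directly, so it works the same against
"A", "A+X", or any vendor version string.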
> >
> > On Sun, Jul 27, 2014 at 9:50 PM, Matei Zaharia <matei.zaha...@gmail.com> 
> > wrote:
> >> For this particular issue, it would be good to know if Hadoop provides an 
> >> API to determine the Hadoop version. If not, maybe that can be added to 
> >> Hadoop in its next release, and we can check for it with reflection. We 
> >> recently added a SparkContext.version() method in Spark to let you tell 
> >> the version.
> >>
> >> Matei
> >>
> >> On Jul 27, 2014, at 12:19 PM, Patrick Wendell <pwend...@gmail.com> wrote:
> >>
> >>> Hey Ted,
> >>>
> >>> We always intend Spark to work with the newer Hadoop versions and
> >>> encourage Spark users to use the newest Hadoop versions for best
> >>> performance.
> >>>
> >>> We do try to be liberal in terms of supporting older versions as well.
> >>> This is because many people run older HDFS versions and we want Spark
> >>> to read and write data from them. So far we've been willing to do this
> >>> despite some maintenance cost.
> >>>
> >>> The reason is that for many users it's very expensive to do a
> >>> wholesale upgrade of HDFS, but trying out new versions of Spark is
> >>> much easier. For instance, some of the largest scale Spark users run
> >>> fairly old or forked HDFS versions.
> >>>
> >>> - Patrick
> >>>
> >>> On Sun, Jul 27, 2014 at 12:01 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> >>>> Thanks for replying, Patrick.
> >>>>
> >>>> The intention of my first email was to utilize newer hadoop releases
> >>>> for their bug fixes. I am still looking for a clean way of passing the
> >>>> hadoop release version number to individual classes.
> >>>> Using newer hadoop releases would encourage pushing bug fixes / new
> >>>> features upstream. Ultimately Spark code would become cleaner.
> >>>>
> >>>> Cheers
> >>>>
> >>>> On Sun, Jul 27, 2014 at 8:52 AM, Patrick Wendell <pwend...@gmail.com> 
> >>>> wrote:
> >>>>
> >>>>> Ted - technically I think you are correct, although I wouldn't
> >>>>> recommend disabling this lock. This lock is not expensive (acquired
> >>>>> once per task, as are many other locks already). Also, we've seen some
> >>>>> cases where Hadoop concurrency bugs ended up requiring multiple fixes
> >>>>> - concurrency of client access is not well tested in the Hadoop
> >>>>> codebase since most of the Hadoop tools do not use concurrent access.
> >>>>> So in general it's good to be conservative in what we expect of the
> >>>>> Hadoop client libraries.
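The lock Patrick describes is a coarse guard serializing Configuration
construction across threads; a rough sketch of the pattern, in the
spirit of HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK (names here are
illustrative, not Spark's actual code):

```java
import java.util.function.Supplier;

public class ConfGuard {
    // One shared lock for all callers: Configuration construction has had
    // concurrency bugs historically, so we serialize it rather than trust
    // the Hadoop client libraries to be thread-safe here.
    private static final Object CONF_LOCK = new Object();

    public static <T> T withConfLock(Supplier<T> makeConf) {
        synchronized (CONF_LOCK) {
            return makeConf.get();
        }
    }
}
```

Acquired once per task, the cost is negligible next to the task itself,
which is why being conservative here is cheap.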
> >>>>>
> >>>>> If you'd like to discuss this further, please fork a new thread, since
> >>>>> this is a vote thread. Thanks!
> >>>>>
> >>>>> On Fri, Jul 25, 2014 at 10:14 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> >>>>>> HADOOP-10456 is fixed in hadoop 2.4.1
> >>>>>>
> >>>>>> Does this mean that synchronization
> >>>>>> on HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK can be bypassed for 
> >>>>>> hadoop
> >>>>>> 2.4.1 ?
> >>>>>>
> >>>>>> Cheers
> >>>>>>
> >>>>>>
> >>>>>> On Fri, Jul 25, 2014 at 6:00 PM, Patrick Wendell <pwend...@gmail.com>
> >>>>> wrote:
> >>>>>>
> >>>>>>> The most important issue in this release is actually an amendment to
> >>>>>>> an earlier fix. The original fix caused a deadlock which was a
> >>>>>>> regression from 1.0.0->1.0.1:
> >>>>>>>
> >>>>>>> Issue:
> >>>>>>> https://issues.apache.org/jira/browse/SPARK-1097
> >>>>>>>
> >>>>>>> 1.0.1 Fix:
> >>>>>>> https://github.com/apache/spark/pull/1273/files (had a deadlock)
> >>>>>>>
> >>>>>>> 1.0.2 Fix:
> >>>>>>> https://github.com/apache/spark/pull/1409/files
> >>>>>>>
> >>>>>>> I failed to correctly label this on JIRA, but I've updated it!
> >>>>>>>
> >>>>>>> On Fri, Jul 25, 2014 at 5:35 PM, Michael Armbrust
> >>>>>>> <mich...@databricks.com> wrote:
> >>>>>>>> That query is looking at "Fix Version" not "Target Version".  The 
> >>>>>>>> fact
> >>>>>>> that
> >>>>>>>> the first one is still open is only because the bug is not resolved 
> >>>>>>>> in
> >>>>>>>> master.  It is fixed in 1.0.2.  The second one is partially fixed in
> >>>>>>> 1.0.2,
> >>>>>>>> but is not worth blocking the release for.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Fri, Jul 25, 2014 at 4:23 PM, Nicholas Chammas <
> >>>>>>>> nicholas.cham...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> TD, there are a couple of unresolved issues slated for 1.0.2
> >>>>>>>>> <
> >>>>>>>>>
> >>>>>>>
> >>>>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%201.0.2%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC
> >>>>>>>>>> .
> >>>>>>>>> Should they be edited somehow?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Fri, Jul 25, 2014 at 7:08 PM, Tathagata Das <
> >>>>>>>>> tathagata.das1...@gmail.com>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Please vote on releasing the following candidate as Apache Spark
> >>>>>>> version
> >>>>>>>>>> 1.0.2.
> >>>>>>>>>>
> >>>>>>>>>> This release fixes a number of bugs in Spark 1.0.1.
> >>>>>>>>>> Some of the notable ones are
> >>>>>>>>>> - SPARK-2452: Known issue in Spark 1.0.1 caused by an attempted fix
> >>>>> for
> >>>>>>>>>> SPARK-1199. The fix was reverted for 1.0.2.
> >>>>>>>>>> - SPARK-2576: NoClassDefFoundError when executing Spark QL query on
> >>>>>>>>>> HDFS CSV file.
> >>>>>>>>>> The full list is at http://s.apache.org/9NJ
> >>>>>>>>>>
> >>>>>>>>>> The tag to be voted on is v1.0.2-rc1 (commit 8fb6f00e):
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=8fb6f00e195fb258f3f70f04756e07c259a2351f
> >>>>>>>>>>
> >>>>>>>>>> The release files, including signatures, digests, etc can be found
> >>>>> at:
> >>>>>>>>>> http://people.apache.org/~tdas/spark-1.0.2-rc1/
> >>>>>>>>>>
> >>>>>>>>>> Release artifacts are signed with the following key:
> >>>>>>>>>> https://people.apache.org/keys/committer/tdas.asc
> >>>>>>>>>>
> >>>>>>>>>> The staging repository for this release can be found at:
> >>>>>>>>>>
> >>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1024/
> >>>>>>>>>>
> >>>>>>>>>> The documentation corresponding to this release can be found at:
> >>>>>>>>>> http://people.apache.org/~tdas/spark-1.0.2-rc1-docs/
> >>>>>>>>>>
> >>>>>>>>>> Please vote on releasing this package as Apache Spark 1.0.2!
> >>>>>>>>>>
> >>>>>>>>>> The vote is open until Tuesday, July 29, at 23:00 UTC and passes if
> >>>>>>>>>> a majority of at least 3 +1 PMC votes are cast.
> >>>>>>>>>> [ ] +1 Release this package as Apache Spark 1.0.2
> >>>>>>>>>> [ ] -1 Do not release this package because ...
> >>>>>>>>>>
> >>>>>>>>>> To learn more about Apache Spark, please see
> >>>>>>>>>> http://spark.apache.org/
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>
> >>
>
