First of all, I think it's great that you're thinking about this. API stability is super important, and it would be good to see Spark get on top of it.
I want to clarify a bit about Hadoop. The problem that Hadoop faces is that the Java package system isn't very flexible. If you have a method in, say, the org.apache.hadoop.hdfs.shortcircuit package that should only ever be used by the org.apache.hadoop.hdfs.client package, there is no way to express that. You have to make the method public. You can hide things by making them package-private, but that only helps when the caller lives in the same package -- which in practice would mean collapsing the entire project into a single giant package, and that is not the road Hadoop devs wanted to go down. So a lot of internal stuff ended up being public. And once things are public, of course, they can be called by anyone.

To get around this limitation, Hadoop came up with a pretty rigorous compatibility policy, discussed here: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/InterfaceClassification.html

The basic idea is that we'd put "interface annotations" on every public class. The "Private" annotation meant that the class was only supposed to be used within the project itself. "Limited-Private" meant the project plus maybe one or two closely related projects. And "Public" was supposed to be the public API. At a finer granularity, for specific public methods, you could add the "VisibleForTesting" annotation to indicate that they were only public to make a unit test possible.

This sounds great in theory. But in practice, users often ignore the annotations and just do whatever they want. This is not because they're mustache-twirling villains, but because they have legitimate (to them) reasons. For example, HBase would often find that they could get better performance by hooking into supposedly private HDFS APIs. Of course, they could always ask HDFS to add public versions of those APIs. But that takes time, and could be contentious. Even in the best case, they'd have to wait for another Hadoop release before HBase could benefit. From their perspective, supporting the feature on more Hadoop releases was better than supporting it on fewer, even if the latter was the "correct" way of doing things. And then of course there were the simple oversights: either there was no interface annotation, or the developer of the downstream project forgot to check it.

Ideally, we'd later add a @stable API and transition everyone to it. But that's much easier said than done. A lot of projects just don't want to change, because it would mean giving up compatibility with older releases that lack the "blessed" API.

Basically, it's a tragedy of the commons. It would be much better for everyone if we all used public, stable APIs and never used private or unstable ones. But each individual project feels that it can gain an advantage by cheating and using (or continuing to use) the private / unstable APIs. Candidly, Spark is one of those projects that continues to use deprecated and private Hadoop APIs -- mostly for compatibility reasons, as I understand it.

I think the lesson learned here is that the compiler needs to be in charge of preventing people from using APIs, not an annotation. Public/private annotations "Just Don't Work." I don't know if Scala provides any mechanisms to do this beyond what Java provides. Even if not, there are probably classloader and CLASSPATH tricks that could be used to hide internals.

I also think it makes sense to put a lot of thought into APIs up front, because changing them later can be very painful. On a related note, there were definitely cases where Hadoop changed an API and the pain outweighed the gain.
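To make the annotation scheme concrete, here is a rough sketch of what it looks like on a class. The class, method, and parameter names below are made up for illustration (this is not real Hadoop code); the annotation types themselves are the ones from org.apache.hadoop.classification, plus Guava's VisibleForTesting. The key point is that none of it is enforced by the compiler:

    import org.apache.hadoop.classification.InterfaceAudience;
    import org.apache.hadoop.classification.InterfaceStability;
    import com.google.common.annotations.VisibleForTesting;

    // "For use by HDFS itself and HBase only" -- but the compiler will
    // happily let any project on the classpath call into this class.
    @InterfaceAudience.LimitedPrivate({"HDFS", "HBase"})
    @InterfaceStability.Unstable
    public class ShortCircuitShmRegistry {          // hypothetical class

        // Has to be public so a sibling Hadoop package can reach it,
        // which also makes it reachable by every downstream project.
        public void allocateSlot(String blockId) {
            // ... internal bookkeeping ...
        }

        // Documents the intent ("tests only"); enforces nothing.
        @VisibleForTesting
        void resetForTests() {
            // ... clear internal state ...
        }
    }

Whether a caller respects "Private" or "Limited-Private" is purely a matter of convention and code review, which is exactly the gap described above.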
There are other dimensions to compatibility... for example, Hadoop currently leaks its CLASSPATH, so that you can't easily write a MapReduce job without using the same versions of Guava (just to pick one random example) that it does. In practice, this led to a pathological fear of updating dependencies, since we didn't want to break users who needed specific versions of their deps. Does Spark also expose its CLASSPATH in this way to executors? I was under the impression that it did.

At some point we will also have to confront the Scala version issue. Will there be flag days where Spark jobs need to be upgraded to a new, incompatible version of Scala to run on the latest Spark? There are pros and cons, but I think users will mostly see the cons.

On Thu, May 29, 2014 at 1:23 PM, Patrick Wendell <pwend...@gmail.com> wrote:

> 1. Hadoop projects don't do any rigorous checking that new patches don't break API's. Of course, this results in regular API breaks and a poor understanding of what is a public API.

I agree with this. We should test these compatibility scenarios, and we don't. It would be awesome to do this in an automated way for Spark.

> 2. In several cases it's not possible to do basic things in Hadoop without using deprecated or private API's.

Disagree. The problem is that we have stable APIs, but users don't want to use them (they prefer the ancient API Doug Cutting wrote in 2008, because it works on some old version of Hadoop). It's hard to argue against this kind of reasoning, since (to reiterate) it's rational from the point of view of the individual. This is the problem with deprecation in general -- once you've let an API out into the wild, it's very difficult to get it back into its cage.

> 3. There is significant vendor fragmentation of API's.

The big difference in the last few years was that some people were creating distributions based on Hadoop 1.x and others were creating distributions based on 2.x. But nobody added vendor-specific APIs (or at least I haven't heard of any). (I can't speak for MapR... since they are proprietary, I have not seen the code.) Now that Hadoop 1.x is starting to die a natural death, any differences between 2.x and 1.x are becoming less important. Sadly, Yahoo continues to use and develop 0.23, for now at least... But I think their efforts are mostly directed at backporting. They have not added divergent APIs, to my knowledge.

best,
Colin

> The main focus of the Hadoop vendors is making consistent cuts of the core projects work together (HDFS/Pig/Hive/etc) - so API breaks are sometimes considered "fixed" as long as the other projects work around them (see [1]). We also regularly need to do archaeology (see [2]) and directly interact with Hadoop committers to understand what API's are stable and in which versions.
>
> One goal of Spark is to deal with the pain of inter-operating with Hadoop so that application writers don't have to. We'd like to retain the property that if you build an application against the (well defined, stable) Spark API's right now, you'll be able to run it across many Hadoop vendors and versions for the entire Spark 1.X release cycle.
>
> Writing apps against Hadoop can be very difficult... consider how much more engineering effort we spent maintaining YARN support than Mesos support. There are many factors, but one is that Mesos has a single, narrow, stable API. We've never had to make a change in our Mesos support due to an API change, for several years. With YARN, on the other hand, there are at least 3 YARN API's that currently exist, which are all binary incompatible. We'd like to offer apps the ability to build against Spark's API and just let us deal with it.
>
> As more vendors package Spark, I'd like to see us put tools in the upstream Spark repo that do validation for vendor packages of Spark, so that we don't end up with fragmentation. Of course, vendors can enhance the API and are encouraged to, but we need a kernel of API's that vendors must maintain (think POSIX) to be considered compliant with Apache Spark. I believe some other projects like OpenStack have done this to avoid fragmentation.
>
> - Patrick
>
> [1] https://issues.apache.org/jira/browse/MAPREDUCE-5830
> [2] http://2.bp.blogspot.com/-GO6HF0OAFHw/UOfNEH-4sEI/AAAAAAAAAD0/dEWFFYTRgYw/s1600/output-file.png
>
> On Sun, May 18, 2014 at 2:13 AM, Mridul Muralidharan <mri...@gmail.com> wrote:
> > So I think I need to clarify a few things here - particularly since this mail went to the wrong mailing list and a much wider audience than I intended it for :-)
> >
> > Most of the issues I mentioned are internal implementation details of spark core: which means we can enhance them in future without disruption to our userbase (ability to support large number of input/output partitions. Note: this is of order of 100k input and output partitions with uniform spread of keys - very rarely seen outside of some crazy jobs).
> >
> > Some of the issues I mentioned would require DeveloperApi changes - which are not user exposed: they would impact developer use of these api's - which are mostly internally provided by spark. (Like fixing blocks > 2G would require a change to the Serializer api.)
> >
> > A smaller fraction might require interface changes - note, I am referring specifically to configuration changes (removing/deprecating some) and possibly newer options to submit/env, etc - I don't envision any programming api change itself. The only api change we did was from Seq -> Iterable - which is actually to address some of the issues I mentioned (join/cogroup).
> >
> > Remaining are bugs which need to be addressed, or the feature removed/enhanced, like shuffle consolidation.
> >
> > There might be semantic extension of some things like OFF_HEAP storage level to address other computation models - but that would not have an impact on end users - since other options would be pluggable with default set to Tachyon so that there is no user expectation change.
> >
> > So will the interface possibly change? Sure, though we will try to keep it backward compatible (as we did with 1.0). Will the api change - other than backward compatible enhancements, probably not.
> >
> > Regards,
> > Mridul
> >
> > On Sun, May 18, 2014 at 12:11 PM, Mridul Muralidharan <mri...@gmail.com> wrote:
> >>
> >> On 18-May-2014 5:05 am, "Mark Hamstra" <m...@clearstorydata.com> wrote:
> >>>
> >>> I don't understand. We never said that interfaces wouldn't change from 0.9
> >>
> >> Agreed.
> >>
> >>> to 1.0. What we are committing to is stability going forward from the 1.0.0 baseline. Nobody is disputing that backward-incompatible behavior or interface changes would be an issue post-1.0.0. The question is whether
> >>
> >> The point is, how confident are we that these are the right set of interface definitions. We think it is, but we could also have gone through a 0.10 to vet the proposed 1.0 changes to stabilize them.
> >>
> >> To give examples for which we don't have solutions currently (which we are facing internally here btw, so not an academic exercise):
> >>
> >> - Current spark shuffle model breaks very badly as number of partitions increases (input and output).
> >>
> >> - As number of nodes increases, the overhead per node keeps going up. Spark currently is more geared towards large memory machines; when the RAM per node is modest (8 to 16 gig) but a large number of them are available, it does not do too well.
> >>
> >> - Current block abstraction breaks as data per block goes beyond 2 gig.
> >>
> >> - Cogroup/join when value per key or number of keys (or both) is high breaks currently.
> >>
> >> - Shuffle consolidation is so badly broken it is not funny.
> >>
> >> - Currently there is no way of effectively leveraging accelerator cards/coprocessors/gpus from spark - to do so, I suspect we will need to redefine OFF_HEAP.
> >>
> >> - Effectively leveraging ssd is still an open question IMO when you have a mix of both available.
> >>
> >> We have resolved some of these and are looking at the rest. These are not unique to our internal usage profile, I have seen most of these asked elsewhere too.
> >>
> >> Thankfully some of the 1.0 changes actually are geared towards helping to alleviate some of the above (the Iterable change for ex), most of the rest are internal impl details of spark core, which helps a lot - but there are cases where this is not so.
> >>
> >> Unfortunately I don't know yet if the unresolved/uninvestigated issues will require more changes or not.
> >>
> >> Given this I am very skeptical of expecting current spark interfaces to be sufficient for the next 1 year (forget 3).
> >>
> >> I understand this is an argument which can be made to never release 1.0 :-) Which is why I was ok with a 1.0 instead of a 0.10 release in spite of my preference.
> >>
> >> This is a good problem to have IMO ... People are using spark extensively and in circumstances that we did not envision: necessitating changes even to spark core.
> >>
> >> But the claim that 1.0 interfaces are stable is not something I buy - they are not, we will need to break them soon and the cost of maintaining backward compatibility will be high.
> >>
> >> We just need to make an informed decision to live with that cost, not hand wave it away.
> >>
> >> Regards
> >> Mridul
> >>
> >>> there is anything apparent now that is expected to require such disruptive changes if we were to commit to the current release candidate as our guaranteed 1.0.0 baseline.
> >>>
> >>> On Sat, May 17, 2014 at 2:05 PM, Mridul Muralidharan <mri...@gmail.com> wrote:
> >>>
> >>> > I would make the case for interface stability, not just api stability. Particularly given that we have significantly changed some of our interfaces, I want to ensure developers/users are not seeing red flags.
> >>> >
> >>> > Bugs and code stability can be addressed in minor releases if found, but behavioral change and/or interface changes would be a much more invasive issue for our users.
> >>> >
> >>> > Regards
> >>> > Mridul
> >>> > On 18-May-2014 2:19 am, "Matei Zaharia" <matei.zaha...@gmail.com> wrote:
> >>> >
> >>> > > As others have said, the 1.0 milestone is about API stability, not about saying "we've eliminated all bugs". The sooner you declare 1.0, the sooner users can confidently build on Spark, knowing that the application they build today will still run on Spark 1.9.9 three years from now. This is something that I've seen done badly (and experienced the effects thereof) in other big data projects, such as MapReduce and even YARN. The result is that you annoy users, you end up with a fragmented userbase where everyone is building against a different version, and you drastically slow down development.
> >>> > >
> >>> > > With a project as fast-growing as Spark in particular, there will be new bugs discovered and reported continuously, especially in the non-core components. Look at the graph of # of contributors in time to Spark: https://www.ohloh.net/p/apache-spark (bottom-most graph; "commits" changed when we started merging each patch as a single commit). This is not slowing down, and we need to have the culture now that we treat API stability and release numbers at the level expected for a 1.0 project, instead of having people come in and randomly change the API.
> >>> > >
> >>> > > I'll also note that the issues marked "blocker" were marked so by their reporters, since the reporter can set the priority. I don't consider stuff like parallelize() not partitioning ranges in the same way as other collections a blocker -- it's a bug, it would be good to fix it, but it only affects a small number of use cases. Of course if we find a real blocker (in particular a regression from a previous version, or a feature that's just completely broken), we will delay the release for that, but at some point you have to say "okay, this fix will go into the next maintenance release". Maybe we need to write a clear policy for what the issue priorities mean.
> >>> > >
> >>> > > Finally, I believe it's much better to have a culture where you can make releases on a regular schedule, and have the option to make a maintenance release in 3-4 days if you find new bugs, than one where you pile up stuff into each release. This is what much larger projects than us, like Linux, do, and it's the only way to avoid indefinite stalling with a large contributor base. In the worst case, if you find a new bug that warrants immediate release, it goes into 1.0.1 a week after 1.0.0 (we can vote on 1.0.1 in three days with just your bug fix in it). And if you find an API that you'd like to improve, just add a new one and maybe deprecate the old one -- at some point we have to respect our users and let them know that code they write today will still run tomorrow.
> >>> > >
> >>> > > Matei
> >>> > >
> >>> > > On May 17, 2014, at 10:32 AM, Kan Zhang <kzh...@apache.org> wrote:
> >>> > >
> >>> > > > +1 on the running commentary here, non-binding of course :-)
> >>> > > >
> >>> > > > On Sat, May 17, 2014 at 8:44 AM, Andrew Ash <and...@andrewash.com> wrote:
> >>> > > >
> >>> > > >> +1 on the next release feeling more like a 0.10 than a 1.0
> >>> > > >> On May 17, 2014 4:38 AM, "Mridul Muralidharan" <mri...@gmail.com> wrote:
> >>> > > >>
> >>> > > >>> I had echoed similar sentiments a while back when there was a discussion around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the api changes, add missing functionality, and go through a hardening release before 1.0.
> >>> > > >>>
> >>> > > >>> But the community preferred a 1.0 :-)
> >>> > > >>>
> >>> > > >>> Regards,
> >>> > > >>> Mridul
> >>> > > >>>
> >>> > > >>> On 17-May-2014 3:19 pm, "Sean Owen" <so...@cloudera.com> wrote:
> >>> > > >>>>
> >>> > > >>>> On this note, non-binding commentary:
> >>> > > >>>>
> >>> > > >>>> Releases happen in local minima of change, usually created by internally enforced code freeze. Spark is incredibly busy now due to external factors -- recently a TLP, recently discovered by a large new audience, ease of contribution enabled by Github. It's getting like the first year of mainstream battle-testing in a month. It's been very hard to freeze anything! I see a number of non-trivial issues being reported, and I don't think it has been possible to triage all of them, even.
> >>> > > >>>>
> >>> > > >>>> Given the high rate of change, my instinct would have been to release 0.10.0 now. But won't it always be very busy? I do think the rate of significant issues will slow down.
> >>> > > >>>>
> >>> > > >>>> Version ain't nothing but a number, but if it has any meaning it's the semantic versioning meaning. 1.0 imposes extra handicaps around striving to maintain backwards-compatibility. That may end up being bent to fit in important changes that are going to be required in this continuing period of change. Hadoop does this all the time unfortunately and gets away with it, I suppose -- minor version releases are really major. (On the other extreme, HBase is at 0.98 and quite production-ready.)
> >>> > > >>>>
> >>> > > >>>> Just consider this a second vote for focus on fixes and 1.0.x rather than new features and 1.x. I think there are a few steps that could streamline triage of this flood of contributions, and make all of this easier, but that's for another thread.
> >>> > > >>>>
> >>> > > >>>> On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
> >>> > > >>>>> +1, but just barely. We've got quite a number of outstanding bugs identified, and many of them have fixes in progress. I'd hate to see those efforts get lost in a post-1.0.0 flood of new features targeted at 1.1.0 -- in other words, I'd like to see 1.0.1 retain a high priority relative to 1.1.0.
> >>> > > >>>>>
> >>> > > >>>>> Looking through the unresolved JIRAs, it doesn't look like any of the identified bugs are show-stoppers or strictly regressions (although I will note that one that I have in progress, SPARK-1749, is a bug that we introduced with recent work -- it's not strictly a regression because we had equally bad but different behavior when the DAGScheduler exceptions weren't previously being handled at all vs. being slightly mis-handled now), so I'm not currently seeing a reason not to release.