Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Sandy Ryza
This seems like a good idea.

An area that wasn't listed, but that I think could strongly benefit from
maintainers, is the build.  Having consistent oversight over Maven, SBT,
and dependencies would allow us to avoid subtle breakages.

Component maintainers have come up several times within the Hadoop project,
and I think one of the main reasons the proposals have been rejected is
that, structurally, its effect is to slow down development.  As you
mention, this is somewhat mitigated if being a maintainer leads committers
to take on more responsibility, but it might be worthwhile to draw up more
specific ideas on how to combat this?  E.g. do obvious changes, doc fixes,
test fixes, etc. always require a maintainer?

-Sandy

On Wed, Nov 5, 2014 at 5:36 PM, Michael Armbrust 
wrote:

> +1 (binding)
>
> On Wed, Nov 5, 2014 at 5:33 PM, Matei Zaharia 
> wrote:
>
> > BTW, my own vote is obviously +1 (binding).
> >
> > Matei
> >
> > > On Nov 5, 2014, at 5:31 PM, Matei Zaharia 
> > wrote:
> > >
> > > Hi all,
> > >
> > > I wanted to share a discussion we've been having on the PMC list, as
> > well as call for an official vote on it on a public list. Basically, as
> the
> > Spark project scales up, we need to define a model to make sure there is
> > still great oversight of key components (in particular internal
> > architecture and public APIs), and to this end I've proposed
> implementing a
> > maintainer model for some of these components, similar to other large
> > projects.
> > >
> > > As background on this, Spark has grown a lot since joining Apache.
> We've
> > had over 80 contributors/month for the past 3 months, which I believe
> makes
> > us the most active project in contributors/month at Apache, as well as
> over
> > 500 patches/month. The codebase has also grown significantly, with new
> > libraries for SQL, ML, graphs and more.
> > >
> > > In this kind of large project, one common way to scale development is
> to
> > assign "maintainers" to oversee key components, where each patch to that
> > component needs to get sign-off from at least one of its maintainers.
> Most
> > existing large projects do this -- at Apache, some large ones with this
> > model are CloudStack (the second-most active project overall),
> Subversion,
> > and Kafka, and other examples include Linux and Python. This is also
> > by-and-large how Spark operates today -- most components have a de-facto
> > maintainer.
> > >
> > > IMO, adopting this model would have two benefits:
> > >
> > > 1) Consistent oversight of design for that component, especially
> > regarding architecture and API. This process would ensure that the
> > component's maintainers see all proposed changes and consider them to fit
> > together in a good way.
> > >
> > > 2) More structure for new contributors and committers -- in particular,
> > it would be easy to look up who’s responsible for each module and ask
> them
> > for reviews, etc, rather than having patches slip between the cracks.
> > >
> > > We'd like to start with this in a light-weight manner, where the model only
> > applies to certain key components (e.g. scheduler, shuffle) and
> user-facing
> > APIs (MLlib, GraphX, etc). Over time, as the project grows, we can expand
> > it if we deem it useful. The specific mechanics would be as follows:
> > >
> > > - Some components in Spark will have maintainers assigned to them,
> where
> > one of the maintainers needs to sign off on each patch to the component.
> > > - Each component with maintainers will have at least 2 maintainers.
> > > - Maintainers will be assigned from the most active and knowledgeable
> > committers on that component by the PMC. The PMC can vote to add / remove
> > maintainers, and maintained components, through consensus.
> > > - Maintainers are expected to be active in responding to patches for
> > their components, though they do not need to be the main reviewers for
> them
> > (e.g. they might just sign off on architecture / API). To prevent
> inactive
> > maintainers from blocking the project, if a maintainer isn't responding
> in
> > a reasonable time period (say 2 weeks), other committers can merge the
> > patch, and the PMC will want to discuss adding another maintainer.
> > >
> > > If you'd like to see examples for this model, check out the following
> > projects:
> > > - CloudStack:
> >
> https://cwiki.apache.org/confluence/display/CLOUDSTACK/CloudStack+Maintainers+Guide
> > <
> >
> https://cwiki.apache.org/confluence/display/CLOUDSTACK/CloudStack+Maintainers+Guide
> > >
> > > - Subversion:
> > https://subversion.apache.org/docs/community-guide/roles.html <
> > https://subversion.apache.org/docs/community-guide/roles.html>
> > >
> > > Finally, I wanted to list our current proposal for initial components
> > and maintainers. It would be good to get feedback on other components we
> > might add, but please note that personnel discussions (e.g. "I don't
> think
> > Matei should maintain *that* component") should only happen on the private list.

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Sandy Ryza
It looks like the difference between the proposed Spark model and the
CloudStack / SVN model is:
* In the former, maintainers / partial committers are a way of centralizing
oversight over particular components among committers
* In the latter, maintainers / partial committers are a way of giving
non-committers some power to make changes

-Sandy

On Thu, Nov 6, 2014 at 5:17 PM, Corey Nolet  wrote:

> PMC [1] is responsible for oversight and does not designate partial or full
> committer. There are projects where all committers become PMC and others
> where PMC is reserved for committers with the most merit (and willingness
> to take on the responsibility of project oversight, releases, etc...).
> The community maintains the codebase through its committers, who are expected
> to mentor, roll in patches, and spread the project throughout other communities.
>
> Adding someone's name to a list as a "maintainer" is not a barrier. With a
> community as large as Spark's, and myself not being a committer on this
> project, I see it as a welcome opportunity to find a mentor in the areas in
> which I'm interested in contributing. We'd expect the list of names to grow
> as more volunteers gain more interest, correct? To me, that seems quite
> contrary to a "barrier".
>
> [1] http://www.apache.org/dev/pmc.html
>
>
> On Thu, Nov 6, 2014 at 7:49 PM, Matei Zaharia 
> wrote:
>
> > So I don't understand, Greg, are the partial committers committers, or
> are
> > they not? Spark also has a PMC, but our PMC currently consists of all
> > committers (we decided not to have a differentiation when we left the
> > incubator). I see the Subversion partial committers listed as
> "committers"
> > on https://people.apache.org/committers-by-project.html#subversion, so I
> > assume they are committers. As far as I can see, CloudStack is similar.
> >
> > Matei
> >
> > > On Nov 6, 2014, at 4:43 PM, Greg Stein  wrote:
> > >
> > > Partial committers are people invited to work on a particular area, and
> > they do not require sign-off to work on that area. They can get a
> sign-off
> > and commit outside that area. That approach doesn't compare to this
> > proposal.
> > >
> > > Full committers are PMC members. As each PMC member is responsible for
> > *every* line of code, then every PMC member should have complete rights
> to
> > every line of code. Creating disparity flies in the face of a PMC
> member's
> > responsibility. If I am a Spark PMC member, then I have responsibility
> for
> > GraphX code, whether my name is Ankur, Joey, Reynold, or Greg. And
> > interposing a barrier inhibits my responsibility to ensure GraphX is
> > designed, maintained, and delivered to the Public.
> > >
> > > Cheers,
> > > -g
> > >
> > > (and yes, I'm aware of COMMITTERS; I've been changing that file for the
> > past 12 years :-) )
> > >
> > > On Thu, Nov 6, 2014 at 6:28 PM, Patrick Wendell  > > wrote:
> > > In fact, if you look at the subversion commiter list, the majority of
> > > people here have commit access only for particular areas of the
> > > project:
> > >
> > > http://svn.apache.org/repos/asf/subversion/trunk/COMMITTERS <
> > http://svn.apache.org/repos/asf/subversion/trunk/COMMITTERS>
> > >
> > > On Thu, Nov 6, 2014 at 4:26 PM, Patrick Wendell  > > wrote:
> > > > Hey Greg,
> > > >
> > > > Regarding subversion - I think the reference is to partial vs full
> > > > committers here:
> > > > https://subversion.apache.org/docs/community-guide/roles.html <
> > https://subversion.apache.org/docs/community-guide/roles.html>
> > > >
> > > > - Patrick
> > > >
> > > > On Thu, Nov 6, 2014 at 4:18 PM, Greg Stein   > gst...@gmail.com>> wrote:
> > > >> -1 (non-binding)
> > > >>
> > > >> This is an idea that runs COMPLETELY counter to the Apache Way, and
> is
> > > >> to be severely frowned upon. This creates *unequal* ownership of the
> > > >> codebase.
> > > >>
> > > >> Each Member of the PMC should have *equal* rights to all areas of
> the
> > > >> codebase under their purview. It should not be subjected to others'
> > > >> "ownership" except through the standard mechanisms of reviews and,
> > > >> if/when absolutely necessary, vetoes.
> > > >>
> > > >> Apache does not want "leads", "benevolent dictators" or "assigned
> > > >> maintainers", no matter how you may dress it up with multiple
> > > >> maintainers per component. The fact is that this creates an unequal
> > > >> level of ownership and responsibility. The Board has shut down
> > > >> projects that attempted or allowed for "Leads". Just a few months
> ago,
> > > >> there was a problem with somebody calling themself a "Lead".
> > > >>
> > > >> I don't know why you suggest that Apache Subversion does this. We
> > > >> absolutely do not. Never have. Never will. The Subversion codebase
> is
> > > >> owned by all of us, and we all care for every line of it. Some
> people
> > > >> know more than others, of course. But any one of us, can change any
> > > >> part, without

proposal / discuss: multiple Serializers within a SparkContext?

2014-11-07 Thread Sandy Ryza
Hey all,

Was messing around with Spark and Google FlatBuffers for fun, and it got me
thinking about Spark and serialization.  I know there's been work / talk
about in-memory columnar formats for Spark SQL, so maybe there are ways to
provide this flexibility already that I've missed?  Either way, my thoughts:

Java and Kryo serialization are really nice in that they require almost no
extra work on the part of the user.  They can also represent complex object
graphs with cycles etc.

There are situations where other serialization frameworks are more
efficient:
* A Hadoop Writable style format that delineates key-value boundaries and
allows for raw comparisons can greatly speed up some shuffle operations by
entirely avoiding deserialization until the object hits user code.
Writables also probably ser / deser faster than Kryo.
* "No-deserialization" formats like FlatBuffers and Cap'n Proto address the
tradeoff between (1) Java objects that offer fast access but take lots of
space and stress GC and (2) Kryo-serialized buffers that are more compact
but take time to deserialize.

The drawbacks of these frameworks are that they require more work from the
user to define types, and that they're more restrictive in the reference
graphs they can represent.

In large applications, there are probably a few points where a
"specialized" serialization format is useful. But requiring Writables
everywhere because they're needed in a particularly intense shuffle is
cumbersome.

In light of that, would it make sense to enable varying Serializers within
an app? It could make sense to choose a serialization framework both based
on the objects being serialized and what they're being serialized for
(caching vs. shuffle).  It might be possible to implement this underneath
the Serializer interface with some sort of multiplexing serializer that
chooses between subserializers.

Nothing urgent here, but curious to hear others' opinions.

-Sandy
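
A rough sketch of the multiplexing idea mentioned above, using a simplified stand-in for Spark's real Serializer interface (MiniSerializer and the rule list are illustrative, not Spark API):

  import java.nio.ByteBuffer

  // Simplified stand-in for Spark's Serializer/SerializerInstance, just to
  // illustrate the dispatch logic.
  trait MiniSerializer {
    def serialize[T](obj: T): ByteBuffer
    def deserialize[T](bytes: ByteBuffer): T
  }

  // Picks a sub-serializer per object based on its class; the first matching
  // rule wins, otherwise the fallback (e.g. Kryo) is used.
  class MultiplexingSerializer(
      rules: Seq[(Class[_], MiniSerializer)],
      fallback: MiniSerializer) extends MiniSerializer {

    private def pick(obj: Any): MiniSerializer =
      rules.collectFirst { case (cls, ser) if cls.isInstance(obj) => ser }
        .getOrElse(fallback)

    override def serialize[T](obj: T): ByteBuffer = pick(obj).serialize(obj)

    // A real implementation would have to tag the bytes with which
    // sub-serializer wrote them; here we simply assume the fallback.
    override def deserialize[T](bytes: ByteBuffer): T =
      fallback.deserialize(bytes)
  }

Spark's actual Serializer interface also works on streams rather than only single objects, so a real implementation would be more involved, but the dispatch shape would look similar.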


Re: proposal / discuss: multiple Serializers within a SparkContext?

2014-11-08 Thread Sandy Ryza
Ah awesome.  Passing custom serializers when persisting an RDD is exactly
one of the things I was thinking of.

-Sandy

On Fri, Nov 7, 2014 at 1:19 AM, Matei Zaharia 
wrote:

> Yup, the JIRA for this was https://issues.apache.org/jira/browse/SPARK-540
> (one of our older JIRAs). I think it would be interesting to explore this
> further. Basically the way to add it into the API would be to add a version
> of persist() that takes another class than StorageLevel, say
> StorageStrategy, which allows specifying a custom serializer or perhaps
> even a transformation to turn each partition into another representation
> before saving it. It would also be interesting if this could work directly
> on an InputStream or ByteBuffer to deal with off-heap data.
>
> One issue we've found with our current Serializer interface by the way is
> that a lot of type information is lost when you pass data to it, so the
> serializers spend a fair bit of time figuring out what class each object
> written is. With this model, it would be possible for a serializer to know
> that all its data is of one type, which is pretty cool, but we might also
> consider ways of expanding the current Serializer interface to take more
> info.
>
> Matei
>
> > On Nov 7, 2014, at 1:09 AM, Reynold Xin  wrote:
> >
> > Technically you can already do custom serializer for each shuffle
> operation
> > (it is part of the ShuffledRDD). I've seen Matei suggesting on jira
> issues
> > (or github) in the past a "storage policy" in which you can specify how
> > data should be stored. I think that would be a great API to have in the
> > long run. Designing it won't be trivial though.
> >
> >
> > On Fri, Nov 7, 2014 at 1:05 AM, Sandy Ryza 
> wrote:
> >
> >> Hey all,
> >>
> >> Was messing around with Spark and Google FlatBuffers for fun, and it
> got me
> >> thinking about Spark and serialization.  I know there's been work / talk
> >> about in-memory columnar formats for Spark SQL, so maybe there are ways to
> >> provide this flexibility already that I've missed?  Either way, my
> >> thoughts:
> >>
> >> Java and Kryo serialization are really nice in that they require almost
> no
> >> extra work on the part of the user.  They can also represent complex
> object
> >> graphs with cycles etc.
> >>
> >> There are situations where other serialization frameworks are more
> >> efficient:
> >> * A Hadoop Writable style format that delineates key-value boundaries
> and
> >> allows for raw comparisons can greatly speed up some shuffle operations
> by
> >> entirely avoiding deserialization until the object hits user code.
> >> Writables also probably ser / deser faster than Kryo.
> >> * "No-deserialization" formats like FlatBuffers and Cap'n Proto address
> the
> >> tradeoff between (1) Java objects that offer fast access but take lots
> of
> >> space and stress GC and (2) Kryo-serialized buffers that are more
> compact
> >> but take time to deserialize.
> >>
> >> The drawbacks of these frameworks are that they require more work from
> the
> >> user to define types.  And that they're more restrictive in the
> reference
> >> graphs they can represent.
> >>
> >> In large applications, there are probably a few points where a
> >> "specialized" serialization format is useful. But requiring Writables
> >> everywhere because they're needed in a particularly intense shuffle is
> >> cumbersome.
> >>
> >> In light of that, would it make sense to enable varying Serializers
> within
> >> an app? It could make sense to choose a serialization framework both
> based
> >> on the objects being serialized and what they're being serialized for
> >> (caching vs. shuffle).  It might be possible to implement this
> underneath
> >> the Serializer interface with some sort of multiplexing serializer that
> >> chooses between subserializers.
> >>
> >> Nothing urgent here, but curious to hear others' opinions.
> >>
> >> -Sandy
> >>
>
>
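
A rough sketch of the persist-with-a-strategy API Matei describes above. StorageStrategy is only the name floated in the thread; its fields and the persist overload below are guesses, not an existing Spark API:

  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.serializer.Serializer

  // Hypothetical: bundles a storage level with an optional custom serializer
  // and an optional per-partition transformation applied before caching.
  case class StorageStrategy(
      level: StorageLevel,
      serializer: Option[Serializer] = None,
      transform: Option[Iterator[Any] => Iterator[Any]] = None)

  // The overload would presumably live on RDD[T], roughly:
  //   def persist(strategy: StorageStrategy): RDD[T]
  //
  // Example call site (KryoSerializer stands in for any custom serializer):
  //   rdd.persist(StorageStrategy(StorageLevel.MEMORY_ONLY_SER,
  //     serializer = Some(new org.apache.spark.serializer.KryoSerializer(sc.getConf))))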


Re: [NOTICE] [BUILD] Minor changes to Spark's build

2014-11-12 Thread Sandy Ryza
Currently there are no mandatory profiles required to build Spark.  I.e.
"mvn package" just works.  It seems sad that we would need to break this.

On Wed, Nov 12, 2014 at 10:59 PM, Patrick Wendell 
wrote:

> I think printing an error that says "-Pscala-2.10 must be enabled" is
> probably okay. It's a slight regression but it's super obvious to
> users. That could be a more elegant solution than the somewhat
> complicated monstrosity I proposed on the JIRA.
>
> On Wed, Nov 12, 2014 at 10:37 PM, Prashant Sharma 
> wrote:
> > One thing we can do is print a helpful error and break. I don't know
> > how this can be done yet, but since we can now write Groovy inside the
> > Maven build, we have more control. (Yay!!)
> >
> > Prashant Sharma
> >
> >
> >
> > On Thu, Nov 13, 2014 at 12:05 PM, Patrick Wendell 
> > wrote:
> >>
> >> Yeah Sandy and I were chatting about this today and didn't realize
> >> -Pscala-2.10 was mandatory. This is a fairly invasive change, so I was
> >> thinking maybe we could try to remove that. Also if someone doesn't
> >> give -Pscala-2.10 it fails in a way that is initially silent, which is
> >> bad because most people won't know to do this.
> >>
> >> https://issues.apache.org/jira/browse/SPARK-4375
> >>
> >> On Wed, Nov 12, 2014 at 10:29 PM, Prashant Sharma  >
> >> wrote:
> >> > Thanks Patrick, I have one suggestion that we should make passing
> >> > -Pscala-2.10 mandatory for maven users. I am sorry for not mentioning
> >> > this
> >> > before. There is no way around passing that option for Maven
> >> > users (only). However, this is unnecessary for sbt users because it is
> >> > added
> >> > automatically if -Pscala-2.11 is absent.
> >> >
> >> >
> >> > Prashant Sharma
> >> >
> >> >
> >> >
> >> > On Wed, Nov 12, 2014 at 3:53 PM, Sean Owen 
> wrote:
> >> >
> >> >> - Tip: when you rebase, IntelliJ will temporarily think things like
> the
> >> >> Kafka module are being removed. Say 'no' when it asks if you want to
> >> >> remove
> >> >> them.
> >> >> - Can we go straight to Scala 2.11.4?
> >> >>
> >> >> On Wed, Nov 12, 2014 at 5:47 AM, Patrick Wendell  >
> >> >> wrote:
> >> >>
> >> >> > Hey All,
> >> >> >
> >> >> > I've just merged a patch that adds support for Scala 2.11 which
> will
> >> >> > have some minor implications for the build. These are due to the
> >> >> > complexities of supporting two versions of Scala in a single
> project.
> >> >> >
> >> >> > 1. The JDBC server will now require a special flag to build
> >> >> > -Phive-thriftserver on top of the existing flag -Phive. This is
> >> >> > because some build permutations (only in Scala 2.11) won't support
> >> >> > the
> >> >> > JDBC server yet due to transitive dependency conflicts.
> >> >> >
> >> >> > 2. The build now uses non-standard source layouts in a few
> additional
> >> >> > places (we already did this for the Hive project) - the repl and
> the
> >> >> > examples modules. This is just fine for maven/sbt, but it may
> affect
> >> >> > users who import the build in IDE's that are using these projects
> and
> >> >> > want to build Spark from the IDE. I'm going to update our wiki to
> >> >> > include full instructions for making this work well in IntelliJ.
> >> >> >
> >> >> > If there are any other build related issues please respond to this
> >> >> > thread and we'll make sure they get sorted out. Thanks to Prashant
> >> >> > Sharma who is the author of this feature!
> >> >> >
> >> >> > - Patrick
> >> >> >
> >> >> >
> -
> >> >> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >> >> > For additional commands, e-mail: dev-h...@spark.apache.org
> >> >> >
> >> >> >
> >> >>
> >
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [NOTICE] [BUILD] Minor changes to Spark's build

2014-11-13 Thread Sandy Ryza
https://github.com/apache/spark/pull/3239 addresses this

On Thu, Nov 13, 2014 at 10:05 AM, Marcelo Vanzin 
wrote:

> Hello there,
>
> So I just took a quick look at the pom and I see two problems with it.
>
> - "activatedByDefault" does not work like you think it does. It only
> "activates by default" if you do not explicitly activate other
> profiles. So if you do "mvn package", scala-2.10 will be activated;
> but if you do "mvn -Pyarn package", it will not.
>
> - you need to duplicate the "activation" stuff everywhere where the
> profile is declared, not just in the root pom. (I spent quite some
> time yesterday fighting a similar issue...)
>
> My suggestion here is to change the activation of scala-2.10 to look like
> this:
>
> <activation>
>   <property>
>     <name>!scala-2.11</name>
>   </property>
> </activation>
>
> And change the scala-2.11 profile to do this:
>
> <properties>
>   <scala-2.11>true</scala-2.11>
> </properties>
>
> I haven't tested, but in my experience this will activate the
> scala-2.10 profile by default, unless you explicitly activate the 2.11
> profile, in which case that property will be set and scala-2.10 will
> not activate. If you look at examples/pom.xml, that's the same
> strategy used to choose which hbase profile to activate.
>
> Ah, and just to reinforce, the activation logic needs to be copied to
> other places (e.g. examples/pom.xml, repl/pom.xml, and any other place
> that has scala-2.x profiles).
>
>
>
> On Wed, Nov 12, 2014 at 11:14 PM, Patrick Wendell 
> wrote:
> > I actually do agree with this - let's see if we can find a solution
> > that doesn't regress this behavior. Maybe we can simply move the one
> > kafka example into its own project instead of having it in the
> > examples project.
> >
> > On Wed, Nov 12, 2014 at 11:07 PM, Sandy Ryza 
> wrote:
> >> Currently there are no mandatory profiles required to build Spark.  I.e.
> >> "mvn package" just works.  It seems sad that we would need to break
> this.
> >>
> >> On Wed, Nov 12, 2014 at 10:59 PM, Patrick Wendell 
> >> wrote:
> >>>
> >>> I think printing an error that says "-Pscala-2.10 must be enabled" is
> >>> probably okay. It's a slight regression but it's super obvious to
> >>> users. That could be a more elegant solution than the somewhat
> >>> complicated monstrosity I proposed on the JIRA.
> >>>
> >>> On Wed, Nov 12, 2014 at 10:37 PM, Prashant Sharma <
> scrapco...@gmail.com>
> >>> wrote:
> >>> > One thing we can do is print a helpful error and break. I don't
> know
> >>> > about how this can be done, but since now I can write groovy inside
> >>> > maven
> >>> > build so we have more control. (Yay!!)
> >>> >
> >>> > Prashant Sharma
> >>> >
> >>> >
> >>> >
> >>> > On Thu, Nov 13, 2014 at 12:05 PM, Patrick Wendell <
> pwend...@gmail.com>
> >>> > wrote:
> >>> >>
> >>> >> Yeah Sandy and I were chatting about this today and didn't realize
> >>> >> -Pscala-2.10 was mandatory. This is a fairly invasive change, so I
> was
> >>> >> thinking maybe we could try to remove that. Also if someone doesn't
> >>> >> give -Pscala-2.10 it fails in a way that is initially silent, which
> is
> >>> >> bad because most people won't know to do this.
> >>> >>
> >>> >> https://issues.apache.org/jira/browse/SPARK-4375
> >>> >>
> >>> >> On Wed, Nov 12, 2014 at 10:29 PM, Prashant Sharma
> >>> >> 
> >>> >> wrote:
> >>> >> > Thanks Patrick, I have one suggestion that we should make passing
> >>> >> > -Pscala-2.10 mandatory for maven users. I am sorry for not
> mentioning
> >>> >> > this
> >>> >> > before. There is no way around passing that option for Maven
> >>> >> > users (only). However, this is unnecessary for sbt users because
> it is
> >>> >> > added
> >>> >> > automatically if -Pscala-2.11 is absent.
> >>> >> >
> >>> >> >
> >>> >> > Prashant Sharma
> >>> >> >
> >>> >> >
> >>> >> >
> >>> >> > On Wed, Nov 12, 2014 at 3:53 PM, Sean Owen 
> >>> >> > wrote:
> >>> >> >
>

Re: Spark & Hadoop 2.5.1

2014-11-14 Thread sandy . ryza
You're the second person to request this today. Planning to include this in my 
PR for SPARK-4338.

-Sandy

> On Nov 14, 2014, at 8:48 AM, Corey Nolet  wrote:
> 
> In the past, I've built it by providing -Dhadoop.version=2.5.1 exactly like
> you've mentioned. What prompted me to write this email was that I did not
> see any documentation that told me Hadoop 2.5.1 was officially supported by
> Spark (i.e. community has been using it, any bugs are being fixed, etc...).
> It builds, tests pass, etc... but there could be other implications that I
> have not run into based on my own use of the framework.
> 
> If we are saying that the standard procedure is to build with the
> hadoop-2.4 profile and override the -Dhadoop.version property, should we
> provide that on the build instructions [1] at least?
> 
> [1] http://spark.apache.org/docs/latest/building-with-maven.html
> 
>> On Fri, Nov 14, 2014 at 10:46 AM, Sean Owen  wrote:
>> 
>> I don't think it's necessary. You're looking at the hadoop-2.4
>> profile, which works with anything >= 2.4. AFAIK there is no further
>> specialization needed beyond that. The profile sets hadoop.version to
>> 2.4.0 by default, but this can be overridden.
>> 
>>> On Fri, Nov 14, 2014 at 3:43 PM, Corey Nolet  wrote:
>>> I noticed Spark 1.2.0-SNAPSHOT still has 2.4.x in the pom. Since 2.5.x is
>>> the current stable Hadoop 2.x, would it make sense for us to update the
>>> poms?
>> 

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Too many open files error

2014-11-19 Thread Sandy Ryza
Qiuzhuang,

This is a known issue where ExternalAppendOnlyMap can do tons of tiny spills
in certain situations. SPARK-4452 aims to deal with this issue, but we
haven't finalized a solution yet.

Dinesh's solution should help as a workaround, but you'll likely experience
suboptimal performance when trying to merge tons of small files from disk.

-Sandy
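
If it helps in the meantime, spill pressure can sometimes be reduced by giving the shuffle more of the heap. These are standard Spark 1.x settings, but the values below are only examples:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("etl-job")  // example app name
    // Fraction of the heap used for shuffle aggregation before spilling
    // (default 0.2 in Spark 1.x); raising it means fewer, larger spills.
    .set("spark.shuffle.memoryFraction", "0.4")
    // Spilling can also be disabled outright, at the risk of OOM on large keys:
    // .set("spark.shuffle.spill", "false")
  val sc = new SparkContext(conf)

Raising the OS open-file limit, as Dinesh suggests below, is still the main workaround.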

On Wed, Nov 19, 2014 at 10:10 PM, Dinesh J. Weerakkody <
dineshjweerakk...@gmail.com> wrote:

> Hi Qiuzhuang,
>
> This is a linux related issue. Please go through this [1] and change the
> limits. hope this will solve your problem.
>
> [1] https://rtcamp.com/tutorials/linux/increase-open-files-limit/
>
> On Thu, Nov 20, 2014 at 9:45 AM, Qiuzhuang Lian 
> wrote:
>
> > Hi All,
> >
> > While doing some ETL, I ran into a 'Too many open files' error, as in the
> > following logs.
> >
> > Thanks,
> > Qiuzhuang
> >
> > 4/11/20 20:12:02 INFO collection.ExternalAppendOnlyMap: Thread 63
> spilling
> > in-memory map of 100.8 KB to disk (953 times so far)
> > 14/11/20 20:12:02 ERROR storage.DiskBlockObjectWriter: Uncaught exception
> > while reverting partial writes to file
> >
> >
> /tmp/spark-local-20141120200455-4137/2f/temp_local_f83cbf2f-60a4-4fbd-b5d2-32a0c569311b
> > java.io.FileNotFoundException:
> >
> >
> /tmp/spark-local-20141120200455-4137/2f/temp_local_f83cbf2f-60a4-4fbd-b5d2-32a0c569311b
> > (Too many open files)
> > at java.io.FileOutputStream.open(Native Method)
> > at java.io.FileOutputStream.(FileOutputStream.java:221)
> > at
> >
> >
> org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(BlockObjectWriter.scala:178)
> > at
> >
> >
> org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:203)
> > at
> >
> >
> org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:63)
> > at
> >
> >
> org.apache.spark.util.collection.Spillable$class.maybeSpill(Spillable.scala:77)
> > at
> >
> >
> org.apache.spark.util.collection.ExternalAppendOnlyMap.maybeSpill(ExternalAppendOnlyMap.scala:63)
> > at
> >
> >
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:131)
> > at
> >
> >
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:160)
> > at
> >
> >
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
> > at
> >
> >
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> > at
> >
> >
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> > at
> > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> > at
> >
> >
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> > at
> > org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:159)
> > at
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
> > at
> > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> > at
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
> > at
> >
> >
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
> > at
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
> > at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
> > at
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
> > at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> > at
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> > at
> > org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
> > at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
> > at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> > at
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
> > at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> > at
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
> > at
> > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> > at org.apache.spark.scheduler.Task.run(Task.scala:56)
> > at
> > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
> > at
> >
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> > at
> >
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> > at j

Re: [VOTE] Release Apache Spark 1.2.0 (RC1)

2014-12-01 Thread Sandy Ryza
+1 (non-binding)

built from source
fired up a spark-shell against YARN cluster
ran some jobs using parallelize
ran some jobs that read files
clicked around the web UI


On Sun, Nov 30, 2014 at 1:10 AM, GuoQiang Li  wrote:

> +1 (non-binding)
>
>
>
>
> -- Original --
> From:  "Patrick Wendell";;
> Date:  Sat, Nov 29, 2014 01:16 PM
> To:  "dev@spark.apache.org";
>
> Subject:  [VOTE] Release Apache Spark 1.2.0 (RC1)
>
>
>
> Please vote on releasing the following candidate as Apache Spark version
> 1.2.0!
>
> The tag to be voted on is v1.2.0-rc1 (commit 1056e9ec1):
>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=1056e9ec13203d0c51564265e94d77a054498fdb
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.2.0-rc1/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1048/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.2.0-rc1-docs/
>
> Please vote on releasing this package as Apache Spark 1.2.0!
>
> The vote is open until Tuesday, December 02, at 05:15 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.2.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> == What justifies a -1 vote for this release? ==
> This vote is happening very late into the QA period compared with
> previous votes, so -1 votes should only occur for significant
> regressions from 1.0.2. Bugs already present in 1.1.X, minor
> regressions, or bugs related to new features will not block this
> release.
>
> == What default changes should I be aware of? ==
> 1. The default value of "spark.shuffle.blockTransferService" has been
> changed to "netty"
> --> Old behavior can be restored by switching to "nio"
>
> 2. The default value of "spark.shuffle.manager" has been changed to "sort".
> --> Old behavior can be restored by setting "spark.shuffle.manager" to
> "hash".
>
> == Other notes ==
> Because this vote is occurring over a weekend, I will likely extend
> the vote if this RC survives until the end of the vote period.
>
> - Patrick
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>


Re: HA support for Spark

2014-12-10 Thread Sandy Ryza
I think that if we were able to maintain the full set of created RDDs as
well as some scheduler and block manager state, it would be enough for most
apps to recover.
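
For reference, the RDD checkpointing Jun Feng mentions below covers the data side of this: it persists an RDD to reliable storage and truncates its lineage, but it does not capture scheduler or executor state, which is the gap being discussed. A minimal example (the paths are illustrative):

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("checkpoint-demo"))
  sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")  // reliable storage

  val counts = sc.textFile("hdfs:///data/events")
    .map(line => (line.split(",")(0), 1L))
    .reduceByKey(_ + _)

  counts.checkpoint()  // marks the RDD; written out when first materialized
  counts.count()       // forces computation and the checkpoint write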

On Wed, Dec 10, 2014 at 5:30 AM, Jun Feng Liu  wrote:

> Well, it should not be mission impossible, considering there are so many HA
> solutions in existence today. I would be interested to know if there are any
> specific difficulties.
>
> Best Regards
>
>
> *Jun Feng Liu*
> IBM China Systems & Technology Laboratory in Beijing
>
>   --
>  [image: 2D barcode - encoded with contact information] *Phone: 
> *86-10-82452683
>
> * E-mail:* *liuj...@cn.ibm.com* 
> [image: IBM]
>
> BLD 28,ZGC Software Park
> No.8 Rd.Dong Bei Wang West, Dist.Haidian Beijing 100193
> China
>
>
>
>
>
>  *Reynold Xin >*
>
> 2014/12/10 16:30
>   To
> Jun Feng Liu/China/IBM@IBMCN,
> cc
> "dev@spark.apache.org" 
> Subject
> Re: HA support for Spark
>
>
>
>
> This would be plausible for specific purposes such as Spark streaming or
> Spark SQL, but I don't think it is doable for general Spark driver since it
> is just a normal JVM process with arbitrary program state.
>
> On Wed, Dec 10, 2014 at 12:25 AM, Jun Feng Liu  wrote:
>
> > Do we have any high availability support at the Spark driver level? For
> > example, we might want the Spark driver to be able to move to another node
> > and continue execution when a failure happens. I can see that RDD
> > checkpointing can help serialize the state of an RDD, and I can imagine
> > loading the checkpoint from another node when an error happens, but it
> > seems we would lose track of all task status, and even the executor
> > information maintained in the SparkContext. I am not sure if there is any
> > existing mechanism I can leverage to do that. Thanks for any suggestions.
> >
> > Best Regards
> >
> >
> > *Jun Feng Liu*
> > IBM China Systems & Technology Laboratory in Beijing
> >
> >   --
> >  [image: 2D barcode - encoded with contact information] *Phone:
> *86-10-82452683
> >
> > * E-mail:* *liuj...@cn.ibm.com* 
> > [image: IBM]
> >
> > BLD 28,ZGC Software Park
> > No.8 Rd.Dong Bei Wang West, Dist.Haidian Beijing 100193
> > China
> >
> >
> >
> >
> >
>
>


Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-11 Thread Sandy Ryza
+1 (non-binding).  Tested on Ubuntu against YARN.

On Thu, Dec 11, 2014 at 9:38 AM, Reynold Xin  wrote:

> +1
>
> Tested on OS X.
>
> On Wednesday, December 10, 2014, Patrick Wendell 
> wrote:
>
> > Please vote on releasing the following candidate as Apache Spark version
> > 1.2.0!
> >
> > The tag to be voted on is v1.2.0-rc2 (commit a428c446e2):
> >
> >
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=a428c446e23e628b746e0626cc02b7b3cadf588e
> >
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-1.2.0-rc2/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1055/
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-1.2.0-rc2-docs/
> >
> > Please vote on releasing this package as Apache Spark 1.2.0!
> >
> > The vote is open until Saturday, December 13, at 21:00 UTC and passes
> > if a majority of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 1.2.0
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see
> > http://spark.apache.org/
> >
> > == What justifies a -1 vote for this release? ==
> > This vote is happening relatively late into the QA period, so
> > -1 votes should only occur for significant regressions from
> > 1.0.2. Bugs already present in 1.1.X, minor
> > regressions, or bugs related to new features will not block this
> > release.
> >
> > == What default changes should I be aware of? ==
> > 1. The default value of "spark.shuffle.blockTransferService" has been
> > changed to "netty"
> > --> Old behavior can be restored by switching to "nio"
> >
> > 2. The default value of "spark.shuffle.manager" has been changed to
> "sort".
> > --> Old behavior can be restored by setting "spark.shuffle.manager" to
> > "hash".
> >
> > == How does this differ from RC1 ==
> > This has fixes for a handful of issues identified - some of the
> > notable fixes are:
> >
> > [Core]
> > SPARK-4498: Standalone Master can fail to recognize completed/failed
> > applications
> >
> > [SQL]
> > SPARK-4552: Query for empty parquet table in spark sql hive get
> > IllegalArgumentException
> > SPARK-4753: Parquet2 does not prune based on OR filters on partition
> > columns
> > SPARK-4761: With JDBC server, set Kryo as default serializer and
> > disable reference tracking
> > SPARK-4785: When called with arguments referring column fields, PMOD
> > throws NPE
> >
> > - Patrick
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> > For additional commands, e-mail: dev-h...@spark.apache.org
> 
> >
> >
>


Re: one hot encoding

2014-12-13 Thread Sandy Ryza
Hi Lochana,

We haven't yet added this in 1.2.
https://issues.apache.org/jira/browse/SPARK-4081 tracks adding categorical
feature indexing, which one-hot encoding can be built on.
https://issues.apache.org/jira/browse/SPARK-1216 also tracks a version of
this prior to the ML pipelines work.

-Sandy
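
In the meantime, a hand-rolled version is straightforward for a single categorical column (the column index and input RDD below are illustrative):

  import org.apache.spark.mllib.linalg.{Vector, Vectors}
  import org.apache.spark.rdd.RDD

  // Builds a category -> position map on the driver, then emits a sparse
  // one-hot vector per row for the chosen column.
  def oneHot(data: RDD[Array[String]], col: Int): RDD[Vector] = {
    val categories = data.map(_(col)).distinct().collect().sorted
    val index = categories.zipWithIndex.toMap
    data.map { row =>
      Vectors.sparse(categories.length, Array(index(row(col))), Array(1.0))
    }
  }

SPARK-4081 should eventually provide the category-indexing piece of this out of the box.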

On Fri, Dec 12, 2014 at 6:16 PM, Lochana Menikarachchi 
wrote:
>
> Do we have one-hot encoding in Spark MLlib 1.1.1 or 1.2.0? It wasn't
> available in 1.1.0.
> Thanks.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Spark Dev

2014-12-19 Thread Sandy Ryza
Hi Harikrishna,

A good place to start is taking a look at the wiki page on contributing:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

-Sandy

On Fri, Dec 19, 2014 at 2:43 PM, Harikrishna Kamepalli <
harikrishna.kamepa...@gmail.com> wrote:
>
> I am interested in contributing to Spark.
>


Re: Semantics of LGTM

2015-01-17 Thread sandy . ryza
I think clarifying these semantics is definitely worthwhile. Maybe this 
complicates the process with additional terminology, but the way I've used 
these has been:

+1 - I think this is safe to merge and, barring objections from others, would 
merge it immediately.

LGTM - I have no concerns about this patch, but I don't necessarily feel 
qualified to make a final call about it.  The TM part acknowledges the judgment 
as a little more subjective.

I think having some concise way to express both of these is useful.

-Sandy

> On Jan 17, 2015, at 5:40 PM, Patrick Wendell  wrote:
> 
> Hey All,
> 
> Just wanted to ping about a minor issue - but one that ends up having
> consequence given Spark's volume of reviews and commits. As much as
> possible, I think that we should try and gear towards "Google Style"
> LGTM on reviews. What I mean by this is that LGTM has the following
> semantics:
> 
> "I know this code well, or I've looked at it close enough to feel
> confident it should be merged. If there are issues/bugs with this code
> later on, I feel confident I can help with them."
> 
> Here is an alternative semantic:
> 
> "Based on what I know about this part of the code, I don't see any
> show-stopper problems with this patch".
> 
> The issue with the latter is that it ultimately erodes the
> significance of LGTM, since subsequent reviewers need to reason about
> what the person meant by saying LGTM. In contrast, having strong
> semantics around LGTM can help streamline reviews a lot, especially as
> reviewers get more experienced and gain trust from the comittership.
> 
> There are several easy ways to give a more limited endorsement of a patch:
> - "I'm not familiar with this code, but style, etc look good" (general
> endorsement)
> - "The build changes in this code LGTM, but I haven't reviewed the
> rest" (limited LGTM)
> 
> If people are okay with this, I might add a short note on the wiki.
> I'm sending this e-mail first, though, to see whether anyone wants to
> express agreement or disagreement with this approach.
> 
> - Patrick
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
> 

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Semantics of LGTM

2015-01-17 Thread sandy . ryza
Yeah, the ASF +1 has become partly overloaded to mean both "I would like to see 
this feature" and "this patch should be committed", although, at least in 
Hadoop, using +1 on JIRA (as opposed to, say, in a release vote) should 
unambiguously mean the latter unless qualified in some other way.

I don't have any opinion on the specific characters, but I agree with Aaron 
that it would be nice to have some sort of abbreviation for both the strong and 
weak forms of approval.

-Sandy

> On Jan 17, 2015, at 7:25 PM, Patrick Wendell  wrote:
> 
> I think the ASF +1 is *slightly* different than Google's LGTM, because
> it might convey wanting the patch/feature to be merged but not
> necessarily saying you did a thorough review and stand behind its
> technical contents. For instance, I've seen people pile on +1's to try
> and indicate support for a feature or patch in some projects, even
> though they didn't do a thorough technical review. This +1 is
> definitely a useful mechanism.
> 
> There is definitely much overlap though in the meaning, though, and
> it's largely because Spark had its own culture around reviews before
> it was donated to the ASF, so there is a mix of two styles.
> 
> Nonetheless, I'd prefer to stick with the stronger LGTM semantics I
> proposed originally (unlike the one Sandy proposed, e.g.). This is
> what I've seen every project using the LGTM convention do (Google, and
> some open source projects such as Impala) to indicate technical
> sign-off.
> 
> - Patrick
> 
>> On Sat, Jan 17, 2015 at 7:09 PM, Aaron Davidson  wrote:
>> I think I've seen something like +2 = "strong LGTM" and +1 = "weak LGTM;
>> someone else should review" before. It's nice to have a shortcut which isn't
>> a sentence when talking about weaker forms of LGTM.
>> 
>> On Sat, Jan 17, 2015 at 6:59 PM,  wrote:
>>> 
>>> I think clarifying these semantics is definitely worthwhile. Maybe this
>>> complicates the process with additional terminology, but the way I've used
>>> these has been:
>>> 
>>> +1 - I think this is safe to merge and, barring objections from others,
>>> would merge it immediately.
>>> 
>>> LGTM - I have no concerns about this patch, but I don't necessarily feel
>>> qualified to make a final call about it.  The TM part acknowledges the
>>> judgment as a little more subjective.
>>> 
>>> I think having some concise way to express both of these is useful.
>>> 
>>> -Sandy
>>> 
 On Jan 17, 2015, at 5:40 PM, Patrick Wendell  wrote:
 
 Hey All,
 
 Just wanted to ping about a minor issue - but one that ends up having
 consequence given Spark's volume of reviews and commits. As much as
 possible, I think that we should try and gear towards "Google Style"
 LGTM on reviews. What I mean by this is that LGTM has the following
 semantics:
 
 "I know this code well, or I've looked at it close enough to feel
 confident it should be merged. If there are issues/bugs with this code
 later on, I feel confident I can help with them."
 
 Here is an alternative semantic:
 
 "Based on what I know about this part of the code, I don't see any
 show-stopper problems with this patch".
 
 The issue with the latter is that it ultimately erodes the
 significance of LGTM, since subsequent reviewers need to reason about
 what the person meant by saying LGTM. In contrast, having strong
 semantics around LGTM can help streamline reviews a lot, especially as
 reviewers get more experienced and gain trust from the comittership.
 
 There are several easy ways to give a more limited endorsement of a
 patch:
 - "I'm not familiar with this code, but style, etc look good" (general
 endorsement)
 - "The build changes in this code LGTM, but I haven't reviewed the
 rest" (limited LGTM)
 
 If people are okay with this, I might add a short note on the wiki.
 I'm sending this e-mail first, though, to see whether anyone wants to
 express agreement or disagreement with this approach.
 
 - Patrick
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org
>>> 
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>> 

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Issue with repartition and cache

2015-01-21 Thread Sandy Ryza
Hi Dirceu,

Does the issue not show up if you run "map(f =>
f(1).asInstanceOf[Int]).sum" on the "train" RDD?  It appears that f(1) is
a String, not an Int.  If you're looking to parse and convert it, "toInt"
should be used instead of "asInstanceOf".

-Sandy
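
For illustration, with a cell that holds the string form of a number (types assumed from the stack in this thread):

  val cell: Any = "42"

  // cell.asInstanceOf[Int]          // ClassCastException: a String is not an Int
  val parsed = cell.toString.toInt   // parses the text instead: 42

  // so, roughly: cached120.map(f => f(1).toString.toInt).sum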

On Wed, Jan 21, 2015 at 8:43 AM, Dirceu Semighini Filho <
dirceu.semigh...@gmail.com> wrote:

> Hi guys, has anyone run into something like this?
> I have a training set, and when I repartition it, if I call cache it throws
> a ClassCastException when I try to execute anything that accesses it
>
> val rep120 = train.repartition(120)
> val cached120 = rep120.cache
> cached120.map(f => f(1).asInstanceOf[Int]).sum
>
> In [1]:
>
> ClusterSettings.executorMemory=Some("28g")
>
> ClusterSettings.maxResultSize = "20g"
>
> ClusterSettings.resume=true
>
> ClusterSettings.coreInstanceType="r3.xlarge"
>
> ClusterSettings.coreInstanceCount = 30
>
> ClusterSettings.clusterName="UberdataContextCluster-Dirceu"
>
> uc.applyDateFormat("YYMMddHH")
>
> Searching for existing cluster UberdataContextCluster-Dirceu ...
> Spark standalone cluster started at
> http://ec2-54-68-91-64.us-west-2.compute.amazonaws.com:8080
> Found 1 master(s), 30 slaves
> Ganglia started at
> http://ec2-54-68-91-64.us-west-2.compute.amazonaws.com:5080/ganglia
>
> In [37]:
>
> import org.apache.spark.sql.catalyst.types._
>
> import eleflow.uberdata.util.IntStringImplicitTypeConverter._
>
> import eleflow.uberdata.enums.SupportedAlgorithm._
>
> import eleflow.uberdata.data._
>
> import org.apache.spark.mllib.tree.DecisionTree
>
> import eleflow.uberdata.enums.DateSplitType._
>
> import org.apache.spark.mllib.regression.LabeledPoint
>
> import org.apache.spark.mllib.linalg.Vectors
>
> import org.apache.spark.mllib.classification._
>
> import eleflow.uberdata.model._
>
> import eleflow.uberdata.data.stat.Statistics
>
> import eleflow.uberdata.enums.ValidationMethod._
>
> import org.apache.spark.rdd.RDD
>
> In [5]:
>
> val train =
> uc.load(uc.toHDFSURI("/tmp/data/input/train_rev4.csv")).applyColumnTypes(Seq(DecimalType(),
> LongType,TimestampType, StringType,
>
>
>  StringType, StringType, StringType, StringType,
>
>
>   StringType, StringType, StringType, StringType,
>
>
>   StringType, StringType, StringType, StringType,
>
>
>  StringType, StringType, StringType, StringType,
>
>
>   LongType, LongType,StringType, StringType,StringType,
>
>
>   StringType,StringType))
>
> .formatDateValues(2,DayOfAWeek | Period).slice(excludes = Seq(12,13))
>
> Out[5]:
>
> idclickhour1hour2C1banner_possite_idsite_domainsite_categoryapp_idapp_domain
> app_categorydevice_modeldevice_typedevice_conn_typeC14C15C16C17C18C19C20C21
> 10941815109427302.03.0100501fbe01fef384576728905ebdecad23867801e8d9
> 07d7df2244956a241215706320501722035-179116934911786371502.03.010050
> 1fbe01fef384576728905ebdecad23867801e8d907d7df22711ee1201015704320501722035
> 10008479137190421511948602.03.0100501fbe01fef384576728905ebdecad2386
> 7801e8d907d7df228a4875bd101570432050172203510008479164072448083837602.0
>
> 3.0100501fbe01fef384576728905ebdecad23867801e8d907d7df226332421a101570632050
> 172203510008479167905641704209602.03.010051fe8cc4489166c1610569f928
>
> ecad23867801e8d907d7df22779d90c21018993320502161035-11571720757801103869
> 02.03.010050d6137915bb1ef334f028772becad23867801e8d907d7df228a4875bd1016920
> 3205018990431100077117172472998854491102.03.0100508fda644b25d4cfcd
> f028772becad23867801e8d907d7df22be6db1d71020362320502333039-1157
> In [7]:
>
> val test =
> uc.load(uc.toHDFSURI("/tmp/data/input/test_rev4.csv")).applyColumnTypes(Seq(DecimalType(),
> TimestampType, StringType,
>
>
>  StringType, StringType, StringType, StringType,
>
>
>   StringType, StringType, StringType, StringType,
>
>
>   StringType, StringType, StringType, StringType,
>
>
>  StringType, StringType, StringType, StringType,
>
>
>   LongType, LongType,StringType, StringType,StringType,
>
>
>   StringType,StringType)).
>
> formatDateValues(1,DayOfAWeek | Period).slice(excludes =Seq(11,12))
>
> Out[7]:
> idhour1hour2C1banner_possite_idsite_domainsite_categoryapp_idapp_domain
> app_categorydevice_modeldevice_typedevice_conn_typeC14C15C16C17C18C19C20C21
> 11740588092635695.03.010050235ba823f6ebf28ef028772becad23867801e8d9
> 07d7df220eb711ec1083303205076131751000752311825269208554285.03.010050
> 1fbe01fef384576728905ebdecad23867801e8d907d7df22ecb851b21022676320502616035
> 1000835115541398292139845.03.0100501fbe01fef384576728905ebdecad2386
> 7801e8d907d7df221f0bc64f102267632050261603510008351100010946378097988455.0
>
> 3.01005085f751fdc4e18dd650e219e051cedd4eaefc06bd0f2161f8542422a7101864832050
> 1092380910015661100013770415586707455.03.01005085f751fdc4e18dd650e219e0
>
> 9c13b4192347f47af95efa071f0bc64f1023160320502667047-122110001521204153353724
>
> 5.03.01005157fe1b205b626596

Re: renaming SchemaRDD -> DataFrame

2015-01-26 Thread Sandy Ryza
Both SchemaRDD and DataFrame sound fine to me, though I like the former
slightly better because it's more descriptive.

Even if SchemaRDD's needs to rely on Spark SQL under the covers, it would
be more clear from a user-facing perspective to at least choose a package
name for it that omits "sql".

I would also be in favor of adding a separate Spark Schema module for Spark
SQL to rely on, but I imagine that might be too large a change at this
point?

-Sandy
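
For what it's worth, the source-compatibility alias mentioned in the quoted thread below could look roughly like this (a self-contained sketch; DataFrame here is a stub standing in for the proposed class):

  object CompatSketch {
    class DataFrame                       // stand-in for the renamed type

    @deprecated("use DataFrame instead", "1.3.0")
    type SchemaRDD = DataFrame            // old name keeps compiling for callers
  }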

On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia 
wrote:

> (Actually when we designed Spark SQL we thought of giving it another name,
> like Spark Schema, but we decided to stick with SQL since that was the most
> obvious use case to many users.)
>
> Matei
>
> > On Jan 26, 2015, at 5:31 PM, Matei Zaharia 
> wrote:
> >
> > While it might be possible to move this concept to Spark Core long-term,
> supporting structured data efficiently does require quite a bit of the
> infrastructure in Spark SQL, such as query planning and columnar storage.
> The intent of Spark SQL though is to be more than a SQL server -- it's
> meant to be a library for manipulating structured data. Since this is
> possible to build over the core API, it's pretty natural to organize it
> that way, same as Spark Streaming is a library.
> >
> > Matei
> >
> >> On Jan 26, 2015, at 4:26 PM, Koert Kuipers  wrote:
> >>
> >> "The context is that SchemaRDD is becoming a common data format used for
> >> bringing data into Spark from external systems, and used for various
> >> components of Spark, e.g. MLlib's new pipeline API."
> >>
> >> i agree. this to me also implies it belongs in spark core, not sql
> >>
> >> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
> >> michaelma...@yahoo.com.invalid> wrote:
> >>
> >>> And in the off chance that anyone hasn't seen it yet, the Jan. 13 Bay
> Area
> >>> Spark Meetup YouTube contained a wealth of background information on
> this
> >>> idea (mostly from Patrick and Reynold :-).
> >>>
> >>> https://www.youtube.com/watch?v=YWppYPWznSQ
> >>>
> >>> 
> >>> From: Patrick Wendell 
> >>> To: Reynold Xin 
> >>> Cc: "dev@spark.apache.org" 
> >>> Sent: Monday, January 26, 2015 4:01 PM
> >>> Subject: Re: renaming SchemaRDD -> DataFrame
> >>>
> >>>
> >>> One thing potentially not clear from this e-mail, there will be a 1:1
> >>> correspondence where you can get an RDD to/from a DataFrame.
> >>>
> >>>
> >>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin 
> wrote:
>  Hi,
> 
>  We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted
> to
>  get the community's opinion.
> 
>  The context is that SchemaRDD is becoming a common data format used
> for
>  bringing data into Spark from external systems, and used for various
>  components of Spark, e.g. MLlib's new pipeline API. We also expect
> more
> >>> and
>  more users to be programming directly against SchemaRDD API rather
> than
> >>> the
>  core RDD API. SchemaRDD, through its less commonly used DSL originally
>  designed for writing test cases, always has the data-frame like API.
> In
>  1.3, we are redesigning the API to make the API usable for end users.
> 
> 
>  There are two motivations for the renaming:
> 
>  1. DataFrame seems to be a more self-evident name than SchemaRDD.
> 
>  2. SchemaRDD/DataFrame is actually not going to be an RDD anymore
> (even
>  though it would contain some RDD functions like map, flatMap, etc),
> and
>  calling it Schema*RDD* while it is not an RDD is highly confusing.
> >>> Instead,
>  DataFrame.rdd will return the underlying RDD for all RDD methods.
> 
> 
>  My understanding is that very few users program directly against the
>  SchemaRDD API at the moment, because they are not well documented.
> >>> However,
>  To maintain backward compatibility, we can create a type alias
> DataFrame
>  that is still named SchemaRDD. This will maintain source compatibility
> >>> for
>  Scala. That said, we will have to update all existing materials to use
>  DataFrame rather than SchemaRDD.
> >>>
> >>> -
> >>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >>> For additional commands, e-mail: dev-h...@spark.apache.org
> >>>
> >>> -
> >>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >>> For additional commands, e-mail: dev-h...@spark.apache.org
> >>>
> >>>
> >
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Improving metadata in Spark JIRA

2015-02-06 Thread Sandy Ryza
JIRA updates don't go to this list, they go to iss...@spark.apache.org.  I
don't think many are signed up for that list, and those that are probably
have a flood of emails anyway.

So I'd definitely be in favor of any JIRA cleanup that you're up for.

-Sandy

On Fri, Feb 6, 2015 at 6:45 AM, Sean Owen  wrote:

> I've wasted no time in wielding the commit bit to complete a number of
> small, uncontroversial changes. I wouldn't commit anything that didn't
> already appear to have review, consensus and little risk, but please
> let me know if anything looked a little too bold, so I can calibrate.
>
>
> Anyway, I'd like to continue some small house-cleaning by improving
> the state of JIRA's metadata, in order to let it give us a little
> clearer view on what's happening in the project:
>
> a. Add Component to every (open) issue that's missing one
> b. Review all Critical / Blocker issues to de-escalate ones that seem
> obviously neither
> c. Correct open issues that list a Fix version that has already been
> released
> d. Close all issues Resolved for a release that has already been released
>
> The problem with doing so is that it will create a tremendous amount
> of email to the list, like, several hundred. It's possible to make
> bulk changes and suppress e-mail though, which could be done for all
> but b.
>
> Better to suppress the emails when making such changes? or just not
> bother on some of these?
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: multi-line comment style

2015-02-09 Thread Sandy Ryza
+1 to what Andrew said, I think both make sense in different situations and
trusting developer discretion here is reasonable.
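
For concreteness, the two styles under discussion look like this in Scala:

  // Style one: a short multi-line note carried with line comments; this reads
  // fine when the explanation is only two or three lines long.
  def parse(line: String): Array[String] = line.split(",")

  /*
   * Style two: a block comment for a longer explanation spanning a paragraph
   * or more, where repeating // on every line starts to look awkward.
   */
  def normalize(fields: Array[String]): Array[String] = fields.map(_.trim)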

On Mon, Feb 9, 2015 at 1:48 PM, Andrew Or  wrote:

> In my experience I find it much more natural to use // for short multi-line
> comments (2 or 3 lines), and /* */ for long multi-line comments involving
> one or more paragraphs. For short multi-line comments, there is no reason
> not to use // if it just so happens that your first line exceeded 100
> characters and you have to wrap it. For long multi-line comments, however,
> using // all the way looks really awkward especially if you have multiple
> paragraphs.
>
> Thus, I would actually suggest that we don't try to pick a favorite and
> document that both are acceptable. I don't expect developers to follow my
> exact usage (i.e. with a tipping point of 2-3 lines) so I wouldn't enforce
> anything specific either.
>
> 2015-02-09 13:36 GMT-08:00 Reynold Xin :
>
> > Why don't we just pick // as the default (by encouraging it in the style
> > guide), since it is mostly used, and then do not disallow /* */? I don't
> > think it is that big of a deal to have slightly deviations here since it
> is
> > dead simple to understand what's going on.
> >
> >
> > On Mon, Feb 9, 2015 at 1:33 PM, Patrick Wendell 
> > wrote:
> >
> > > Clearly there isn't a strictly optimal commenting format (pro's and
> > > cons for both '//' and '/*'). My thought is for consistency we should
> > > just chose one and put in the style guide.
> > >
> > > On Mon, Feb 9, 2015 at 12:25 PM, Xiangrui Meng 
> wrote:
> > > > Btw, I think allowing `/* ... */` without the leading `*` in lines is
> > > > also useful. Check this line:
> > > >
> > >
> >
> https://github.com/apache/spark/pull/4259/files#diff-e9dcb3b5f3de77fc31b3aff7831110eaR55
> > > ,
> > > > where we put the R commands that can reproduce the test result. It is
> > > > easier if we write in the following style:
> > > >
> > > > ~~~
> > > > /*
> > > >  Using the following R code to load the data and train the model
> using
> > > > glmnet package.
> > > >
> > > >  library("glmnet")
> > > >  data <- read.csv("path", header=FALSE, stringsAsFactors=FALSE)
> > > >  features <- as.matrix(data.frame(as.numeric(data$V2),
> > > as.numeric(data$V3)))
> > > >  label <- as.numeric(data$V1)
> > > >  weights <- coef(glmnet(features, label, family="gaussian", alpha =
> 0,
> > > > lambda = 0))
> > > >  */
> > > > ~~~
> > > >
> > > > So people can copy & paste the R commands directly.
> > > >
> > > > Xiangrui
> > > >
> > > > On Mon, Feb 9, 2015 at 12:18 PM, Xiangrui Meng 
> > wrote:
> > > >> I like the `/* .. */` style more. Because it is easier for IDEs to
> > > >> recognize it as a block comment. If you press enter in the comment
> > > >> block with the `//` style, IDEs won't add `//` for you. -Xiangrui
> > > >>
> > > >> On Wed, Feb 4, 2015 at 2:15 PM, Reynold Xin 
> > > wrote:
> > > >>> We should update the style doc to reflect what we have in most
> places
> > > >>> (which I think is //).
> > > >>>
> > > >>>
> > > >>>
> > > >>> On Wed, Feb 4, 2015 at 2:09 PM, Shivaram Venkataraman <
> > > >>> shiva...@eecs.berkeley.edu> wrote:
> > > >>>
> > >  FWIW I like the multi-line // over /* */ from a purely style
> > > standpoint.
> > >  The Google Java style guide[1] has some comment about code
> > formatting
> > > tools
> > >  working better with /* */ but there doesn't seem to be any strong
> > > arguments
> > >  for one over the other I can find
> > > 
> > >  Thanks
> > >  Shivaram
> > > 
> > >  [1]
> > > 
> > > 
> > >
> >
> https://google-styleguide.googlecode.com/svn/trunk/javaguide.html#s4.8.6.1-block-comment-style
> > > 
> > >  On Wed, Feb 4, 2015 at 2:05 PM, Patrick Wendell <
> pwend...@gmail.com
> > >
> > >  wrote:
> > > 
> > >  > Personally I have no opinion, but agree it would be nice to
> > > standardize.
> > >  >
> > >  > - Patrick
> > >  >
> > >  > On Wed, Feb 4, 2015 at 1:58 PM, Sean Owen 
> > > wrote:
> > >  > > One thing Marcelo pointed out to me is that the // style does
> > not
> > >  > > interfere with commenting out blocks of code with /* */, which
> > is
> > > a
> > >  > > small good thing. I am also accustomed to // style for
> > multiline,
> > > and
> > >  > > reserve /** */ for javadoc / scaladoc. Meaning, seeing the /*
> */
> > > style
> > >  > > inline always looks a little funny to me.
> > >  > >
> > >  > > On Wed, Feb 4, 2015 at 3:53 PM, Kay Ousterhout <
> > >  kayousterh...@gmail.com>
> > >  > wrote:
> > >  > >> Hi all,
> > >  > >>
> > >  > >> The Spark Style Guide
> > >  > >> <
> > >  >
> > >
> https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide
> > >  >
> > >  > >> says multi-line comments should formatted as:
> > >  > >>
> > >  > >> /*
> > >  > >>  * This is a
> > >  > >>  * very
> > >  > >>  * long comment.
> > >  >

Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-08 Thread Sandy Ryza
+1 (non-binding, doc and packaging issues aside)

Built from source, ran jobs and spark-shell against a pseudo-distributed
YARN cluster.

On Sun, Mar 8, 2015 at 2:42 PM, Krishna Sankar  wrote:

> Yep, otherwise this will become an N^2 problem - Scala versions X Hadoop
> Distributions X ...
>
> Maybe one option is to have a minimum basic set (which I know is what we
> are discussing) and move the rest to spark-packages.org. There the vendors
> can add the latest downloads - for example when 1.4 is released, HDP can
> build a release of HDP Spark 1.4 bundle.
>
> Cheers
> 
>
> On Sun, Mar 8, 2015 at 2:11 PM, Patrick Wendell 
> wrote:
>
> > We probably want to revisit the way we do binaries in general for
> > 1.4+. IMO, something worth forking a separate thread for.
> >
> > I've been hesitating to add new binaries because people
> > (understandably) complain if you ever stop packaging older ones, but
> > on the other hand the ASF has complained that we have too many
> > binaries already and that we need to pare it down because of the large
> > volume of files. Doubling the number of binaries we produce for Scala
> > 2.11 seemed like it would be too much.
> >
> > One solution potentially is to actually package "Hadoop provided"
> > binaries and encourage users to use these by simply setting
> > HADOOP_HOME, or have instructions for specific distros. I've heard
> > that our existing packages don't work well on HDP for instance, since
> > there are some configuration quirks that differ from the upstream
> > Hadoop.
> >
> > If we cut down on the cross building for Hadoop versions, then it is
> > more tenable to cross build for Scala versions without exploding the
> > number of binaries.
> >
> > - Patrick
> >
> > On Sun, Mar 8, 2015 at 12:46 PM, Sean Owen  wrote:
> > > Yeah, interesting question of what is the better default for the
> > > single set of artifacts published to Maven. I think there's an
> > > argument for Hadoop 2 and perhaps Hive for the 2.10 build too. Pros
> > > and cons discussed more at
> > >
> > > https://issues.apache.org/jira/browse/SPARK-5134
> > > https://github.com/apache/spark/pull/3917
> > >
> > > On Sun, Mar 8, 2015 at 7:42 PM, Matei Zaharia  >
> > wrote:
> > >> +1
> > >>
> > >> Tested it on Mac OS X.
> > >>
> > >> One small issue I noticed is that the Scala 2.11 build is using Hadoop
> > 1 without Hive, which is kind of weird because people will more likely
> want
> > Hadoop 2 with Hive. So it would be good to publish a build for that
> > configuration instead. We can do it if we do a new RC, or it might be
> that
> > binary builds may not need to be voted on (I forgot the details there).
> > >>
> > >> Matei
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
> >
>


Re: Directly broadcasting (sort of) RDDs

2015-03-22 Thread Sandy Ryza
Hi Guillaume,

I've long thought something like this would be useful - i.e. the ability to
broadcast RDDs directly without first pulling data through the driver.  If
I understand correctly, your requirement to "block" a matrix up and only
fetch the needed parts could be implemented on top of this by splitting an
RDD into a set of smaller RDDs and then broadcasting each one on its own.
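
As a rough sketch of that workaround with today's API (the function and
names below are made up, and note that each block is still pulled through
the driver by collect before being broadcast):

~~~
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Split a block-keyed RDD into per-block arrays and broadcast each block
// separately, so tasks only dereference the blocks they actually need.
def broadcastByBlock[V](
    sc: SparkContext,
    blocked: RDD[(Int, V)],
    numBlocks: Int): Map[Int, Broadcast[Array[(Int, V)]]] = {
  (0 until numBlocks).map { b =>
    val block = blocked.filter(_._1 == b).collect()  // goes through the driver
    b -> sc.broadcast(block)
  }.toMap
}
~~~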

Unfortunately nobody is working on this currently (and I couldn't promise
to have bandwidth to review it at the moment either), but I suspect we'll
eventually need to add something like this for map joins in Hive on Spark
and Spark SQL.

-Sandy



On Sat, Mar 21, 2015 at 3:11 AM, Guillaume Pitel  wrote:

>  Hi,
>
> Thanks for your answer. This is precisely the use case I'm interested in,
> but I know it already, I should have mentionned it. Unfortunately this
> implementation of BlockMatrix has (in my opinion) some disadvantages (the
> fact that it split the matrix by range instead of using a modulo is bad for
> block skewness). Besides, and more importantly, as I was writing, it uses
> the join solution (actually a cogroup :
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala,
> line 361). The reduplication of the elements of the dense matrix is thus
> dependent on the block size.
>
> Actually I'm wondering if what I want to achieve could be made with a
> simple modification to the join, allowing a partition to be weakly cached
> after being retrieved.
>
> Guillaume
>
>
>  There is block matrix in Spark 1.3 - 
> http://spark.apache.org/docs/latest/mllib-data-types.html#blockmatrix
>
>
>
>
>
> However I believe it only supports dense matrix blocks.
>
>
>
>
> Still, might be possible to use it or exetend
>
>
>
>
> JIRAs:
>
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-3434
>
>
>
>
>
> Was based on
>
> https://github.com/amplab/ml-matrix
>
>
>
>
>
> Another lib:
>
> https://github.com/PasaLab/marlin/blob/master/README.md
>
>
>
>
>
>
>
> —
> Sent from Mailbox
>
> On Sat, Mar 21, 2015 at 12:24 AM, Guillaume Pitel 
>  wrote:
>
>
>  Hi,
> I have an idea that I would like to discuss with the Spark devs. The
> idea comes from a very real problem that I have struggled with since
> almost a year. My problem is very simple, it's a dense matrix * sparse
> matrix  operation. I have a dense matrix RDD[(Int,FloatMatrix)] which is
> divided in X large blocks (one block per partition), and a sparse matrix
> RDD[((Int,Int),Array[Array[(Int,Float)]])], divided in X * Y blocks. The
> most efficient way to perform the operation is to collectAsMap() the
> dense matrix and broadcast it, then perform the block-local
> multiplications, and combine the results by column.
> This is quite fine, unless the matrix is too big to fit in memory
> (especially since the multiplication is performed several times
> iteratively, and the broadcasts are not always cleaned from memory as I
> would naively expect).
> When the dense matrix is too big, a second solution is to split the big
> sparse matrix in several RDD, and do several broadcasts. Doing this
> creates quite a big overhead, but it mostly works, even though I often
> face some problems with unaccessible broadcast files, for instance.
> Then there is the terrible but apparently very effective good old join.
> Since X blocks of the sparse matrix use the same block from the dense
> matrix, I suspect that the dense matrix is somehow replicated X times
> (either on disk or in the network), which is the reason why the join
> takes so much time.
> After this bit of a context, here is my idea : would it be possible to
> somehow "broadcast" (or maybe more accurately, share or serve) a
> persisted RDD which is distributed on all workers, in a way that would,
> a bit like the IndexedRDD, allow a task to access a partition or an
> element of a partition in the closure, with a worker-local memory cache
> . i.e. the information about where each block resides would be
> distributed on the workers, to allow them to access parts of the RDD
> directly. I think that's already a bit how RDD are shuffled ?
> The RDD could stay distributed (no need to collect then broadcast), and
> only necessary transfers would be required.
> Is this a bad idea, is it already implemented somewhere (I would love it
> !) ?or is it something that could add efficiency not only for my use
> case, but maybe for others ? Could someone give me some hint about how I
> could add this possibility to Spark ? I would probably try to extend a
> RDD into a specific SharedIndexedRDD with a special lookup that would be
> allowed from tasks as a special case, and that would try to contact the
> blockManager and reach the corresponding data from the right worker.
> Thanks in advance for your advices
> Guillaume
> --
> eXenSa
>   
> *Guillaume PITEL, Président*
> +33(0)626 222 431
> eXenSa S.A.S.  
> 41, rue Péri

Re: Spark Executor resources

2015-03-24 Thread Sandy Ryza
Hi Zoltan,

If running on YARN, the YARN NodeManager starts executors.  I don't think
there's a 100% precise way for the Spark executor to know how many
resources are allotted to it.  It can come close by looking at the Spark
configuration options used to request it (spark.executor.memory and
spark.yarn.executor.memoryOverhead), but it can't necessarily account for the
amount that YARN has rounded up if those configuration properties
(yarn.scheduler.minimum-allocation-mb and
yarn.scheduler.increment-allocation-mb) are not present on the node.
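
As a rough illustration of the "come close" estimate described above (the
fallback overhead used below is an assumption for this sketch, not
necessarily Spark's actual default, and YARN's rounding is deliberately left
out because the executor cannot see it):

~~~
import org.apache.spark.SparkConf

// Estimate, from the executor side, the container size that was requested
// for it. YARN may still have rounded this up to its minimum-allocation /
// increment settings, which this code cannot see.
def requestedContainerMb(conf: SparkConf): Int = {
  // Simplified parser for values like "2g" or "1024m".
  def toMb(s: String): Int = s.trim.toLowerCase match {
    case g if g.endsWith("g") => g.dropRight(1).toInt * 1024
    case m if m.endsWith("m") => m.dropRight(1).toInt
    case plain                => plain.toInt
  }
  val executorMb = toMb(conf.get("spark.executor.memory", "1g"))
  // Assumed fallback: max of 10% of executor memory or 384 MB.
  val overheadMb = conf.getInt("spark.yarn.executor.memoryOverhead",
    math.max((0.10 * executorMb).toInt, 384))
  executorMb + overheadMb
}
~~~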

-Sandy


On Mon, Mar 23, 2015 at 5:08 PM, Zoltán Zvara 
wrote:

> Let's say I'm an Executor instance in a Spark system. Who started me and
> where, when I run on a worker node supervised by (a) Mesos, (b) YARN? I
> suppose I'm the only one Executor on a worker node for a given framework
> scheduler (driver). If I'm an Executor instance, who is the closest object
> to me who can tell me how many resources do I have on (a) Mesos, (b) YARN?
>
> Thank you for your kind input!
>
> Zvara Zoltán
>
>
>
> mail, hangout, skype: zoltan.zv...@gmail.com
>
> mobile, viber: +36203129543
>
> bank: 10918001-0021-50480008
>
> address: Hungary, 2475 Kápolnásnyék, Kossuth 6/a
>
> elte: HSKSJZ (ZVZOAAI.ELTE)
>


Re: Spark Executor resources

2015-03-24 Thread Sandy Ryza
That's correct.  What's the reason this information is needed?

-Sandy

On Tue, Mar 24, 2015 at 11:41 AM, Zoltán Zvara 
wrote:

> Thank you for your response!
>
> I guess the (Spark)AM, who gives the container leash to the NM (along with
> the executor JAR and command to run) must know how many CPU or RAM that
> container capped, isolated at. There must be a resource vector along the
> encrypted container leash if I'm right that describes this. Or maybe is
> there a way for the ExecutorBackend to fetch this information directly from
> the environment? Then, the ExecutorBackend would be able to hand over this
> information to the actual Executor who creates the TaskRunner.
>
> Zvara Zoltán
>
>
>
> mail, hangout, skype: zoltan.zv...@gmail.com
>
> mobile, viber: +36203129543
>
> bank: 10918001-0021-50480008
>
> address: Hungary, 2475 Kápolnásnyék, Kossuth 6/a
>
> elte: HSKSJZ (ZVZOAAI.ELTE)
>
> 2015-03-24 16:30 GMT+01:00 Sandy Ryza :
>
>> Hi Zoltan,
>>
>> If running on YARN, the YARN NodeManager starts executors.  I don't think
>> there's a 100% precise way for the Spark executor to know how many
>> resources are allotted to it.  It can come close by looking at the Spark
>> configuration options used to request it (spark.executor.memory and
>> spark.yarn.executor.memoryOverhead), but it can't necessarily account for the
>> amount that YARN has rounded up if those configuration properties
>> (yarn.scheduler.minimum-allocation-mb and
>> yarn.scheduler.increment-allocation-mb) are not present on the node.
>>
>> -Sandy
>>
>>
>> On Mon, Mar 23, 2015 at 5:08 PM, Zoltán Zvara 
>> wrote:
>>
>>> Let's say I'm an Executor instance in a Spark system. Who started me and
>>> where, when I run on a worker node supervised by (a) Mesos, (b) YARN? I
>>> suppose I'm the only one Executor on a worker node for a given framework
>>> scheduler (driver). If I'm an Executor instance, who is the closest
>>> object
>>> to me who can tell me how many resources do I have on (a) Mesos, (b)
>>> YARN?
>>>
>>> Thank you for your kind input!
>>>
>>> Zvara Zoltán
>>>
>>>
>>>
>>> mail, hangout, skype: zoltan.zv...@gmail.com
>>>
>>> mobile, viber: +36203129543
>>>
>>> bank: 10918001-0021-50480008
>>>
>>> address: Hungary, 2475 Kápolnásnyék, Kossuth 6/a
>>>
>>> elte: HSKSJZ (ZVZOAAI.ELTE)
>>>
>>
>>
>


Re: hadoop input/output format advanced control

2015-03-25 Thread Sandy Ryza
Regarding Patrick's question, you can just do "new Configuration(oldConf)"
to get a cloned Configuration object and add any new properties to it.
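
A minimal sketch of that clone-and-override pattern, applied to the
per-RDD min-split-size example from this thread (the 128 MB value is
arbitrary, and the helper name is made up):

~~~
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Clone the shared Hadoop conf, tweak it for this one input, and pass the
// clone to the read so other RDDs keep the original settings.
def readWithMinSplit(sc: SparkContext, path: String): RDD[String] = {
  val perRddConf = new Configuration(sc.hadoopConfiguration)
  // "mapred.min.split.size" is the key discussed in this thread; Hadoop 2
  // treats it as a deprecated alias of the new-API split-size key.
  perRddConf.set("mapred.min.split.size", (128L * 1024 * 1024).toString)
  sc.newAPIHadoopFile(path, classOf[TextInputFormat],
      classOf[LongWritable], classOf[Text], perRddConf)
    .map(_._2.toString)
}
~~~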

-Sandy

On Wed, Mar 25, 2015 at 4:42 PM, Imran Rashid  wrote:

> Hi Nick,
>
> I don't remember the exact details of these scenarios, but I think the user
> wanted a lot more control over how the files got grouped into partitions,
> to group the files together by some arbitrary function.  I didn't think
> that was possible w/ CombineFileInputFormat, but maybe there is a way?
>
> thanks
>
> On Tue, Mar 24, 2015 at 1:50 PM, Nick Pentreath 
> wrote:
>
> > Imran, on your point to read multiple files together in a partition, is
> it
> > not simpler to use the approach of copy Hadoop conf and set per-RDD
> > settings for min split to control the input size per partition, together
> > with something like CombineFileInputFormat?
> >
> > On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid 
> > wrote:
> >
> > > I think this would be a great addition, I totally agree that you need
> to
> > be
> > > able to set these at a finer context than just the SparkContext.
> > >
> > > Just to play devil's advocate, though -- the alternative is for you
> just
> > > subclass HadoopRDD yourself, or make a totally new RDD, and then you
> > could
> > > expose whatever you need.  Why is this solution better?  IMO the
> criteria
> > > are:
> > > (a) common operations
> > > (b) error-prone / difficult to implement
> > > (c) non-obvious, but important for performance
> > >
> > > I think this case fits (a) & (c), so I think its still worthwhile.  But
> > its
> > > also worth asking whether or not its too difficult for a user to extend
> > > HadoopRDD right now.  There have been several cases in the past week
> > where
> > > we've suggested that a user should read from hdfs themselves (eg., to
> > read
> > > multiple files together in one partition) -- with*out* reusing the code
> > in
> > > HadoopRDD, though they would lose things like the metric tracking &
> > > preferred locations you get from HadoopRDD.  Does HadoopRDD need to
> some
> > > refactoring to make that easier to do?  Or do we just need a good
> > example?
> > >
> > > Imran
> > >
> > > (sorry for hijacking your thread, Koert)
> > >
> > >
> > >
> > > On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers 
> > wrote:
> > >
> > > > see email below. reynold suggested i send it to dev instead of user
> > > >
> > > > -- Forwarded message --
> > > > From: Koert Kuipers 
> > > > Date: Mon, Mar 23, 2015 at 4:36 PM
> > > > Subject: hadoop input/output format advanced control
> > > > To: "u...@spark.apache.org" 
> > > >
> > > >
> > > > currently its pretty hard to control the Hadoop Input/Output formats
> > used
> > > > in Spark. The conventions seems to be to add extra parameters to all
> > > > methods and then somewhere deep inside the code (for example in
> > > > PairRDDFunctions.saveAsHadoopFile) all these parameters get
> translated
> > > into
> > > > settings on the Hadoop Configuration object.
> > > >
> > > > for example for compression i see "codec: Option[Class[_ <:
> > > > CompressionCodec]] = None" added to a bunch of methods.
> > > >
> > > > how scalable is this solution really?
> > > >
> > > > for example i need to read from a hadoop dataset and i dont want the
> > > input
> > > > (part) files to get split up. the way to do this is to set
> > > > "mapred.min.split.size". now i dont want to set this at the level of
> > the
> > > > SparkContext (which can be done), since i dont want it to apply to
> > input
> > > > formats in general. i want it to apply to just this one specific
> input
> > > > dataset i need to read. which leaves me with no options currently. i
> > > could
> > > > go add yet another input parameter to all the methods
> > > > (SparkContext.textFile, SparkContext.hadoopFile,
> > SparkContext.objectFile,
> > > > etc.). but that seems ineffective.
> > > >
> > > > why can we not expose a Map[String, String] or some other generic way
> > to
> > > > manipulate settings for hadoop input/output formats? it would require
> > > > adding one more parameter to all methods to deal with hadoop
> > input/output
> > > > formats, but after that its done. one parameter to rule them all
> > > >
> > > > then i could do:
> > > > val x = sc.textFile("/some/path", formatSettings =
> > > > Map("mapred.min.split.size" -> "12345"))
> > > >
> > > > or
> > > > rdd.saveAsTextFile("/some/path, formatSettings =
> > > > Map(mapred.output.compress" -> "true",
> > "mapred.output.compression.codec"
> > > ->
> > > > "somecodec"))
> > > >
> > >
> >
>


Re: RDD.count

2015-03-28 Thread Sandy Ryza
I definitely see the value in this.  However, I think at this point it
would be an incompatible behavioral change.  People often use count in
Spark to exercise their DAG.  Omitting processing steps that were
previously included would likely mislead many users into thinking their
pipeline was running faster.
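
A tiny example of why skipping transformations would be observable,
assuming sc is an existing SparkContext (the accumulator stands in for any
side effect inside a map function):

~~~
// Each map invocation bumps the accumulator as a side effect.
val evaluated = sc.accumulator(0)
val n = sc.parallelize(1 to 1000)
  .map { x => evaluated += 1; x * 2 }
  .count()
// Today: n == 1000 and evaluated.value == 1000, because count() runs the
// full DAG. A count() that skipped the map would leave evaluated.value at 0.
~~~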

It's possible there might be room for something like a new smartCount API
or a new argument to count that allows it to avoid unnecessary
transformations.

-Sandy

On Sat, Mar 28, 2015 at 6:10 AM, Sean Owen  wrote:

> No, I'm not saying side effects change the count. But not executing
> the map() function at all certainly has an effect on the side effects
> of that function: the side effects which should take place never do. I
> am not sure that is something to be 'fixed'; it's a legitimate
> question.
>
> You can persist an RDD if you do not want to compute it twice.
>
> On Sat, Mar 28, 2015 at 1:05 PM, jimfcarroll 
> wrote:
> > Hi Sean,
> >
> > Thanks for the response.
> >
> > I can't imagine a case (though my imagination may be somewhat limited)
> where
> > even map side effects could change the number of elements in the
> resulting
> > map.
> >
> > I guess "count" wouldn't officially be an 'action' if it were implemented
> > this way. At least it wouldn't ALWAYS be one.
> >
> > My example was contrived. We're passing RDDs to functions. If that RDD
> is an
> > instance of my class, then its count() may take a shortcut. If I
> > map/zip/zipWithIndex/mapPartition/etc. first then I'm stuck with a call
> that
> > literally takes 100s to 1000s of times longer (seconds vs hours on some
> of
> > our datasets) and since my custom RDDs are immutable they cache the count
> > call so a second invocation is the cost of a method call's overhead.
> >
> > I could fix this in Spark if there's any interest in that change.
> Otherwise
> > I'll need to overload more RDD methods for my own purposes (like all of
> the
> > transformations). Of course, that will be more difficult because those
> > intermediate classes (like MappedRDD) are private, so I can't extend
> them.
> >
> > Jim
> >
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-count-tp11298p11302.html
> > Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 1.3.1 (RC2)

2015-04-08 Thread Sandy Ryza
+1

Built against Hadoop 2.6 and ran some jobs against a pseudo-distributed
YARN cluster.

-Sandy

On Wed, Apr 8, 2015 at 12:49 PM, Patrick Wendell  wrote:

> Oh I see - ah okay I'm guessing it was a transient build error and
> I'll get it posted ASAP.
>
> On Wed, Apr 8, 2015 at 3:41 PM, Denny Lee  wrote:
> > Oh, it appears the 2.4 bits without hive are there but not the 2.4 bits
> with
> > hive. Cool stuff on the 2.6.
> > On Wed, Apr 8, 2015 at 12:30 Patrick Wendell  wrote:
> >>
> >> Hey Denny,
> >>
> >> I beleive the 2.4 bits are there. The 2.6 bits I had done specially
> >> (we haven't merge that into our upstream build script). I'll do it
> >> again now for RC2.
> >>
> >> - Patrick
> >>
> >> On Wed, Apr 8, 2015 at 1:53 PM, Timothy Chen  wrote:
> >> > +1 Tested on 4 nodes Mesos cluster with fine-grain and coarse-grain
> >> > mode.
> >> >
> >> > Tim
> >> >
> >> > On Wed, Apr 8, 2015 at 9:32 AM, Denny Lee 
> wrote:
> >> >> The RC2 bits are lacking Hadoop 2.4 and Hadoop 2.6 - was that
> intended
> >> >> (they were included in RC1)?
> >> >>
> >> >>
> >> >> On Wed, Apr 8, 2015 at 9:01 AM Tom Graves
> >> >> 
> >> >> wrote:
> >> >>
> >> >>> +1. Tested spark on yarn against hadoop 2.6.
> >> >>> Tom
> >> >>>
> >> >>>
> >> >>>  On Wednesday, April 8, 2015 6:15 AM, Sean Owen
> >> >>> 
> >> >>> wrote:
> >> >>>
> >> >>>
> >> >>>  Still a +1 from me; same result (except that now of course the
> >> >>> UISeleniumSuite test does not fail)
> >> >>>
> >> >>> On Wed, Apr 8, 2015 at 1:46 AM, Patrick Wendell  >
> >> >>> wrote:
> >> >>> > Please vote on releasing the following candidate as Apache Spark
> >> >>> > version
> >> >>> 1.3.1!
> >> >>> >
> >> >>> > The tag to be voted on is v1.3.1-rc2 (commit 7c4473a):
> >> >>> > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=
> >> >>> 7c4473aa5a7f5de0323394aaedeefbf9738e8eb5
> >> >>> >
> >> >>> > The list of fixes present in this release can be found at:
> >> >>> > http://bit.ly/1C2nVPY
> >> >>> >
> >> >>> > The release files, including signatures, digests, etc. can be
> found
> >> >>> > at:
> >> >>> > http://people.apache.org/~pwendell/spark-1.3.1-rc2/
> >> >>> >
> >> >>> > Release artifacts are signed with the following key:
> >> >>> > https://people.apache.org/keys/committer/pwendell.asc
> >> >>> >
> >> >>> > The staging repository for this release can be found at:
> >> >>> >
> >> >>> >
> https://repository.apache.org/content/repositories/orgapachespark-1083/
> >> >>> >
> >> >>> > The documentation corresponding to this release can be found at:
> >> >>> > http://people.apache.org/~pwendell/spark-1.3.1-rc2-docs/
> >> >>> >
> >> >>> > The patches on top of RC1 are:
> >> >>> >
> >> >>> > [SPARK-6737] Fix memory leak in OutputCommitCoordinator
> >> >>> > https://github.com/apache/spark/pull/5397
> >> >>> >
> >> >>> > [SPARK-6636] Use public DNS hostname everywhere in spark_ec2.py
> >> >>> > https://github.com/apache/spark/pull/5302
> >> >>> >
> >> >>> > [SPARK-6205] [CORE] UISeleniumSuite fails for Hadoop 2.x test with
> >> >>> > NoClassDefFoundError
> >> >>> > https://github.com/apache/spark/pull/4933
> >> >>> >
> >> >>> > Please vote on releasing this package as Apache Spark 1.3.1!
> >> >>> >
> >> >>> > The vote is open until Saturday, April 11, at 07:00 UTC and passes
> >> >>> > if a majority of at least 3 +1 PMC votes are cast.
> >> >>> >
> >> >>> > [ ] +1 Release this package as Apache Spark 1.3.1
> >> >>> > [ ] -1 Do not release this package because ...
> >> >>> >
> >> >>> > To learn more about Apache Spark, please see
> >> >>> > http://spark.apache.org/
> >> >>> >
> >> >>> >
> >> >>> >
> -
> >> >>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >> >>> > For additional commands, e-mail: dev-h...@spark.apache.org
> >> >>> >
> >> >>>
> >> >>>
> -
> >> >>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >> >>> For additional commands, e-mail: dev-h...@spark.apache.org
> >> >>>
> >> >>>
> >> >>>
> >> >>>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Using memory mapped file for shuffle

2015-04-14 Thread Sandy Ryza
Hi Kannan,

Both in MapReduce and Spark, the amount of shuffle data a task produces can
exceed the task's memory without risk of OOM.

-Sandy

On Tue, Apr 14, 2015 at 6:47 AM, Imran Rashid  wrote:

> That limit doesn't have anything to do with the amount of available
> memory.  Its just a tuning parameter, as one version is more efficient for
> smaller files, the other is better for bigger files.  I suppose the comment
> is a little better in FileSegmentManagedBuffer:
>
>
> https://github.com/apache/spark/blob/master/network/common/src/main/java/org/apache/spark/network/buffer/FileSegmentManagedBuffer.java#L62
>
> On Tue, Apr 14, 2015 at 12:01 AM, Kannan Rajah 
> wrote:
>
> > DiskStore.getBytes uses memory mapped files if the length is more than a
> > configured limit. This code path is used during map side shuffle in
> > ExternalSorter. I want to know if its possible for the length to exceed
> the
> > limit in the case of shuffle. The reason I ask is in the case of Hadoop,
> > each map task is supposed to produce only data that can fit within the
> > task's configured max memory. Otherwise it will result in OOM. Is the
> > behavior same in Spark or the size of data generated by a map task can
> > exceed what can be fitted in memory.
> >
> >   if (length < minMemoryMapBytes) {
> > val buf = ByteBuffer.allocate(length.toInt)
> > 
> >   } else {
> > Some(channel.map(MapMode.READ_ONLY, offset, length))
> >   }
> >
> > --
> > Kannan
> >
>


Re: Should we let everyone set Assignee?

2015-04-22 Thread Sandy Ryza
I think one of the benefits of assignee fields that I've seen in other
projects is their potential to coordinate and prevent duplicate work.  It's
really frustrating to put a lot of work into a patch and then find out that
someone has been doing the same.  It's helpful for the project etiquette to
include a way to signal to others that you are working or intend to work on
a patch.  Obviously there are limits to how long someone should be able to
hold on to a JIRA without making progress on it, but a signal is still
useful.  Historically, in other projects, the assignee field serves as this
signal.  If we don't want to use the assignee field for this, I think it's
important to have some alternative, even if it's just encouraging
contributors to comment "I'm planning to work on this" on JIRA.

-Sandy



On Wed, Apr 22, 2015 at 1:30 PM, Ganelin, Ilya 
wrote:

> As a contributor, I've never felt shut out from the Spark community, nor
> have I seen any examples of territorial behavior. A few times I've
> expressed interest in more challenging work and the response I received
> was generally "go ahead and give it a shot, just understand that this is
> sensitive code so we may end up modifying the PR substantially." Honestly,
> that seems fine, and in general, I think it's completely fair to go with
> the PR model - e.g. If a JIRA has an open PR then it's an active effort,
> otherwise it's fair game unless otherwise stated. At the end of the day,
> it's about moving the project forward and the only way to do that is to
> have actual code in the pipes - speculation and intent don't really help,
> and there's nothing preventing an interested party from submitting a PR
> against an issue.
>
> Thank you,
> Ilya Ganelin
>
>
>
>
>
>
> On 4/22/15, 1:25 PM, "Mark Hamstra"  wrote:
>
> >Agreed.  The Spark project and community that Vinod describes do not
> >resemble the ones with which I am familiar.
> >
> >On Wed, Apr 22, 2015 at 1:20 PM, Patrick Wendell 
> >wrote:
> >
> >> Hi Vinod,
> >>
> >> Thanks for you thoughts - However, I do not agree with your sentiment
> >> and implications. Spark is broadly quite an inclusive project and we
> >> spend a lot of effort culturally to help make newcomers feel welcome.
> >>
> >> - Patrick
> >>
> >> On Wed, Apr 22, 2015 at 1:11 PM, Vinod Kumar Vavilapalli
> >>  wrote:
> >> > Actually what this community got away with is pretty much an
> >> anti-pattern compared to every other Apache project I have seen. And
> >>may I
> >> say in a not so Apache way.
> >> >
> >> > Waiting for a committer to assign a patch to someone leaves it as a
> >> privilege to a committer. Not alluding to anything fishy in practice,
> >>but
> >> this also leaves a lot of open ground for self-interest. Committers
> >> defining notions of good fit / level of experience do not work, highly
> >> subjective and lead to group control.
> >> >
> >> > In terms of semantics, here is what most other projects (dare I say
> >> every Apache project?) that I have seen do
> >> >  - A new contributor comes in who is not yet added to the JIRA
> >>project.
> >> He/she requests one of the project's JIRA admins to add him/her.
> >> >  - After that, he or she is free to assign tickets to themselves.
> >> >  - What this means
> >> > -- Assigning a ticket to oneself is a signal to the rest of the
> >> community that he/she is actively working on the said patch.
> >> > -- If multiple contributors want to work on the same patch, it
> >>needs
> >> to resolved amicably through open communication. On JIRA, or on mailing
> >> lists. Not by the whim of a committer.
> >> >  - Common issues
> >> > -- Land grabbing: Other contributors can nudge him/her in case of
> >> inactivity and take them over. Again, amicably instead of a committer
> >> making subjective decisions.
> >> > -- Progress stalling: One contributor assigns the ticket to
> >> himself/herself is actively debating but with no real code/docs
> >> contribution or with any real intention of making progress. Here
> >>workable,
> >> reviewable code for review usually wins.
> >> >
> >> > Assigning patches is not a privilege. Contributors at Apache are a
> >>bunch
> >> of volunteers, the PMC should let volunteers contribute as they see
> >>fit. We
> >> do not assign work at Apache.
> >> >
> >> > +Vinod
> >> >
> >> > On Apr 22, 2015, at 12:32 PM, Patrick Wendell 
> >> wrote:
> >> >
> >> >> One over arching issue is that it's pretty unclear what "Assigned to
> >> >> X" in JIAR means from a process perspective. Personally I actually
> >> >> feel it's better for this to be more historical - i.e. who ended up
> >> >> submitting a patch for this feature that was merged - rather than
> >> >> creating an exclusive reservation for a particular user to work on
> >> >> something.
> >> >>
> >> >> If an issue is "assigned" to person X, but some other person Y
> >>submits
> >> >> a great patch for it, I think we have some obligation to Spark users
> >> >> and to the community to merge the 

Re: Design docs: consolidation and discoverability

2015-04-24 Thread Sandy Ryza
I think there are maybe two separate things we're talking about?

1. Design discussions and in-progress design docs.

My two cents are that JIRA is the best place for this.  It allows tracking
the progression of a design across multiple PRs and contributors.  A piece
of useful feedback that I've gotten in the past is to make design docs
immutable.  When updating them in response to feedback, post a new version
rather than editing the existing one.  This enables tracking the history of
a design and makes it possible to read comments about previous designs in
context.  Otherwise it's really difficult to understand why particular
approaches were chosen or abandoned.

2. Completed design docs for features that we've implemented.

Perhaps less essential to project progress, but it would be really lovely
to have a central repository for all the project's design docs.  If anyone
wants to step up to maintain it, it would be cool to have a wiki page with
links to all the final design docs posted on JIRA.

-Sandy

On Fri, Apr 24, 2015 at 12:01 PM, Punyashloka Biswal  wrote:

> The Gradle dev team keep their design documents  *checked into* their Git
> repository -- see
>
> https://github.com/gradle/gradle/blob/master/design-docs/build-comparison.md
> for example. The advantages I see to their approach are:
>
>- design docs stay on ASF property (since Github is synced to the
>Apache-run Git repository)
>- design docs have a lifetime across PRs, but can still be modified and
>commented on through the mechanism of PRs
>- keeping a central location helps people to find good role models and
>converge on conventions
>
> Sean, I find it hard to use the central Jira as a jumping-off point for
> understanding ongoing design work because a tiny fraction of the tickets
> actually relate to design docs, and it's not easy from the outside to
> figure out which ones are relevant.
>
> Punya
>
> On Fri, Apr 24, 2015 at 2:49 PM Sean Owen  wrote:
>
> > I think it's OK to have design discussions on github, as emails go to
> > ASF lists. After all, loads of PR discussions happen there. It's easy
> > for anyone to follow.
> >
> > I also would rather just discuss on Github, except for all that noise.
> >
> > It's not great to put discussions in something like Google Docs
> > actually; the resulting doc needs to be pasted back to JIRA promptly
> > if so. I suppose it's still better than a private conversation or not
> > talking at all, but the principle is that one should be able to access
> > any substantive decision or conversation by being tuned in to only the
> > project systems of record -- mailing list, JIRA.
> >
> >
> >
> > On Fri, Apr 24, 2015 at 2:30 PM, Reynold Xin 
> wrote:
> > > I'd love to see more design discussions consolidated in a single place
> as
> > > well. That said, there are many practical challenges to overcome. Some
> of
> > > them are out of our control:
> > >
> > > 1. For large features, it is fairly common to open a PR for discussion,
> > > close the PR taking some feedback into account, and reopen another one.
> > You
> > > sort of lose the discussions that way.
> > >
> > > 2. With the way Jenkins is setup currently, Jenkins testing introduces
> a
> > lot
> > > of noise to GitHub pull requests, making it hard to differentiate
> > legitimate
> > > comments from noise. This is unfortunately due to the fact that ASF
> won't
> > > allow our Jenkins bot to have API privilege to post messages.
> > >
> > > 3. The Apache Way is that all development discussions need to happen on
> > ASF
> > > property, i.e. dev lists and JIRA. As a result, technically we are not
> > > allowed to have development discussions on GitHub.
> > >
> > >
> > > On Fri, Apr 24, 2015 at 7:09 AM, Cody Koeninger 
> > wrote:
> > >>
> > >> My 2 cents - I'd rather see design docs in github pull requests (using
> > >> plain text / markdown).  That doesn't require changing access or
> adding
> > >> people, and github PRs already allow for conversation / email
> > >> notifications.
> > >>
> > >> Conversation is already split between jira and github PRs.  Having a
> > third
> > >> stream of conversation in Google Docs just leads to things being
> > ignored.
> > >>
> > >> On Fri, Apr 24, 2015 at 7:21 AM, Sean Owen 
> wrote:
> > >>
> > >> > That would require giving wiki access to everyone or manually adding
> > >> > people
> > >> > any time they make a doc.
> > >> >
> > >> > I don't see how this helps though. They're still docs on the
> internet
> > >> > and
> > >> > they're still linked from the central project JIRA, which is what
> you
> > >> > should follow.
> > >> >  On Apr 24, 2015 8:14 AM, "Punyashloka Biswal" <
> > punya.bis...@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > Dear Spark devs,
> > >> > >
> > >> > > Right now, design docs are stored on Google docs and linked from
> > >> > > tickets.
> > >> > > For someone new to the project, it's hard to figure out what
> > subjects
> > >> > > are
> > >> > > being discussed, what or

Re: Design docs: consolidation and discoverability

2015-04-27 Thread Sandy Ryza
'd
> >>>> pollute our git history a lot with random incremental design updates.
> >>>>
> >>>> The git history is used a lot by downstream packagers, us during our
> >>>> QA process, etc... we really try to keep it oriented around code
> >>>> patches:
> >>>>
> >>>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=shortlog
> >>>>
> >>>> Committing a polished design doc along with a feature, maybe that's
> >>>> something we could consider. But I still think JIRA is the best
> >>>> location for these docs, consistent with what most other ASF projects
> >>>> do that I know.
> >>>>
> >>>> On Fri, Apr 24, 2015 at 1:19 PM, Cody Koeninger 
> >>>> wrote:
> >>>>> Why can't pull requests be used for design docs in Git if people who
> >>>>> aren't
> >>>>> committers want to contribute changes (as opposed to just comments)?
> >>>>>
> >>>>> On Fri, Apr 24, 2015 at 2:57 PM, Sean Owen 
> wrote:
> >>>>>
> >>>>>> Only catch there is it requires commit access to the repo. We need a
> >>>>>> way for people who aren't committers to write and collaborate (for
> >>>>>> point #1)
> >>>>>>
> >>>>>> On Fri, Apr 24, 2015 at 3:56 PM, Punyashloka Biswal
> >>>>>>  wrote:
> >>>>>>> Sandy, doesn't keeping (in-progress) design docs in Git satisfy the
> >>>>>> history
> >>>>>>> requirement? Referring back to my Gradle example, it seems that
> >>>>>>>
> >>>>>>
> >>>>>>
> https://github.com/gradle/gradle/commits/master/design-docs/build-comparison.md
> >>>>>>> is a really good way to see why the design doc evolved the way it
> >>>>>>> did.
> >>>>>> When
> >>>>>>> keeping the doc in Jira (presumably as an attachment) it's not easy
> >>>>>>> to
> >>>>>> see
> >>>>>>> what changed between successive versions of the doc.
> >>>>>>>
> >>>>>>> Punya
> >>>>>>>
> >>>>>>> On Fri, Apr 24, 2015 at 3:53 PM Sandy Ryza <
> sandy.r...@cloudera.com>
> >>>>>> wrote:
> >>>>>>>>
> >>>>>>>> I think there are maybe two separate things we're talking about?
> >>>>>>>>
> >>>>>>>> 1. Design discussions and in-progress design docs.
> >>>>>>>>
> >>>>>>>> My two cents are that JIRA is the best place for this.  It allows
> >>>>>> tracking
> >>>>>>>> the progression of a design across multiple PRs and
> contributors.  A
> >>>>>> piece
> >>>>>>>> of useful feedback that I've gotten in the past is to make design
> >>>>>>>> docs
> >>>>>>>> immutable.  When updating them in response to feedback, post a new
> >>>>>> version
> >>>>>>>> rather than editing the existing one.  This enables tracking the
> >>>>>> history of
> >>>>>>>> a design and makes it possible to read comments about previous
> >>>>>>>> designs
> >>>>>> in
> >>>>>>>> context.  Otherwise it's really difficult to understand why
> >>>>>>>> particular
> >>>>>>>> approaches were chosen or abandoned.
> >>>>>>>>
> >>>>>>>> 2. Completed design docs for features that we've implemented.
> >>>>>>>>
> >>>>>>>> Perhaps less essential to project progress, but it would be really
> >>>>>> lovely
> >>>>>>>> to have a central repository to all the projects design doc.  If
> >>>>>>>> anyone
> >>>>>>>> wants to step up to maintain it, it would be cool to have a wiki
> >>>>>>>> page
> >>>>>> with
> >>>>>>>> links to all the final design docs posted on JIRA.
> >>>>>>>>
> >>>>>>
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Using memory mapped file for shuffle

2015-04-29 Thread Sandy Ryza
Spark currently doesn't allocate any memory off of the heap for shuffle
objects.  When the in-memory data gets too large, it will write it out to a
file, and then merge the spilled files later.

What exactly do you mean by store shuffle data in HDFS?

-Sandy

On Tue, Apr 14, 2015 at 10:15 AM, Kannan Rajah  wrote:

> Sandy,
> Can you clarify how it won't cause OOM? Is it anyway to related to memory
> being allocated outside the heap - native space? The reason I ask is that I
> have a use case to store shuffle data in HDFS. Since there is no notion of
> memory mapped files, I need to store it as a byte buffer. I want to make
> sure this will not cause OOM when the file size is large.
>
>
> --
> Kannan
>
> On Tue, Apr 14, 2015 at 9:07 AM, Sandy Ryza 
> wrote:
>
>> Hi Kannan,
>>
>> Both in MapReduce and Spark, the amount of shuffle data a task produces
>> can exceed the task's memory without risk of OOM.
>>
>> -Sandy
>>
>> On Tue, Apr 14, 2015 at 6:47 AM, Imran Rashid 
>> wrote:
>>
>>> That limit doesn't have anything to do with the amount of available
>>> memory.  Its just a tuning parameter, as one version is more efficient
>>> for
>>> smaller files, the other is better for bigger files.  I suppose the
>>> comment
>>> is a little better in FileSegmentManagedBuffer:
>>>
>>>
>>> https://github.com/apache/spark/blob/master/network/common/src/main/java/org/apache/spark/network/buffer/FileSegmentManagedBuffer.java#L62
>>>
>>> On Tue, Apr 14, 2015 at 12:01 AM, Kannan Rajah 
>>> wrote:
>>>
>>> > DiskStore.getBytes uses memory mapped files if the length is more than
>>> a
>>> > configured limit. This code path is used during map side shuffle in
>>> > ExternalSorter. I want to know if its possible for the length to
>>> exceed the
>>> > limit in the case of shuffle. The reason I ask is in the case of
>>> Hadoop,
>>> > each map task is supposed to produce only data that can fit within the
>>> > task's configured max memory. Otherwise it will result in OOM. Is the
>>> > behavior same in Spark or the size of data generated by a map task can
>>> > exceed what can be fitted in memory.
>>> >
>>> >   if (length < minMemoryMapBytes) {
>>> > val buf = ByteBuffer.allocate(length.toInt)
>>> > 
>>> >   } else {
>>> > Some(channel.map(MapMode.READ_ONLY, offset, length))
>>> >   }
>>> >
>>> > --
>>> > Kannan
>>> >
>>>
>>
>>
>


Re: Regarding KryoSerialization in Spark

2015-04-30 Thread Sandy Ryza
Hi Twinkle,

Registering the class makes it so that writeClass only writes out a couple
bytes, instead of a full String of the class name.
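
A minimal sketch of the registration step (the case classes below are
placeholders for whatever types actually end up inside your RDDs):

~~~
import org.apache.spark.SparkConf

// Placeholder types for illustration only.
case class MyKey(id: Long)
case class MyRecord(key: MyKey, payload: String)

// With registration, Kryo's writeClass emits a small numeric ID for these
// classes instead of their fully qualified name as a string.
val conf = new SparkConf()
  .setAppName("kryo-registration-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRecord], classOf[MyKey]))
~~~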

-Sandy

On Thu, Apr 30, 2015 at 4:13 AM, twinkle sachdeva <
twinkle.sachd...@gmail.com> wrote:

> Hi,
>
> As per the code, KryoSerialization uses the writeClassAndObject method, which
> internally calls the writeClass method, which will write the class of the
> object during serialization.
>
> As per the documentation on the tuning page of Spark, it says that registering
> the class will avoid that.
>
> Am I missing something, or is there some issue with the documentation???
>
> Thanks,
> Twinkle
>


Re: YARN mode startup takes too long (10+ secs)

2015-05-11 Thread Sandy Ryza
Wow, I hadn't noticed this, but 5 seconds is really long.  It's true that
it's configurable, but I think we need to provide a decent out-of-the-box
experience.  For comparison, the MapReduce equivalent is 1 second.

I filed https://issues.apache.org/jira/browse/SPARK-7533 for this.
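
For anyone who wants the workaround in the meantime, a small sketch of the
knob discussed below; 1000 ms mirrors the value tried in the thread and is
not a general recommendation (see the caveat about request storms on large
or multi-tenant clusters):

~~~
import org.apache.spark.{SparkConf, SparkContext}

// Lower the YARN AM's RM-polling interval for a small, single-tenant
// cluster; the thread below measured roughly 4 seconds saved at startup
// with this value.
val conf = new SparkConf()
  .setAppName("yarn-startup-test")
  .set("spark.yarn.scheduler.heartbeat.interval-ms", "1000")
val sc = new SparkContext(conf)
~~~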

-Sandy

On Mon, May 11, 2015 at 9:03 AM, Mridul Muralidharan 
wrote:

> For tiny/small clusters (particularly single tenet), you can set it to
> lower value.
> But for anything reasonably large or multi-tenet, the request storm
> can be bad if large enough number of applications start aggressively
> polling RM.
> That is why the interval is set to configurable.
>
> - Mridul
>
>
> On Mon, May 11, 2015 at 6:54 AM, Zoltán Zvara 
> wrote:
> > Isn't this issue something that should be improved? Based on the
> following
> > discussion, there are two places were YARN's heartbeat interval is
> > respected on job start-up, but do we really need to respect it on
> start-up?
> >
> > On Fri, May 8, 2015 at 12:14 PM Taeyun Kim 
> > wrote:
> >
> >> I think so.
> >>
> >> In fact, the flow is: allocator.allocateResources() -> sleep ->
> >> allocator.allocateResources() -> sleep …
> >>
> >> But I guess that on the first allocateResources() the allocation is not
> >> fulfilled. So sleep occurs.
> >>
> >>
> >>
> >> *From:* Zoltán Zvara [mailto:zoltan.zv...@gmail.com]
> >> *Sent:* Friday, May 08, 2015 4:25 PM
> >>
> >>
> >> *To:* Taeyun Kim; u...@spark.apache.org
> >> *Subject:* Re: YARN mode startup takes too long (10+ secs)
> >>
> >>
> >>
> >> So is this sleep occurs before allocating resources for the first few
> >> executors to start the job?
> >>
> >>
> >>
> >> On Fri, May 8, 2015 at 6:23 AM Taeyun Kim 
> >> wrote:
> >>
> >> I think I’ve found the (maybe partial, but major) reason.
> >>
> >>
> >>
> >> It’s between the following lines, (it’s newly captured, but essentially
> >> the same place that Zoltán Zvara picked:
> >>
> >>
> >>
> >> 15/05/08 11:36:32 INFO BlockManagerMaster: Registered BlockManager
> >>
> >> 15/05/08 11:36:38 INFO YarnClientSchedulerBackend: Registered executor:
> >> Actor[akka.tcp://sparkExecutor@cluster04
> :55237/user/Executor#-149550753]
> >> with ID 1
> >>
> >>
> >>
> >> When I read the logs on cluster side, the following lines were found:
> (the
> >> exact time is different with above line, but it’s the difference between
> >> machines)
> >>
> >>
> >>
> >> 15/05/08 11:36:23 INFO yarn.ApplicationMaster: Started progress reporter
> >> thread - sleep time : 5000
> >>
> >> 15/05/08 11:36:28 INFO impl.AMRMClientImpl: Received new token for :
> >> cluster04:45454
> >>
> >>
> >>
> >> It seemed that Spark deliberately sleeps 5 secs.
> >>
> >> I’ve read the Spark source code, and in
> >> org.apache.spark.deploy.yarn.ApplicationMaster.scala,
> launchReporterThread()
> >> had the code for that.
> >>
> >> It loops calling allocator.allocateResources() and Thread.sleep().
> >>
> >> For sleep, it reads the configuration variable
> >> spark.yarn.scheduler.heartbeat.interval-ms (the default value is 5000,
> >> which is 5 secs).
> >>
> >> According to the comment, “we want to be reasonably responsive without
> >> causing too many requests to RM”.
> >>
> >> So, unless YARN immediately fulfill the allocation request, it seems
> that
> >> 5 secs will be wasted.
> >>
> >>
> >>
> >> When I modified the configuration variable to 1000, it only waited for 1
> >> sec.
> >>
> >>
> >>
> >> Here is the log lines after the change:
> >>
> >>
> >>
> >> 15/05/08 11:47:21 INFO yarn.ApplicationMaster: Started progress reporter
> >> thread - sleep time : 1000
> >>
> >> 15/05/08 11:47:22 INFO impl.AMRMClientImpl: Received new token for :
> >> cluster04:45454
> >>
> >>
> >>
> >> 4 secs saved.
> >>
> >> So, when one does not want to wait 5 secs, one can change the
> >> spark.yarn.scheduler.heartbeat.interval-ms.
> >>
> >> I hope that the additional overhead it incurs would be negligible.
> >>
> >>
> >>
> >>
> >>
> >> *From:* Zoltán Zvara [mailto:zoltan.zv...@gmail.com]
> >> *Sent:* Thursday, May 07, 2015 10:05 PM
> >> *To:* Taeyun Kim; u...@spark.apache.org
> >> *Subject:* Re: YARN mode startup takes too long (10+ secs)
> >>
> >>
> >>
> >> Without considering everything, just a few hints:
> >>
> >> You are running on YARN. From 09:18:34 to 09:18:37 your application is
> in
> >> state ACCEPTED. There is a noticeable overhead introduced due to
> >> communicating with YARN's ResourceManager, NodeManager and given that
> the
> >> YARN scheduler needs time to make a decision. I guess somewhere
> >> from 09:18:38 to 09:18:43 your application JAR gets copied to another
> >> container requested by the Spark ApplicationMaster deployed on YARN's
> >> container 0. Deploying an executor needs further resource negotiations
> with
> >> the ResourceManager usually. Also, as I said, your JAR and Executor's
> code
> >> requires copying to the container's local directory - execution blocked
> >> until that is complete.
> >>
> >>
> >>
> >> On Thu, May 7, 2015 at 3:09 AM Ta

Re: [VOTE] Release Apache Spark 1.4.0 (RC3)

2015-05-31 Thread Sandy Ryza
+1 (non-binding)

Launched against a pseudo-distributed YARN cluster running Hadoop 2.6.0 and
ran some jobs.

-Sandy

On Sat, May 30, 2015 at 3:44 PM, Krishna Sankar  wrote:

> +1 (non-binding, of course)
>
> 1. Compiled OSX 10.10 (Yosemite) OK Total time: 17:07 min
>  mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
> -Dhadoop.version=2.6.0 -DskipTests
> 2. Tested pyspark, mllib - running as well as comparing results with 1.3.1
> 2.1. statistics (min,max,mean,Pearson,Spearman) OK
> 2.2. Linear/Ridge/Lasso Regression OK
> 2.3. Decision Tree, Naive Bayes OK
> 2.4. KMeans OK
>Center And Scale OK
> 2.5. RDD operations OK
>   State of the Union Texts - MapReduce, Filter,sortByKey (word count)
> 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
>Model evaluation/optimization (rank, numIter, lambda) with
> itertools OK
> 3. Scala - MLlib
> 3.1. statistics (min,max,mean,Pearson,Spearman) OK
> 3.2. LinearRegressionWithSGD OK
> 3.3. Decision Tree OK
> 3.4. KMeans OK
> 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
> 3.6. saveAsParquetFile OK
> 3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
> registerTempTable, sql OK
> 3.8. result = sqlContext.sql("SELECT
> OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
> JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
> 4.0. Spark SQL from Python OK
> 4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
>
> Cheers
> 
>
> On Fri, May 29, 2015 at 4:40 PM, Patrick Wendell 
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.4.0!
>>
>> The tag to be voted on is v1.4.0-rc3 (commit dd109a8):
>>
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=dd109a8746ec07c7c83995890fc2c0cd7a693730
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc3-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> [published as version: 1.4.0]
>> https://repository.apache.org/content/repositories/orgapachespark-1109/
>> [published as version: 1.4.0-rc3]
>> https://repository.apache.org/content/repositories/orgapachespark-1110/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc3-docs/
>>
>> Please vote on releasing this package as Apache Spark 1.4.0!
>>
>> The vote is open until Tuesday, June 02, at 00:32 UTC and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.4.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see
>> http://spark.apache.org/
>>
>> == What has changed since RC1 ==
>> Below is a list of bug fixes that went into this RC:
>> http://s.apache.org/vN
>>
>> == How can I help test this release? ==
>> If you are a Spark user, you can help us test this release by
>> taking a Spark 1.3 workload and running on this release candidate,
>> then reporting any regressions.
>>
>> == What justifies a -1 vote for this release? ==
>> This vote is happening towards the end of the 1.4 QA period,
>> so -1 votes should only occur for significant regressions from 1.3.1.
>> Bugs already present in 1.3.X, minor regressions, or bugs related
>> to new features will not block this release.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>


Re: [VOTE] Release Apache Spark 1.4.0 (RC4)

2015-06-05 Thread Sandy Ryza
+1 (non-binding)

Built from source and ran some jobs against a pseudo-distributed YARN
cluster.

-Sandy

On Fri, Jun 5, 2015 at 11:05 AM, Ram Sriharsha 
wrote:

> +1 , tested  with hadoop 2.6/ yarn on centos 6.5 after building  w/ -Pyarn
> -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver and ran a
> few SQL tests and the ML examples
>
> On Fri, Jun 5, 2015 at 10:55 AM, Hari Shreedharan <
> hshreedha...@cloudera.com> wrote:
>
>> +1. Build looks good, ran a couple apps on YARN
>>
>>
>> Thanks,
>> Hari
>>
>> On Fri, Jun 5, 2015 at 10:52 AM, Yin Huai  wrote:
>>
>>> Sean,
>>>
>>> Can you add "-Phive -Phive-thriftserver" and try those Hive tests?
>>>
>>> Thanks,
>>>
>>> Yin
>>>
>>> On Fri, Jun 5, 2015 at 5:19 AM, Sean Owen  wrote:
>>>
 Everything checks out again, and the tests pass for me on Ubuntu +
 Java 7 with '-Pyarn -Phadoop-2.6', except that I always get
 SparkSubmitSuite errors like ...

 - success sanity check *** FAILED ***
   java.lang.RuntimeException: [download failed:
 org.jboss.netty#netty;3.2.2.Final!netty.jar(bundle), download failed:
 commons-net#commons-net;3.1!commons-net.jar]
   at
 org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:978)
   at
 org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$3.apply(IsolatedClientLoader.scala:62)
   ...

 I also can't get hive tests to pass. Is anyone else seeing anything
 like this? if not I'll assume this is something specific to the env --
 or that I don't have the build invocation just right. It's puzzling
 since it's so consistent, but I presume others' tests pass and Jenkins
 does.


 On Wed, Jun 3, 2015 at 5:53 AM, Patrick Wendell 
 wrote:
 > Please vote on releasing the following candidate as Apache Spark
 version 1.4.0!
 >
 > The tag to be voted on is v1.4.0-rc3 (commit 22596c5):
 > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=
 > 22596c534a38cfdda91aef18aa9037ab101e4251
 >
 > The release files, including signatures, digests, etc. can be found
 at:
 >
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-bin/
 >
 > Release artifacts are signed with the following key:
 > https://people.apache.org/keys/committer/pwendell.asc
 >
 > The staging repository for this release can be found at:
 > [published as version: 1.4.0]
 >
 https://repository.apache.org/content/repositories/orgapachespark-/
 > [published as version: 1.4.0-rc4]
 >
 https://repository.apache.org/content/repositories/orgapachespark-1112/
 >
 > The documentation corresponding to this release can be found at:
 >
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-docs/
 >
 > Please vote on releasing this package as Apache Spark 1.4.0!
 >
 > The vote is open until Saturday, June 06, at 05:00 UTC and passes
 > if a majority of at least 3 +1 PMC votes are cast.
 >
 > [ ] +1 Release this package as Apache Spark 1.4.0
 > [ ] -1 Do not release this package because ...
 >
 > To learn more about Apache Spark, please see
 > http://spark.apache.org/
 >
 > == What has changed since RC3 ==
 > In addition to many smaller fixes, three blocker issues were fixed:
 > 4940630 [SPARK-8020] [SQL] Spark SQL conf in spark-defaults.conf make
 > metadataHive get constructed too early
 > 6b0f615 [SPARK-8038] [SQL] [PYSPARK] fix Column.when() and otherwise()
 > 78a6723 [SPARK-7978] [SQL] [PYSPARK] DecimalType should not be
 singleton
 >
 > == How can I help test this release? ==
 > If you are a Spark user, you can help us test this release by
 > taking a Spark 1.3 workload and running on this release candidate,
 > then reporting any regressions.
 >
 > == What justifies a -1 vote for this release? ==
 > This vote is happening towards the end of the 1.4 QA period,
 > so -1 votes should only occur for significant regressions from 1.3.1.
 > Bugs already present in 1.3.X, minor regressions, or bugs related
 > to new features will not block this release.
 >
 > -
 > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 > For additional commands, e-mail: dev-h...@spark.apache.org
 >

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org


>>>
>>
>


Re: [SparkScore] Performance portal for Apache Spark

2015-06-17 Thread Sandy Ryza
This looks really awesome.

On Tue, Jun 16, 2015 at 10:27 AM, Huang, Jie  wrote:

>  Hi All
>
> We are happy to announce the Performance Portal for Apache Spark:
> http://01org.github.io/sparkscore/ !
>
> The Performance Portal for Apache Spark provides performance data on
> Spark upstream to the community to help identify issues, better understand
> performance differentials between versions, and help Spark customers get
> across the finish line faster. The Performance Portal generates two
> reports: a regular (weekly) report and a release-based regression test
> report. We are currently using two benchmark suites, HiBench (
> http://github.com/intel-bigdata/HiBench) and Spark-perf (
> https://github.com/databricks/spark-perf ). We welcome and look forward
> to your suggestions and feedback. More information and details are
> provided below.
> About the Performance Portal for Apache Spark
>
> Our goal is to work with the Apache Spark community to further enhance the
> scalability and reliability of Apache Spark. The data available on this
> site allows community members and potential Spark customers to closely
> track the performance trend of Apache Spark. Ultimately, we hope that this
> project will help the community fix performance issues quickly, thus
> providing better Apache Spark code to end customers. The current workloads
> used in the benchmarking include HiBench (a benchmark suite from Intel for
> evaluating big data frameworks such as Hadoop MR and Spark) and Spark-perf
> (a performance testing framework for Apache Spark from Databricks).
> Additional benchmarks will be added as they become available.
> Description
> --
>
> Each data point represents a workload's runtime as a percentage relative
> to the previous week. Different lines represent different workloads running
> in Spark yarn-client mode.
> Hardware
> --
>
> CPU type: Intel® Xeon® CPU E5-2697 v2 @ 2.70GHz
> Memory: 128GB
> NIC: 10GbE
> Disk(s): 8 x 1TB SATA HDD
> Software
> --
>
> Java version: 1.8.0_25
> Hadoop version: 2.5.0-CDH5.3.2
> HiBench version: 4.0
> Spark on yarn-client mode
> Cluster
> --
>
> 1 node for Master
> 10 nodes for Slave
> Summary
>
> The lower the percentage, the better the performance.
>  --
>
> Group        ww19    ww20    ww22    ww23    ww24    ww25
> HiBench      9.1%    6.6%    6.0%    7.9%    -6.5%   -3.1%
> spark-perf   4.1%    4.4%    -1.8%   4.1%    -4.7%   -4.6%
>
>
> *Y-Axis: normalized completion time; X-Axis: work week.*
>
> *The commit number can be found in the result table. The performance
> score for each workload is normalized against the elapsed time of the 1.2
> release. The lower the better.*
> HiBench
> --
>
> JOB           ww19       ww20       ww22       ww23       ww24       ww25
> commit        489700c8   8e3822a0   530efe3e   90c60692   db81b9d8   4eb48ed1
> sleep         %          %          -2.1%      -2.9%      -4.1%      12.8%
> wordcount     17.6%      11.4%      8.0%       8.3%       -18.6%     -10.9%
> kmeans        92.1%      61.5%      72.1%      92.9%      86.9%      95.8%
> scan          -4.9%      -7.2%      %          -1.1%      -25.5%     -21.0%
> bayes         -24.3%     -20.1%     -18.3%     -11.1%     -29.7%     -31.3%
> aggregation   5.6%       10.5%      %          9.2%       -15.3%     -15.0%
> join          4.5%       1.2%       %          1.0%       -12.7%     -13.9%
> sort          -3.3%      -0.5%      -11.9%     -12.5%     -17.5%     -17.3%
> pagerank      2.2%       3.2%       4.0%       2.9%       -11.4%     -13.0%
> terasort      -7.1%      -0.2%      -9.5%      -7.3%      -16.7%     -17.0%
>
> Comments: a value of "%" (null) means the workload did not run or failed
> during that week.
>
>
> *Y-Axis: normalized completion time; X-Axis: work week.*
>
> *The commit number can be found in the result table. The performance
> score for each workload is normalized against the elapsed time of the 1.2
> release. The lower the better.*
> spark-perf
> --
>
> JOB            ww19       ww20       ww22       ww23       ww24       ww25
> commit         489700c8   8e3822a0   530efe3e   90c60692   db81b9d8   4eb48ed1
> agg            13.2%      7.0%       %          18.3%      5.2%       2.5%
> agg-int        16.4%      21.2%      %          9.6%       4.0%       8.2%
> agg-naive      4.3%       -2.4%      %          -0.8%      -6.7%      -6.8%
> scheduling     -6.1%      -8.9%      -14.5%     -2.1%      -6.4%      -6.5%
> count-filter   4.1%       1.0%       6.6%       6.8%       -10.2%     -10.4%
> count          4.8%       4.6%       6.7%       8.0%       -7.3%      -7.0%
> sort           -8.1%      -2.5%      -6.2%      -7.0%      -14.6%     -14.4%
> sort-int       4.5%       15.3%      -1.6%      -0.1%      -1.5%      -2.2%
>
> Comments: a value of "%" (null) means the workload did not run or failed
> during that week.
>
>
> *Y-Axis: normalized completion time; X-Axis: work week.*

Re: Increase partition count (repartition) without shuffle

2015-06-18 Thread Sandy Ryza
Hi Alexander,

There is currently no way to create an RDD with more partitions than its
parent RDD without causing a shuffle.

However, if the files are splittable, you can set the Hadoop configurations
that control split size to something smaller so that the HadoopRDD ends up
with more partitions.
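
For example, something along these lines (a sketch -- the path, sizes, and
partition count here are hypothetical, and this assumes the input is read
through a splittable Hadoop input format):

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("more-partitions"))

  // Cap splits at ~64 MB for new-API Hadoop input formats, so the resulting
  // RDD ends up with more partitions and no shuffle is needed.
  sc.hadoopConfiguration.setLong(
    "mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024)

  // For textFile, the minPartitions hint has a similar effect.
  val rdd = sc.textFile("hdfs:///data/input", 2000)
  println(rdd.partitions.length)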

-Sandy

On Thu, Jun 18, 2015 at 2:26 PM, Ulanov, Alexander 
wrote:

>  Hi,
>
>
>
> Is there a way to increase the number of partitions of an RDD without
> causing a shuffle? I've found the JIRA issue
> https://issues.apache.org/jira/browse/SPARK-5997, however there is no
> implementation yet.
>
>
>
> Just in case: I am reading data from ~300 big binary files, which results
> in 300 partitions; then I need to sort my RDD, but it crashes with an
> out-of-memory exception. If I change the number of partitions to 2000, the
> sort works OK, but the repartition itself takes a lot of time due to the
> shuffle.
>
>
>
> Best regards, Alexander
>


Re: External Shuffle service over yarn

2015-06-25 Thread Sandy Ryza
Hi Yash,

One of the main advantages is that, if you turn dynamic allocation on, and
executors are discarded, your application is still able to get at the
shuffle data that they wrote out.
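
As a sketch, the two settings usually go together (property names as in the
Spark configuration docs; the values below are only an example):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.shuffle.service.enabled", "true")      // keep shuffle files reachable
    .set("spark.dynamicAllocation.enabled", "true")    // let executors come and go
    .set("spark.dynamicAllocation.minExecutors", "1")
    .set("spark.dynamicAllocation.maxExecutors", "50")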

-Sandy

On Thu, Jun 25, 2015 at 11:08 PM, yash datta  wrote:

> Hi devs,
>
> Can someone point out if there are any distinct advantages of using the
> external shuffle service over YARN (it runs on the node manager as an
> auxiliary service,
>
> https://issues.apache.org/jira/browse/SPARK-3797) instead of the default
> execution in the executor containers?
>
> Please also mention if you have seen any differences having used both
> ways?
>
> Thanks and Best Regards
> Yash
>
> --
> When events unfold with calm and ease
> When the winds that blow are merely breeze
> Learn from nature, from birds and bees
> Live your life in love, and let joy not cease.
>


Re: How to Read Excel file in Spark 1.4

2015-07-13 Thread Sandy Ryza
Hi Su,

Spark can't read Excel files directly.  Your best bet is probably to
export the contents as a CSV and use the "csvFile" API.
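
As a rough sketch, assuming the sheet has been exported to CSV, the
com.databricks:spark-csv package is on the classpath, and sqlContext is a
HiveContext (the path and table name below are hypothetical):

  val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .load("hdfs:///data/exported.csv")

  // Save into Hive as a table.
  df.write.saveAsTable("excel_import")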

-Sandy

On Mon, Jul 13, 2015 at 9:22 AM, spark user 
wrote:

> Hi
>
> I need your help to save Excel data in Hive.
>
>
>1. How to read an Excel file in Spark using Spark 1.4
>2. How to save it using a DataFrame
>
> If you have some sample code, please send it.
>
> Thanks
>
> su
>


Re: [discuss] Removing individual commit messages from the squash commit message

2015-07-19 Thread Sandy Ryza
+1

On Sat, Jul 18, 2015 at 4:00 PM, Mridul Muralidharan 
wrote:

> Thanks for detailing, definitely sounds better.
> +1
>
> Regards
> Mridul
>
> On Saturday, July 18, 2015, Reynold Xin  wrote:
>
>> A single commit message consisting of:
>>
>> 1. Pull request title (which includes JIRA number and component, e.g.
>> [SPARK-1234][MLlib])
>>
>> 2. Pull request description
>>
>> 3. List of authors contributing to the patch
>>
>> The main thing that changes is 3: we used to also include the individual
>> commits to the pull request branch that are squashed.
>>
>>
>> On Sat, Jul 18, 2015 at 3:45 PM, Mridul Muralidharan 
>> wrote:
>>
>>> Just to clarify, the proposal is to have a single commit msg giving the
>>> jira and pr id?
>>> That sounds like a good change to have.
>>>
>>> Regards
>>> Mridul
>>>
>>>
>>> On Saturday, July 18, 2015, Reynold Xin  wrote:
>>>
 I took a look at the commit messages in git log -- it looks like the
 individual commit messages are not that useful to include, but do make the
 commit messages more verbose. They are usually just a bunch of extremely
 concise descriptions of "bug fixes", "merges", etc:

 cb3f12d [xxx] add whitespace
 6d874a6 [xxx] support pyspark for yarn-client

 89b01f5 [yyy] Update the unit test to add more cases
 275d252 [yyy] Address the comments
 7cc146d [yyy] Address the comments
 2624723 [yyy] Fix rebase conflict
 45befaa [yyy] Update the unit test
 bbc1c9c [yyy] Fix checkpointing doesn't retain driver port issue


 Anybody against removing those from the merge script so the log looks
 cleaner? If nobody feels strongly about this, we can just create a JIRA to
 remove them, and only keep the author names.


>>


Re: Compact RDD representation

2015-07-19 Thread Sandy Ryza
Edit: the first line should read:

  val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _)

On Sun, Jul 19, 2015 at 11:02 AM, Sandy Ryza 
wrote:

> This functionality already basically exists in Spark.  To create the
> "grouped RDD", one can run:
>
>   val groupedRdd = rdd.reduceByKey(_ + _)
>
> To get it back into the original form:
>
>   groupedRdd.flatMap(x => List.fill(x._2)(x._1))
>
> -Sandy
>
> -Sandy
>
> On Sun, Jul 19, 2015 at 10:40 AM, Сергей Лихоман 
> wrote:
>
>> Hi,
>>
>> I am looking for a suitable issue for a Master's degree project (it sounds
>> like scalability problems and improvements for Spark Streaming), and it
>> seems like the introduction of a grouped RDD (for example: don't store
>> "Spark", "Spark", "Spark"; instead store ("Spark", 3)) could:
>>
>> 1. Reduce the memory needed for the RDD (roughly, the memory used will be
>> the percentage of unique messages).
>> 2. Improve performance (no need to apply a function several times to the
>> same message).
>>
>> Can I create a ticket and introduce an API for grouped RDDs? Does it make
>> sense? I would also very much appreciate criticism and ideas.
>>
>
>


Re: Compact RDD representation

2015-07-19 Thread Sandy Ryza
This functionality already basically exists in Spark.  To create the
"grouped RDD", one can run:

  val groupedRdd = rdd.reduceByKey(_ + _)

To get it back into the original form:

  groupedRdd.flatMap(x => List.fill(x._2)(x._1))

-Sandy

-Sandy

On Sun, Jul 19, 2015 at 10:40 AM, Сергей Лихоман 
wrote:

> Hi,
>
> I am looking for a suitable issue for a Master's degree project (it sounds
> like scalability problems and improvements for Spark Streaming), and it
> seems like the introduction of a grouped RDD (for example: don't store
> "Spark", "Spark", "Spark"; instead store ("Spark", 3)) could:
>
> 1. Reduce the memory needed for the RDD (roughly, the memory used will be
> the percentage of unique messages).
> 2. Improve performance (no need to apply a function several times to the
> same message).
>
> Can I create a ticket and introduce an API for grouped RDDs? Does it make
> sense? I would also very much appreciate criticism and ideas.
>


Re: Compact RDD representation

2015-07-19 Thread Sandy Ryza
The user gets to choose what they want to reside in memory.  If they call
rdd.cache() on the original RDD, it will be in memory.  If they call
rdd.cache() on the compact RDD, it will be in memory.  If cache() is called
on both, they'll both be in memory.

-Sandy

On Sun, Jul 19, 2015 at 11:09 AM, Сергей Лихоман 
wrote:

> Thanks for the answer! Could you please answer one more question? Will we
> have both the original rdd and the grouped rdd in memory at the same time?
>
> 2015-07-19 21:04 GMT+03:00 Sandy Ryza :
>
>> Edit: the first line should read:
>>
>>   val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _)
>>
>> On Sun, Jul 19, 2015 at 11:02 AM, Sandy Ryza 
>> wrote:
>>
>>> This functionality already basically exists in Spark.  To create the
>>> "grouped RDD", one can run:
>>>
>>>   val groupedRdd = rdd.reduceByKey(_ + _)
>>>
>>> To get it back into the original form:
>>>
>>>   groupedRdd.flatMap(x => List.fill(x._2)(x._1))
>>>
>>> -Sandy
>>>
>>> -Sandy
>>>
>>> On Sun, Jul 19, 2015 at 10:40 AM, Сергей Лихоман 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am looking for a suitable issue for a Master's degree project (it
>>>> sounds like scalability problems and improvements for Spark Streaming),
>>>> and it seems like the introduction of a grouped RDD (for example: don't
>>>> store "Spark", "Spark", "Spark"; instead store ("Spark", 3)) could:
>>>>
>>>> 1. Reduce the memory needed for the RDD (roughly, the memory used will
>>>> be the percentage of unique messages).
>>>> 2. Improve performance (no need to apply a function several times to the
>>>> same message).
>>>>
>>>> Can I create a ticket and introduce an API for grouped RDDs? Does it
>>>> make sense? I would also very much appreciate criticism and ideas.
>>>>
>>>
>>>
>>
>


Re: Compact RDD representation

2015-07-19 Thread Sandy Ryza
In the Spark model, constructing an RDD does not mean storing all its
contents in memory.  Rather, an RDD is a description of a dataset that
enables iterating over its contents, record by record (in parallel).  The
only time the full contents of an RDD are stored in memory is when a user
explicitly calls "cache" or "persist" on it.

-Sandy

On Sun, Jul 19, 2015 at 11:41 AM, Сергей Лихоман 
wrote:

> Sorry, maybe I am saying something completely wrong... We have a stream,
> and we digitize it to create an rdd. The rdd in this case will be just an
> array of Any. Then we apply a transformation to create a new grouped rdd,
> and GC should remove the original rdd from memory (if we don't persist it).
> Will we have a GC step in val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _)?
> My suggestion is to remove the creation and reclaiming of the unneeded rdd
> and create the already-grouped one directly.
>
> 2015-07-19 21:26 GMT+03:00 Sandy Ryza :
>
>> The user gets to choose what they want to reside in memory.  If they call
>> rdd.cache() on the original RDD, it will be in memory.  If they call
>> rdd.cache() on the compact RDD, it will be in memory.  If cache() is called
>> on both, they'll both be in memory.
>>
>> -Sandy
>>
>> On Sun, Jul 19, 2015 at 11:09 AM, Сергей Лихоман 
>> wrote:
>>
>>> Thanks for the answer! Could you please answer one more question? Will
>>> we have both the original rdd and the grouped rdd in memory at the same
>>> time?
>>>
>>> 2015-07-19 21:04 GMT+03:00 Sandy Ryza :
>>>
>>>> Edit: the first line should read:
>>>>
>>>>   val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _)
>>>>
>>>> On Sun, Jul 19, 2015 at 11:02 AM, Sandy Ryza 
>>>> wrote:
>>>>
>>>>> This functionality already basically exists in Spark.  To create the
>>>>> "grouped RDD", one can run:
>>>>>
>>>>>   val groupedRdd = rdd.reduceByKey(_ + _)
>>>>>
>>>>> To get it back into the original form:
>>>>>
>>>>>   groupedRdd.flatMap(x => List.fill(x._2)(x._1))
>>>>>
>>>>> -Sandy
>>>>>
>>>>> -Sandy
>>>>>
>>>>> On Sun, Jul 19, 2015 at 10:40 AM, Сергей Лихоман <
>>>>> sergliho...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am looking for a suitable issue for a Master's degree project (it
>>>>>> sounds like scalability problems and improvements for Spark Streaming),
>>>>>> and it seems like the introduction of a grouped RDD (for example: don't
>>>>>> store "Spark", "Spark", "Spark"; instead store ("Spark", 3)) could:
>>>>>>
>>>>>> 1. Reduce the memory needed for the RDD (roughly, the memory used will
>>>>>> be the percentage of unique messages).
>>>>>> 2. Improve performance (no need to apply a function several times to
>>>>>> the same message).
>>>>>>
>>>>>> Can I create a ticket and introduce an API for grouped RDDs? Does it
>>>>>> make sense? I would also very much appreciate criticism and ideas.
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Developer API & plugins for Hive & Hadoop ?

2015-08-13 Thread Sandy Ryza
Hi Tom,

Not sure how much this helps, but are you aware that you can build Spark
with the -Phadoop-provided profile to avoid packaging Hadoop dependencies
in the assembly jar?
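
For example (a sketch -- the exact profiles and versions depend on your
environment):

  mvn -Pyarn -Phadoop-provided -Phive -Phive-thriftserver \
    -Dhadoop.version=2.6.0 -DskipTests clean package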

-Sandy

On Fri, Aug 14, 2015 at 6:08 AM, Thomas Dudziak  wrote:

> Unfortunately it doesn't because our version of Hive has different syntax
> elements and thus I need to patch them in (and a few other minor things).
> It would be great if there were a developer API at a somewhat higher
> level.
>
> On Thu, Aug 13, 2015 at 2:19 PM, Reynold Xin  wrote:
>
>> I believe for Hive, there is already a client interface that can be used
>> to build clients for different Hive metastores. That should also work for
>> your heavily forked one.
>>
>> For Hadoop, it is definitely a bigger project to refactor. A good way to
>> start evaluating this is to list what needs to be changed. Maybe you can
>> start by telling us what you need to change for every upgrade? Feel free to
>> email me in private if this is sensitive and you don't want to share in a
>> public list.
>>
>>
>>
>>
>>
>>
>> On Thu, Aug 13, 2015 at 2:01 PM, Thomas Dudziak  wrote:
>>
>>> Hi,
>>>
>>> I have asked this before but didn't receive any comments, but with the
>>> impending release of 1.5 I wanted to bring this up again.
>>> Right now, Spark is very tightly coupled with OSS Hive & Hadoop which
>>> causes me a lot of work every time there is a new version because I don't
>>> run OSS Hive/Hadoop versions (and before you ask, I can't).
>>>
>>> My question is, does Spark need to be so tightly coupled with these two
>>> ? Or put differently, would it be possible to introduce a developer API
>>> between Spark (up and including e.g. SqlContext) and Hadoop (for HDFS bits)
>>> and Hive (e.g. HiveContext and beyond) and move the actual Hadoop & Hive
>>> dependencies into plugins (e.g. separate maven modules)?
>>> This would allow me to easily maintain my own Hive/Hadoop-ish
>>> integration with our internal systems without ever having to touch Spark
>>> code.
>>> I expect this could also allow for instance Hadoop vendors to provide
>>> their own, more optimized implementations without Spark having to know
>>> about them.
>>>
>>> cheers,
>>> Tom
>>>
>>
>>
>


Re: [VOTE] Release Apache Spark 1.5.0 (RC1)

2015-08-24 Thread Sandy Ryza
Cool, thanks!

On Mon, Aug 24, 2015 at 2:07 PM, Reynold Xin  wrote:

> Nope --- I cut that last Friday but had an error. I will remove it and cut
> a new one.
>
>
> On Mon, Aug 24, 2015 at 2:06 PM, Sandy Ryza 
> wrote:
>
>> I see that there's an 1.5.0-rc2 tag in github now.  Is that the official
>> RC2 tag to start trying out?
>>
>> -Sandy
>>
>> On Mon, Aug 24, 2015 at 7:23 AM, Sean Owen  wrote:
>>
>>> PS Shixiong Zhu is correct that this one has to be fixed:
>>> https://issues.apache.org/jira/browse/SPARK-10168
>>>
>>> For example you can see assemblies like this are nearly empty:
>>>
>>> https://repository.apache.org/content/repositories/orgapachespark-1137/org/apache/spark/spark-streaming-flume-assembly_2.10/1.5.0-rc1/
>>>
>>> Just a publishing glitch but worth a few more eyes on.
>>>
>>> On Fri, Aug 21, 2015 at 5:28 PM, Sean Owen  wrote:
>>> > Signatures, license, etc. look good. I'm getting some fairly
>>> > consistent failures using Java 7 + Ubuntu 15 + "-Pyarn -Phive
>>> > -Phive-thriftserver -Phadoop-2.6" -- does anyone else see these? they
>>> > are likely just test problems, but worth asking. Stack traces are at
>>> > the end.
>>> >
>>> > There are currently 79 issues targeted for 1.5.0, of which 19 are
>>> > bugs, of which 1 is a blocker. (1032 have been resolved for 1.5.0.)
>>> > That's significantly better than at the last release. I presume a lot
>>> > of what's still targeted is not critical and can now be
>>> > untargeted/retargeted.
>>> >
>>> > It occurs to me that the flurry of planning that took place at the
>>> > start of the 1.5 QA cycle a few weeks ago was quite helpful, and is
>>> > the kind of thing that would be even more useful at the start of a
>>> > release cycle. So would be great to do this for 1.6 in a few weeks.
>>> > Indeed there are already 267 issues targeted for 1.6.0 -- a decent
>>> > roadmap already.
>>> >
>>> >
>>> > Test failures:
>>> >
>>> > Core
>>> >
>>> > - Unpersisting TorrentBroadcast on executors and driver in distributed
>>> > mode *** FAILED ***
>>> >   java.util.concurrent.TimeoutException: Can't find 2 executors before
>>> > 1 milliseconds elapsed
>>> >   at
>>> org.apache.spark.ui.jobs.JobProgressListener.waitUntilExecutorsUp(JobProgressListener.scala:561)
>>> >   at
>>> org.apache.spark.broadcast.BroadcastSuite.testUnpersistBroadcast(BroadcastSuite.scala:313)
>>> >   at org.apache.spark.broadcast.BroadcastSuite.org
>>> $apache$spark$broadcast$BroadcastSuite$$testUnpersistTorrentBroadcast(BroadcastSuite.scala:287)
>>> >   at
>>> org.apache.spark.broadcast.BroadcastSuite$$anonfun$16.apply$mcV$sp(BroadcastSuite.scala:165)
>>> >   at
>>> org.apache.spark.broadcast.BroadcastSuite$$anonfun$16.apply(BroadcastSuite.scala:165)
>>> >   at
>>> org.apache.spark.broadcast.BroadcastSuite$$anonfun$16.apply(BroadcastSuite.scala:165)
>>> >   at
>>> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>>> >   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>>> >   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>>> >   at org.scalatest.Transformer.apply(Transformer.scala:22)
>>> >   ...
>>> >
>>> > Streaming
>>> >
>>> > - stop slow receiver gracefully *** FAILED ***
>>> >   0 was not greater than 0 (StreamingContextSuite.scala:324)
>>> >
>>> > Kafka
>>> >
>>> > - offset recovery *** FAILED ***
>>> >   The code passed to eventually never returned normally. Attempted 191
>>> > times over 10.043196973 seconds. Last failure message:
>>> > strings.forall({
>>> > ((elem: Any) =>
>>> DirectKafkaStreamSuite.collectedData.contains(elem))
>>> >   }) was false. (DirectKafkaStreamSuite.scala:249)
>>> >
>>> > On Fri, Aug 21, 2015 at 5:37 AM, Reynold Xin 
>>> wrote:
>>> >> Please vote on releasing the following candidate as Apache Spark
>>> version
>>> >> 1.5.0!
>>> >>
>>> >> The vote is open until Monday, Aug 17, 2015 at 20:00 UTC and passes
>>> if a
>>> >> majority of at least 3 +1 PMC votes are cast.

Re: [VOTE] Release Apache Spark 1.5.0 (RC1)

2015-08-24 Thread Sandy Ryza
I see that there's an 1.5.0-rc2 tag in github now.  Is that the official
RC2 tag to start trying out?

-Sandy

On Mon, Aug 24, 2015 at 7:23 AM, Sean Owen  wrote:

> PS Shixiong Zhu is correct that this one has to be fixed:
> https://issues.apache.org/jira/browse/SPARK-10168
>
> For example you can see assemblies like this are nearly empty:
>
> https://repository.apache.org/content/repositories/orgapachespark-1137/org/apache/spark/spark-streaming-flume-assembly_2.10/1.5.0-rc1/
>
> Just a publishing glitch but worth a few more eyes on.
>
> On Fri, Aug 21, 2015 at 5:28 PM, Sean Owen  wrote:
> > Signatures, license, etc. look good. I'm getting some fairly
> > consistent failures using Java 7 + Ubuntu 15 + "-Pyarn -Phive
> > -Phive-thriftserver -Phadoop-2.6" -- does anyone else see these? they
> > are likely just test problems, but worth asking. Stack traces are at
> > the end.
> >
> > There are currently 79 issues targeted for 1.5.0, of which 19 are
> > bugs, of which 1 is a blocker. (1032 have been resolved for 1.5.0.)
> > That's significantly better than at the last release. I presume a lot
> > of what's still targeted is not critical and can now be
> > untargeted/retargeted.
> >
> > It occurs to me that the flurry of planning that took place at the
> > start of the 1.5 QA cycle a few weeks ago was quite helpful, and is
> > the kind of thing that would be even more useful at the start of a
> > release cycle. So would be great to do this for 1.6 in a few weeks.
> > Indeed there are already 267 issues targeted for 1.6.0 -- a decent
> > roadmap already.
> >
> >
> > Test failures:
> >
> > Core
> >
> > - Unpersisting TorrentBroadcast on executors and driver in distributed
> > mode *** FAILED ***
> >   java.util.concurrent.TimeoutException: Can't find 2 executors before
> > 1 milliseconds elapsed
> >   at
> org.apache.spark.ui.jobs.JobProgressListener.waitUntilExecutorsUp(JobProgressListener.scala:561)
> >   at
> org.apache.spark.broadcast.BroadcastSuite.testUnpersistBroadcast(BroadcastSuite.scala:313)
> >   at org.apache.spark.broadcast.BroadcastSuite.org
> $apache$spark$broadcast$BroadcastSuite$$testUnpersistTorrentBroadcast(BroadcastSuite.scala:287)
> >   at
> org.apache.spark.broadcast.BroadcastSuite$$anonfun$16.apply$mcV$sp(BroadcastSuite.scala:165)
> >   at
> org.apache.spark.broadcast.BroadcastSuite$$anonfun$16.apply(BroadcastSuite.scala:165)
> >   at
> org.apache.spark.broadcast.BroadcastSuite$$anonfun$16.apply(BroadcastSuite.scala:165)
> >   at
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
> >   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
> >   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> >   at org.scalatest.Transformer.apply(Transformer.scala:22)
> >   ...
> >
> > Streaming
> >
> > - stop slow receiver gracefully *** FAILED ***
> >   0 was not greater than 0 (StreamingContextSuite.scala:324)
> >
> > Kafka
> >
> > - offset recovery *** FAILED ***
> >   The code passed to eventually never returned normally. Attempted 191
> > times over 10.043196973 seconds. Last failure message:
> > strings.forall({
> > ((elem: Any) => DirectKafkaStreamSuite.collectedData.contains(elem))
> >   }) was false. (DirectKafkaStreamSuite.scala:249)
> >
> > On Fri, Aug 21, 2015 at 5:37 AM, Reynold Xin 
> wrote:
> >> Please vote on releasing the following candidate as Apache Spark version
> >> 1.5.0!
> >>
> >> The vote is open until Monday, Aug 17, 2015 at 20:00 UTC and passes if a
> >> majority of at least 3 +1 PMC votes are cast.
> >>
> >> [ ] +1 Release this package as Apache Spark 1.5.0
> >> [ ] -1 Do not release this package because ...
> >>
> >> To learn more about Apache Spark, please see http://spark.apache.org/
> >>
> >>
> >> The tag to be voted on is v1.5.0-rc1:
> >>
> https://github.com/apache/spark/tree/4c56ad772637615cc1f4f88d619fac6c372c8552
> >>
> >> The release files, including signatures, digests, etc. can be found at:
> >> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc1-bin/
> >>
> >> Release artifacts are signed with the following key:
> >> https://people.apache.org/keys/committer/pwendell.asc
> >>
> >> The staging repository for this release can be found at:
> >> https://repository.apache.org/content/repositories/orgapachespark-1137/
> >>
> >> The documentation corresponding to this release can be found at:
> >> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc1-docs/
> >>
> >>
> >> ===
> >> == How can I help test this release? ==
> >> ===
> >> If you are a Spark user, you can help us test this release by taking an
> >> existing Spark workload and running on this release candidate, then
> >> reporting any regressions.
> >>
> >>
> >> 
> >> == What justifies a -1 vote for this release? ==
> >> 
> >> This vote is happening towards the end of

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-30 Thread Sandy Ryza
+1 (non-binding)
built from source and ran some jobs against YARN

-Sandy

On Sat, Aug 29, 2015 at 5:50 AM, vaquar khan  wrote:

>
> +1 (1.5.0 RC2). Compiled on Windows with YARN.
>
> Regards,
> Vaquar khan
> +1 (non-binding, of course)
>
> 1. Compiled OSX 10.10 (Yosemite) OK Total time: 42:36 min
>  mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
> 2. Tested pyspark, mllib
> 2.1. statistics (min,max,mean,Pearson,Spearman) OK
> 2.2. Linear/Ridge/Laso Regression OK
> 2.3. Decision Tree, Naive Bayes OK
> 2.4. KMeans OK
>Center And Scale OK
> 2.5. RDD operations OK
>   State of the Union Texts - MapReduce, Filter,sortByKey (word count)
> 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
>Model evaluation/optimization (rank, numIter, lambda) with
> itertools OK
> 3. Scala - MLlib
> 3.1. statistics (min,max,mean,Pearson,Spearman) OK
> 3.2. LinearRegressionWithSGD OK
> 3.3. Decision Tree OK
> 3.4. KMeans OK
> 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
> 3.6. saveAsParquetFile OK
> 3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
> registerTempTable, sql OK
> 3.8. result = sqlContext.sql("SELECT
> OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
> JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
> 4.0. Spark SQL from Python OK
> 4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
> 5.0. Packages
> 5.1. com.databricks.spark.csv - read/write OK
> (--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work. But
> com.databricks:spark-csv_2.11:1.2.0 worked)
> 6.0. DataFrames
> 6.1. cast,dtypes OK
> 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
> 6.3. joins,sql,set operations,udf OK
>
> Cheers
> 
>
> On Tue, Aug 25, 2015 at 9:28 PM, Reynold Xin  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.5.0. The vote is open until Friday, Aug 29, 2015 at 5:00 UTC and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.5.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>>
>> The tag to be voted on is v1.5.0-rc2:
>>
>> https://github.com/apache/spark/tree/727771352855dbb780008c449a877f5aaa5fc27a
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release (published as 1.5.0-rc2) can be
>> found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1141/
>>
>> The staging repository for this release (published as 1.5.0) can be found
>> at:
>> https://repository.apache.org/content/repositories/orgapachespark-1140/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-docs/
>>
>>
>> ===
>> How can I help test this release?
>> ===
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>>
>> 
>> What justifies a -1 vote for this release?
>> 
>> This vote is happening towards the end of the 1.5 QA period, so -1 votes
>> should only occur for significant regressions from 1.4. Bugs already
>> present in 1.4, minor regressions, or bugs related to new features will not
>> block this release.
>>
>>
>> ===
>> What should happen to JIRA tickets still targeting 1.5.0?
>> ===
>> 1. It is OK for documentation patches to target 1.5.0 and still go into
>> branch-1.5, since documentations will be packaged separately from the
>> release.
>> 2. New features for non-alpha-modules should target 1.6+.
>> 3. Non-blocker bug fixes should target 1.5.1 or 1.6.0, or drop the target
>> version.
>>
>>
>> ==
>> Major changes to help you focus your testing
>> ==
>>
>> As of today, Spark 1.5 contains more than 1000 commits from 220+
>> contributors. I've curated a list of important changes for 1.5. For the
>> complete list, please refer to Apache JIRA changelog.
>>
>> RDD/DataFrame/SQL APIs
>>
>> - New UDAF interface
>> - DataFrame hints for broadcast join
>> - expr function for turning a SQL expression into DataFrame column
>> - Improved support for NaN values
>> - StructType now supports ordering
>> - TimestampType precision is reduced to 1us
>> - 100 new built-in expressions, includin

Re: Info about Dataset

2015-11-03 Thread Sandy Ryza
Hi Justin,

The Dataset API proposal is available here:
https://issues.apache.org/jira/browse/SPARK-.

-Sandy

On Tue, Nov 3, 2015 at 1:41 PM, Justin Uang  wrote:

> Hi,
>
> I was looking through some of the PRs slated for 1.6.0 and I noted
> something called a Dataset, which looks like a new concept based off of the
> scaladoc for the class. Can anyone point me to some references/design_docs
> regarding the choice to introduce the new concept? I presume it is probably
> something to do with performance optimizations?
>
> Thanks!
>
> Justin
>


Re: A proposal for Spark 2.0

2015-11-10 Thread Sandy Ryza
Another +1 to Reynold's proposal.

Maybe this is obvious, but I'd like to advocate against a blanket removal
of deprecated / developer APIs.  Many APIs can likely be removed without
material impact (e.g. the SparkContext constructor that takes preferred
node location data), while others likely see heavier usage (e.g. I wouldn't
be surprised if mapPartitionsWithContext was baked into a number of apps)
and merit a little extra consideration.

Maybe also obvious, but I think a migration guide with API equivalents and
the like would be incredibly useful in easing the transition.

-Sandy

On Tue, Nov 10, 2015 at 4:28 PM, Reynold Xin  wrote:

> Echoing Shivaram here. I don't think it makes a lot of sense to add more
> features to the 1.x line. We should still do critical bug fixes though.
>
>
> On Tue, Nov 10, 2015 at 4:23 PM, Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
>> +1
>>
>> On a related note I think making it lightweight will ensure that we
>> stay on the current release schedule and don't unnecessarily delay 2.0
>> to wait for new features / big architectural changes.
>>
>> In terms of fixes to 1.x, I think our current policy of back-porting
>> fixes to older releases would still apply. I don't think developing
>> new features on both 1.x and 2.x makes a lot of sense as we would like
>> users to switch to 2.x.
>>
>> Shivaram
>>
>> On Tue, Nov 10, 2015 at 4:02 PM, Kostas Sakellis 
>> wrote:
>> > +1 on a lightweight 2.0
>> >
>> > What is the thinking around the 1.x line after Spark 2.0 is released?
>> If not
>> > terminated, how will we determine what goes into each major version
>> line?
>> > Will 1.x only be for stability fixes?
>> >
>> > Thanks,
>> > Kostas
>> >
>> > On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell 
>> wrote:
>> >>
>> >> I also feel the same as Reynold. I agree we should minimize API breaks
>> and
>> >> focus on fixing things around the edge that were mistakes (e.g.
>> exposing
>> >> Guava and Akka) rather than any overhaul that could fragment the
>> community.
>> >> Ideally a major release is a lightweight process we can do every
>> couple of
>> >> years, with minimal impact for users.
>> >>
>> >> - Patrick
>> >>
>> >> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas
>> >>  wrote:
>> >>>
>> >>> > For this reason, I would *not* propose doing major releases to break
>> >>> > substantial API's or perform large re-architecting that prevent
>> users from
>> >>> > upgrading. Spark has always had a culture of evolving architecture
>> >>> > incrementally and making changes - and I don't think we want to
>> change this
>> >>> > model.
>> >>>
>> >>> +1 for this. The Python community went through a lot of turmoil over
>> the
>> >>> Python 2 -> Python 3 transition because the upgrade process was too
>> painful
>> >>> for too long. The Spark community will benefit greatly from our
>> explicitly
>> >>> looking to avoid a similar situation.
>> >>>
>> >>> > 3. Assembly-free distribution of Spark: don’t require building an
>> >>> > enormous assembly jar in order to run Spark.
>> >>>
>> >>> Could you elaborate a bit on this? I'm not sure what an assembly-free
>> >>> distribution means.
>> >>>
>> >>> Nick
>> >>>
>> >>> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin 
>> wrote:
>> 
>>  I’m starting a new thread since the other one got intermixed with
>>  feature requests. Please refrain from making feature request in this
>> thread.
>>  Not that we shouldn’t be adding features, but we can always add
>> features in
>>  1.7, 2.1, 2.2, ...
>> 
>>  First - I want to propose a premise for how to think about Spark 2.0
>> and
>>  major releases in Spark, based on discussion with several members of
>> the
>>  community: a major release should be low overhead and minimally
>> disruptive
>>  to the Spark community. A major release should not be very different
>> from a
>>  minor release and should not be gated based on new features. The main
>>  purpose of a major release is an opportunity to fix things that are
>> broken
>>  in the current API and remove certain deprecated APIs (examples
>> follow).
>> 
>>  For this reason, I would *not* propose doing major releases to break
>>  substantial API's or perform large re-architecting that prevent
>> users from
>>  upgrading. Spark has always had a culture of evolving architecture
>>  incrementally and making changes - and I don't think we want to
>> change this
>>  model. In fact, we’ve released many architectural changes on the 1.X
>> line.
>> 
>>  If the community likes the above model, then to me it seems
>> reasonable
>>  to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or
>> immediately
>>  after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A
>> cadence of
>>  major releases every 2 years seems doable within the above model.
>> 
>>  Under this model, here is a list of example things I would propose
>> doing
>>  in Spark

Re: A proposal for Spark 2.0

2015-11-10 Thread Sandy Ryza
Oh and another question - should Spark 2.0 support Java 7?

On Tue, Nov 10, 2015 at 4:53 PM, Sandy Ryza  wrote:

> Another +1 to Reynold's proposal.
>
> Maybe this is obvious, but I'd like to advocate against a blanket removal
> of deprecated / developer APIs.  Many APIs can likely be removed without
> material impact (e.g. the SparkContext constructor that takes preferred
> node location data), while others likely see heavier usage (e.g. I wouldn't
> be surprised if mapPartitionsWithContext was baked into a number of apps)
> and merit a little extra consideration.
>
> Maybe also obvious, but I think a migration guide with API equivalents and
> the like would be incredibly useful in easing the transition.
>
> -Sandy
>
> On Tue, Nov 10, 2015 at 4:28 PM, Reynold Xin  wrote:
>
>> Echoing Shivaram here. I don't think it makes a lot of sense to add more
>> features to the 1.x line. We should still do critical bug fixes though.
>>
>>
>> On Tue, Nov 10, 2015 at 4:23 PM, Shivaram Venkataraman <
>> shiva...@eecs.berkeley.edu> wrote:
>>
>>> +1
>>>
>>> On a related note I think making it lightweight will ensure that we
>>> stay on the current release schedule and don't unnecessarily delay 2.0
>>> to wait for new features / big architectural changes.
>>>
>>> In terms of fixes to 1.x, I think our current policy of back-porting
>>> fixes to older releases would still apply. I don't think developing
>>> new features on both 1.x and 2.x makes a lot of sense as we would like
>>> users to switch to 2.x.
>>>
>>> Shivaram
>>>
>>> On Tue, Nov 10, 2015 at 4:02 PM, Kostas Sakellis 
>>> wrote:
>>> > +1 on a lightweight 2.0
>>> >
>>> > What is the thinking around the 1.x line after Spark 2.0 is released?
>>> If not
>>> > terminated, how will we determine what goes into each major version
>>> line?
>>> > Will 1.x only be for stability fixes?
>>> >
>>> > Thanks,
>>> > Kostas
>>> >
>>> > On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell 
>>> wrote:
>>> >>
>>> >> I also feel the same as Reynold. I agree we should minimize API
>>> breaks and
>>> >> focus on fixing things around the edge that were mistakes (e.g.
>>> exposing
>>> >> Guava and Akka) rather than any overhaul that could fragment the
>>> community.
>>> >> Ideally a major release is a lightweight process we can do every
>>> couple of
>>> >> years, with minimal impact for users.
>>> >>
>>> >> - Patrick
>>> >>
>>> >> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas
>>> >>  wrote:
>>> >>>
>>> >>> > For this reason, I would *not* propose doing major releases to
>>> break
>>> >>> > substantial API's or perform large re-architecting that prevent
>>> users from
>>> >>> > upgrading. Spark has always had a culture of evolving architecture
>>> >>> > incrementally and making changes - and I don't think we want to
>>> change this
>>> >>> > model.
>>> >>>
>>> >>> +1 for this. The Python community went through a lot of turmoil over
>>> the
>>> >>> Python 2 -> Python 3 transition because the upgrade process was too
>>> painful
>>> >>> for too long. The Spark community will benefit greatly from our
>>> explicitly
>>> >>> looking to avoid a similar situation.
>>> >>>
>>> >>> > 3. Assembly-free distribution of Spark: don’t require building an
>>> >>> > enormous assembly jar in order to run Spark.
>>> >>>
>>> >>> Could you elaborate a bit on this? I'm not sure what an assembly-free
>>> >>> distribution means.
>>> >>>
>>> >>> Nick
>>> >>>
>>> >>> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin 
>>> wrote:
>>> >>>>
>>> >>>> I’m starting a new thread since the other one got intermixed with
>>> >>>> feature requests. Please refrain from making feature request in
>>> this thread.
>>> >>>> Not that we shouldn’t be adding features, but we can always add
>>> features in
>>> >>>> 1.7, 2.1, 2.2, ...
>>> 

Re: Dropping support for earlier Hadoop versions in Spark 2.0?

2015-11-20 Thread Sandy Ryza
To answer your fourth question from Cloudera's perspective, we would never
support a customer running Spark 2.0 on a Hadoop version < 2.6.

-Sandy

On Fri, Nov 20, 2015 at 1:39 PM, Reynold Xin  wrote:

> OK I'm not exactly asking for a vote here :)
>
> I don't think we should look at it from only maintenance point of view --
> because in that case the answer is clearly supporting as few versions as
> possible (or just rm -rf spark source code and call it a day). It is a
> tradeoff between the number of users impacted and the maintenance burden.
>
> So a few questions for those more familiar with Hadoop:
>
> 1. Can Hadoop 2.6 client read Hadoop 2.4 / 2.3?
>
> 2. If the answer to 1 is yes, are there known, major issues with backward
> compatibility?
>
> 3. Can Hadoop 2.6+ YARN work on older versions of YARN clusters?
>
> 4. (for Hadoop vendors) When did/will support for Hadoop 2.4 and below
> stop? To what extent do you care about running Spark on older Hadoop
> clusters.
>
>
>
> On Fri, Nov 20, 2015 at 7:52 AM, Steve Loughran 
> wrote:
>
>>
>> On 20 Nov 2015, at 14:28, ches...@alpinenow.com wrote:
>>
>> Assuming we have 1.6 and 1.7 releases, Spark 2.0 is about 9 months
>> away.
>>
>> Customers will need to upgrade their Hadoop clusters to Apache 2.6 or
>> later to leverage the new Spark 2.0 within a year. I think this is
>> possible, as the latest releases of CDH 5.x and HDP 2.x are both on Apache
>> 2.6.0 already. Companies will have enough time to upgrade their clusters.
>>
>> +1 for me as well
>>
>> Chester
>>
>>
>> Now, if you are looking that far ahead, the other big issue is "when to
>> retire Java 7 support?"
>>
>> That's a tough decision for all projects. Hadoop 3.x will be Java 8 only,
>> but nobody has committed the patch to the trunk codebase to force a Java 8
>> build, and most of *today's* Hadoop clusters are on Java 7. But as you
>> can't even download a Java 7 JDK for the desktop from Oracle anymore, 2016
>> is the time to look at the language support and decide what the baseline
>> version should be.
>>
>> Commentary from Twitter here -as they point out, it's not just the server
>> farm that matters, it's all the apps that talk to it
>>
>>
>>
>> http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201503.mbox/%3ccab7mwte+kefcxsr6n46-ztcs19ed7cwc9vobtr1jqewdkye...@mail.gmail.com%3E
>>
>> -Steve
>>
>
>


Re: A proposal for Spark 2.0

2015-11-24 Thread Sandy Ryza
I think that Kostas' logic still holds.  The majority of Spark users, and
likely an even larger majority of people running large jobs, are still on
RDDs and on the cusp of upgrading to DataFrames.  Users will probably want
to upgrade to the stable version of the Dataset / DataFrame API so they
don't need to do so twice.  Requiring that they absorb all the other ways
that Spark breaks compatibility in the move to 2.0 makes it much more
difficult for them to make this transition.

Using the same set of APIs also means that it will be easier to backport
critical fixes to the 1.x line.

It's not clear to me that avoiding breakage of an experimental API in the
1.x line outweighs these issues.

-Sandy

On Mon, Nov 23, 2015 at 10:51 PM, Reynold Xin  wrote:

> I actually think the next one (after 1.6) should be Spark 2.0. The reason
> is that I already know we have to break some part of the DataFrame/Dataset
> API as part of the Dataset design. (e.g. DataFrame.map should return
> Dataset rather than RDD). In that case, I'd rather break this sooner (in
> one release) than later (in two releases). so the damage is smaller.
>
> I don't think whether we call Dataset/DataFrame experimental or not
> matters too much for 2.0. We can still call Dataset experimental in 2.0 and
> then mark them as stable in 2.1. Despite being "experimental", there has
> been no breaking changes to DataFrame from 1.3 to 1.6.
>
>
>
> On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra 
> wrote:
>
>> Ah, got it; by "stabilize" you meant changing the API, not just bug
>> fixing.  We're on the same page now.
>>
>> On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis 
>> wrote:
>>
>>> A 1.6.x release will only fix bugs - we typically don't change APIs in z
>>> releases. The Dataset API is experimental and so we might be changing the
>>> APIs before we declare it stable. This is why I think it is important to
>>> first stabilize the Dataset API with a Spark 1.7 release before moving to
>>> Spark 2.0. This will benefit users that would like to use the new Dataset
>>> APIs but can't move to Spark 2.0 because of the backwards incompatible
>>> changes, like removal of deprecated APIs, Scala 2.11 etc.
>>>
>>> Kostas
>>>
>>>
>>> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra 
>>> wrote:
>>>
 Why does stabilization of those two features require a 1.7 release
 instead of 1.6.1?

 On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis 
 wrote:

> We have veered off the topic of Spark 2.0 a little bit here - yes we
> can talk about RDD vs. DS/DF more but lets refocus on Spark 2.0. I'd like
> to propose we have one more 1.x release after Spark 1.6. This will allow 
> us
> to stabilize a few of the new features that were added in 1.6:
>
> 1) the experimental Datasets API
> 2) the new unified memory manager.
>
> I understand our goal for Spark 2.0 is to offer an easy transition but
> there will be users that won't be able to seamlessly upgrade given what we
> have discussed as in scope for 2.0. For these users, having a 1.x release
> with these new features/APIs stabilized will be very beneficial. This 
> might
> make Spark 1.7 a lighter release but that is not necessarily a bad thing.
>
> Any thoughts on this timeline?
>
> Kostas Sakellis
>
>
>
> On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao 
> wrote:
>
>> Agree, more features/apis/optimization need to be added in DF/DS.
>>
>>
>>
>> I mean, we need to think about what kind of RDD APIs we have to
>> provide to developer, maybe the fundamental API is enough, like, the
>> ShuffledRDD etc..  But PairRDDFunctions probably not in this category, as
>> we can do the same thing easily with DF/DS, even better performance.
>>
>>
>>
>> *From:* Mark Hamstra [mailto:m...@clearstorydata.com]
>> *Sent:* Friday, November 13, 2015 11:23 AM
>> *To:* Stephen Boesch
>>
>> *Cc:* dev@spark.apache.org
>> *Subject:* Re: A proposal for Spark 2.0
>>
>>
>>
>> Hmmm... to me, that seems like precisely the kind of thing that
>> argues for retaining the RDD API but not as the first thing presented to
>> new Spark developers: "Here's how to use groupBy with DataFrames 
>> Until
>> the optimizer is more fully developed, that won't always get you the best
>> performance that can be obtained.  In these particular circumstances, 
>> ...,
>> you may want to use the low-level RDD API while setting
>> preservesPartitioning to true.  Like this"
>>
>>
>>
>> On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch 
>> wrote:
>>
>> My understanding is that  the RDD's presently have more support for
>> complete control of partitioning which is a key consideration at scale.
>> While partitioning control is still piecemeal in  DF/DS  it would seem
>> premature to make RDD's a second-tier approach 

Re: A proposal for Spark 2.0

2015-11-25 Thread Sandy Ryza
I see.  My concern is / was that cluster operators will be reluctant to
upgrade to 2.0, meaning that developers using those clusters need to stay
on 1.x, and, if they want to move to DataFrames, essentially need to port
their app twice.

I misunderstood and thought part of the proposal was to drop support for
2.10 though.  If your broad point is that there aren't changes in 2.0 that
will make it less palatable to cluster administrators than releases in the
1.x line, then yes, 2.0 as the next release sounds fine to me.

-Sandy


On Tue, Nov 24, 2015 at 11:55 AM, Matei Zaharia 
wrote:

> What are the other breaking changes in 2.0 though? Note that we're not
> removing Scala 2.10, we're just making the default build be against Scala
> 2.11 instead of 2.10. There seem to be very few changes that people would
> worry about. If people are going to update their apps, I think it's better
> to make the other small changes in 2.0 at the same time than to update once
> for Dataset and another time for 2.0.
>
> BTW just refer to Reynold's original post for the other proposed API
> changes.
>
> Matei
>
> On Nov 24, 2015, at 12:27 PM, Sandy Ryza  wrote:
>
> I think that Kostas' logic still holds.  The majority of Spark users, and
> likely an even larger majority of people running large jobs, are still on
> RDDs and on the cusp of upgrading to DataFrames.  Users will probably want
> to upgrade to the stable version of the Dataset / DataFrame API so they
> don't need to do so twice.  Requiring that they absorb all the other ways
> that Spark breaks compatibility in the move to 2.0 makes it much more
> difficult for them to make this transition.
>
> Using the same set of APIs also means that it will be easier to backport
> critical fixes to the 1.x line.
>
> It's not clear to me that avoiding breakage of an experimental API in the
> 1.x line outweighs these issues.
>
> -Sandy
>
> On Mon, Nov 23, 2015 at 10:51 PM, Reynold Xin  wrote:
>
>> I actually think the next one (after 1.6) should be Spark 2.0. The reason
>> is that I already know we have to break some part of the DataFrame/Dataset
>> API as part of the Dataset design. (e.g. DataFrame.map should return
>> Dataset rather than RDD). In that case, I'd rather break this sooner (in
>> one release) than later (in two releases). so the damage is smaller.
>>
>> I don't think whether we call Dataset/DataFrame experimental or not
>> matters too much for 2.0. We can still call Dataset experimental in 2.0 and
>> then mark them as stable in 2.1. Despite being "experimental", there has
>> been no breaking changes to DataFrame from 1.3 to 1.6.
>>
>>
>>
>> On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra 
>> wrote:
>>
>>> Ah, got it; by "stabilize" you meant changing the API, not just bug
>>> fixing.  We're on the same page now.
>>>
>>> On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis 
>>> wrote:
>>>
>>>> A 1.6.x release will only fix bugs - we typically don't change APIs in
>>>> z releases. The Dataset API is experimental and so we might be changing the
>>>> APIs before we declare it stable. This is why I think it is important to
>>>> first stabilize the Dataset API with a Spark 1.7 release before moving to
>>>> Spark 2.0. This will benefit users that would like to use the new Dataset
>>>> APIs but can't move to Spark 2.0 because of the backwards incompatible
>>>> changes, like removal of deprecated APIs, Scala 2.11 etc.
>>>>
>>>> Kostas
>>>>
>>>>
>>>> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra >>> > wrote:
>>>>
>>>>> Why does stabilization of those two features require a 1.7 release
>>>>> instead of 1.6.1?
>>>>>
>>>>> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis >>>> > wrote:
>>>>>
>>>>>> We have veered off the topic of Spark 2.0 a little bit here - yes we
>>>>>> can talk about RDD vs. DS/DF more but lets refocus on Spark 2.0. I'd like
>>>>>> to propose we have one more 1.x release after Spark 1.6. This will allow 
>>>>>> us
>>>>>> to stabilize a few of the new features that were added in 1.6:
>>>>>>
>>>>>> 1) the experimental Datasets API
>>>>>> 2) the new unified memory manager.
>>>>>>
>>>>>> I understand our goal for Spark 2.0 is to offer an easy transition
>>>>>> but there will be users th

Re: YARN Maven build questions

2014-03-04 Thread Sandy Ryza
Hi Lars,

Unfortunately, due to some incompatible changes we pulled in to be closer
to YARN trunk, Spark-on-YARN does not work against CDH 4.4+ (but does work
against CDH5)

-Sandy


On Tue, Mar 4, 2014 at 6:33 AM, Tom Graves  wrote:

> What is your question about? Any hints?
> The Maven build worked fine for me again yesterday.
>
> You should create a jira for any pull request like the documentation
> states.  The jira thing is new so I think people are still getting used to
> it.
>
> Tom
>
>
>
> On Tuesday, March 4, 2014 2:51 AM, Lars Francke 
> wrote:
>
> Hi,
>
> sorry to bother again.
>
> As a newbie to the project it's hard to judge whether I'm doing
> anything wrong, the documentation is outdated or the Maven/SBT files
> have diverged from the actual code by defining older/now incompatible
> versions or something else going wrong.
>
> Any hints?
>
> Also an unrelated note/question: I see tons of pull requests being
> accepted without a JIRA but the documentation says to create a JIRA
> issue first[1]. So I assume it's okay to just send pull requests?
>
> Thanks for your help.
>
> Cheers,
> Lars
>
> [1] <
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark>
>
>
> On Fri, Feb 28, 2014 at 6:41 PM, Lars Francke 
> wrote:
> > Hey,
> >
> > so currently it doesn't work because of
> > 
> >
> > IntelliJ reports a lot of warnings with default settings and I haven't
> > found a way to tell IntelliJ to use different Hadoop versions yet.
> > mvn clean compile -Pyarn fails as well (compilation error)
> >
> > Your command works indeed. Default yarn version is 0.23.7 which
> > doesn't seem to work with the default 2.2.0 Hadoop version (anymore?)
> >
> > I was basically trying to follow the documentation:
> > 
> >
> > mvn clean compile -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.5.0
> > -Dyarn.version=2.0.0-cdh4.5.0 fails as well as does mvn clean compile
> > -Pyarn-alpha
> >
> > Thanks for showing me a configuration that works. Unfortunately the
> > default ones and at least one of the documented ones fail.
> >
> > Cheers,
> > Lars
> >
> >
> > On Fri, Feb 28, 2014 at 3:05 PM, Tom Graves 
> wrote:
> >> What build command are you using? What do you mean when you say YARN
> branch?
> >>
> >> The yarn builds have been working fine for me with maven.   Build
> command I use against hadoop 2.2 or higher: mvn -Dyarn.version=2.2.0
> -Dhadoop.version=2.2.0 -Pyarn clean package -DskipTests
> >>
> >> Tom
> >>
> >>
> >>
> >> On Friday, February 28, 2014 6:14 AM, Lars Francke <
> lars.fran...@gmail.com> wrote:
> >>
> >> Hey,
> >>
> >> I'm trying to dig into Spark's code but am running into a couple of
> problems.
> >>
> >> 1) The yarn-common directory is not included in the Maven build
> >> causing things to fail because the dependency is missing. If I see the
> >> history correct it used to be a Maven module but is not anymore.
> >>
> >> 2) When I try to include the yarn-common directory in the build things
> >> start going bad. Compilation failures all over the place and I think
> >> there are some dependency issues in there as well.
> >>
> >> This leads me to believe that either the Maven build system isn't
> >> maintained for YARN or the whole YARN branch isn't. What's the status
> >> here?
> >>
> >> Without YARN things build fine for me using Maven.
> >>
> >> Thanks for your help.
> >>
> >> Cheers,
> >> Lars
>


Re: Assigning JIRA's to self

2014-03-12 Thread Sandy Ryza
In the mean time, you don't need to wait for the task to be assigned to you
to start work.  If you're worried about someone else picking it up, you can
drop a short comment on the JIRA saying that you're working on it.


On Wed, Mar 12, 2014 at 3:25 PM, Konstantin Boudnik  wrote:

> Someone with proper karma needs to add you to the contributor list in the
> JIRA.
>
> Cos
>
> On Wed, Mar 12, 2014 at 02:19PM, Sujeet Varakhedi wrote:
> > Hi,
> > I am new to Spark and would like to contribute. I wanted to assign a task
> > to myself but looks like I do not have permission. What is the process
> if I
> > want to work on a JIRA?
> >
> > Thanks
> > Sujeet
>


Re: cloudera repo down again - mqtt

2014-03-14 Thread sandy . ryza
Our guys are looking into it. I'll post when things are back up.

-Sandy

> On Mar 14, 2014, at 7:37 AM, Tom Graves  wrote:
> 
> It appears the cloudera repo for the mqtt stuff is down again. 
> 
> Did someone  ping them the last time?  
> 
> Can we pick this up from some other repo?
> 
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-remote-resources-plugin:1.4:process (default) 
> on project spark-examples_2.10: Error resolving project artifact: Could not 
> transfer artifact org.eclipse.paho:mqtt-client:pom:0.4.0 from/to 
> cloudera-repo (https://repository.cloudera.com/artifactory/cloudera-repos): 
> peer not authenticated for project org.eclipse.paho:mqtt-client:jar:0.4.0 
> 
> Tom


Re: Suggestion

2014-04-11 Thread Sandy Ryza
Hi Priya,

Here's a good place to start:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

-Sandy


On Fri, Apr 11, 2014 at 12:05 PM, priya arora wrote:

> Hi,
>
> May I know how one can contribute in this project
> http://spark.apache.org/mllib/ or in any other project. I am very eager to
> contribute. Do let me know.
>
> Thanks & Regards,
> Priya Arora
>


all values for a key must fit in memory

2014-04-20 Thread Sandy Ryza
Hey all,

After a shuffle / groupByKey, Hadoop MapReduce allows the values for a key
to not all fit in memory.  The current ShuffleFetcher.fetch API, which
doesn't distinguish between keys and values, only returning an Iterator[P],
seems incompatible with this.

Any thoughts on how we could achieve parity here?

-Sandy


Re: all values for a key must fit in memory

2014-04-20 Thread Sandy Ryza
The issue isn't that the Iterator[P] can't be disk-backed.  It's that, with
a groupBy, each P is a (Key, Values) tuple, and the entire tuple is read
into memory at once.  The ShuffledRDD is agnostic to what goes inside P.

On Sun, Apr 20, 2014 at 11:36 AM, Mridul Muralidharan wrote:

> An iterator does not imply data has to be memory resident.
> Think merge sort output as an iterator (disk backed).
>
> Tom is actually planning to work on something similar with me on this
> hopefully this or next month.
>
> Regards,
> Mridul
>
>
> On Sun, Apr 20, 2014 at 11:46 PM, Sandy Ryza 
> wrote:
> > Hey all,
> >
> > After a shuffle / groupByKey, Hadoop MapReduce allows the values for a
> key
> > to not all fit in memory.  The current ShuffleFetcher.fetch API, which
> > doesn't distinguish between keys and values, only returning an
> Iterator[P],
> > seems incompatible with this.
> >
> > Any thoughts on how we could achieve parity here?
> >
> > -Sandy
>


Re: all values for a key must fit in memory

2014-04-21 Thread Sandy Ryza
Thanks Matei and Mridul - was basically wondering whether we would be able
to change the shuffle to accommodate this after 1.0, and from your answers
it sounds like we can.


On Mon, Apr 21, 2014 at 12:31 AM, Mridul Muralidharan wrote:

> As Matei mentioned, the Values is now an Iterable : which can be disk
> backed.
> Does that not address the concern ?
>
> @Patrick - we do have cases where the length of the sequence is large
> and size per value is also non trivial : so we do need this :-)
> Note that join is a trivial example where this is required (in our
> current implementation).
>
> Regards,
> Mridul
>
> On Mon, Apr 21, 2014 at 6:25 AM, Sandy Ryza 
> wrote:
> > The issue isn't that the Iterator[P] can't be disk-backed.  It's that,
> with
> > a groupBy, each P is a (Key, Values) tuple, and the entire tuple is read
> > into memory at once.  The ShuffledRDD is agnostic to what goes inside P.
> >
> > On Sun, Apr 20, 2014 at 11:36 AM, Mridul Muralidharan  >wrote:
> >
> >> An iterator does not imply data has to be memory resident.
> >> Think merge sort output as an iterator (disk backed).
> >>
> >> Tom is actually planning to work on something similar with me on this
> >> hopefully this or next month.
> >>
> >> Regards,
> >> Mridul
> >>
> >>
> >> On Sun, Apr 20, 2014 at 11:46 PM, Sandy Ryza 
> >> wrote:
> >> > Hey all,
> >> >
> >> > After a shuffle / groupByKey, Hadoop MapReduce allows the values for a
> >> key
> >> > to not all fit in memory.  The current ShuffleFetcher.fetch API, which
> >> > doesn't distinguish between keys and values, only returning an
> >> Iterator[P],
> >> > seems incompatible with this.
> >> >
> >> > Any thoughts on how we could achieve parity here?
> >> >
> >> > -Sandy
> >>
>


Re: Any plans for new clustering algorithms?

2014-04-21 Thread Sandy Ryza
If it's not done already, would it make sense to codify this philosophy
somewhere?  I imagine this won't be the first time this discussion comes
up, and it would be nice to have a doc to point to.  I'd be happy to take a
stab at this.


On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng  wrote:

> +1 on Sean's comment. MLlib covers the basic algorithms but we
> definitely need to spend more time on how to make the design scalable.
> For example, think about current "ProblemWithAlgorithm" naming scheme.
> That being said, new algorithms are welcomed. I wish they are
> well-established and well-understood by users. They shouldn't be
> research algorithms tuned to work well with a particular dataset but
> not tested widely. You see the change log from Mahout:
>
> ===
> The following algorithms that were marked deprecated in 0.8 have been
> removed in 0.9:
>
> From Clustering:
>   Switched LDA implementation from using Gibbs Sampling to Collapsed
> Variational Bayes (CVB)
> Meanshift
> MinHash - removed due to poor performance, lack of support and lack of
> usage
>
> From Classification (both are sequential implementations)
> Winnow - lack of actual usage and support
> Perceptron - lack of actual usage and support
>
> Collaborative Filtering
> SlopeOne implementations in
> org.apache.mahout.cf.taste.hadoop.slopeone and
> org.apache.mahout.cf.taste.impl.recommender.slopeone
> Distributed pseudo recommender in
> org.apache.mahout.cf.taste.hadoop.pseudo
> TreeClusteringRecommender in
> org.apache.mahout.cf.taste.impl.recommender
>
> Mahout Math
> Hadoop entropy stuff in org.apache.mahout.math.stats.entropy
> ===
>
> In MLlib, we should include the algorithms users know how to use and
> we can provide support rather than letting algorithms come and go.
>
> My $0.02,
> Xiangrui
>
> On Mon, Apr 21, 2014 at 10:23 AM, Sean Owen  wrote:
> > On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown  wrote:
> >> - MLlib as Mahout.next would be unfortunate.  There are some gems in
> >> Mahout, but there are also lots of rocks.  Setting a minimal bar of
> >> working, correctly implemented, and documented requires a surprising
> amount
> >> of work.
> >
> > As someone with first-hand knowledge, this is correct. To Sang's
> > question, I can't see value in 'porting' Mahout since it is based on a
> > quite different paradigm. About the only part that translates is the
> > algorithm concept itself.
> >
> > This is also the cautionary tale. The contents of the project have
> > ended up being a number of "drive-by" contributions of implementations
> > that, while individually perhaps brilliant (perhaps), didn't
> > necessarily match any other implementation in structure, input/output,
> > libraries used. The implementations were often a touch academic. The
> > result was hard to document, maintain, evolve or use.
> >
> > Far more of the structure of the MLlib implementations are consistent
> > by virtue of being built around Spark core already. That's great.
> >
> > One can't wait to completely build the foundation before building any
> > implementations. To me, the existing implementations are almost
> > exactly the basics I would choose. They cover the bases and will
> > exercise the abstractions and structure. So that's also great IMHO.
>


Re: Any plans for new clustering algorithms?

2014-04-21 Thread Sandy Ryza
How do I get permissions to edit the wiki?


On Mon, Apr 21, 2014 at 3:19 PM, Xiangrui Meng  wrote:

> Cannot agree more with your words. Could you add one section about
> "how and what to contribute" to MLlib's guide? -Xiangrui
>
> On Mon, Apr 21, 2014 at 1:41 PM, Nick Pentreath
>  wrote:
> > I'd say a section in the "how to contribute" page would be a good place
> to put this.
> >
> > In general I'd say that the criteria for inclusion of an algorithm is it
> should be high quality, widely known, used and accepted (citations and
> concrete use cases as examples of this), scalable and parallelizable, well
> documented and with reasonable expectation of dev support
> >
> > Sent from my iPhone
> >
> >> On 21 Apr 2014, at 19:59, Sandy Ryza  wrote:
> >>
> >> If it's not done already, would it make sense to codify this philosophy
> >> somewhere?  I imagine this won't be the first time this discussion comes
> >> up, and it would be nice to have a doc to point to.  I'd be happy to
> take a
> >> stab at this.
> >>
> >>
> >>> On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng 
> wrote:
> >>>
> >>> +1 on Sean's comment. MLlib covers the basic algorithms but we
> >>> definitely need to spend more time on how to make the design scalable.
> >>> For example, think about current "ProblemWithAlgorithm" naming scheme.
> >>> That being said, new algorithms are welcomed. I wish they are
> >>> well-established and well-understood by users. They shouldn't be
> >>> research algorithms tuned to work well with a particular dataset but
> >>> not tested widely. You see the change log from Mahout:
> >>>
> >>> ===
> >>> The following algorithms that were marked deprecated in 0.8 have been
> >>> removed in 0.9:
> >>>
> >>> From Clustering:
> >>>  Switched LDA implementation from using Gibbs Sampling to Collapsed
> >>> Variational Bayes (CVB)
> >>> Meanshift
> >>> MinHash - removed due to poor performance, lack of support and lack of
> >>> usage
> >>>
> >>> From Classification (both are sequential implementations)
> >>> Winnow - lack of actual usage and support
> >>> Perceptron - lack of actual usage and support
> >>>
> >>> Collaborative Filtering
> >>>SlopeOne implementations in
> >>> org.apache.mahout.cf.taste.hadoop.slopeone and
> >>> org.apache.mahout.cf.taste.impl.recommender.slopeone
> >>>Distributed pseudo recommender in
> >>> org.apache.mahout.cf.taste.hadoop.pseudo
> >>>TreeClusteringRecommender in
> >>> org.apache.mahout.cf.taste.impl.recommender
> >>>
> >>> Mahout Math
> >>>Hadoop entropy stuff in org.apache.mahout.math.stats.entropy
> >>> ===
> >>>
> >>> In MLlib, we should include the algorithms users know how to use and
> >>> we can provide support rather than letting algorithms come and go.
> >>>
> >>> My $0.02,
> >>> Xiangrui
> >>>
> >>>> On Mon, Apr 21, 2014 at 10:23 AM, Sean Owen 
> wrote:
> >>>>> On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown 
> wrote:
> >>>>> - MLlib as Mahout.next would be a unfortunate.  There are some gems
> in
> >>>>> Mahout, but there are also lots of rocks.  Setting a minimal bar of
> >>>>> working, correctly implemented, and documented requires a surprising
> >>> amount
> >>>>> of work.
> >>>>
> >>>> As someone with first-hand knowledge, this is correct. To Sang's
> >>>> question, I can't see value in 'porting' Mahout since it is based on a
> >>>> quite different paradigm. About the only part that translates is the
> >>>> algorithm concept itself.
> >>>>
> >>>> This is also the cautionary tale. The contents of the project have
> >>>> ended up being a number of "drive-by" contributions of implementations
> >>>> that, while individually perhaps brilliant (perhaps), didn't
> >>>> necessarily match any other implementation in structure, input/output,
> >>>> libraries used. The implementations were often a touch academic. The
> >>>> result was hard to document, maintain, evolve or use.
> >>>>
> >>>> Far more of the structure of the MLlib implementations are consistent
> >>>> by virtue of being built around Spark core already. That's great.
> >>>>
> >>>> One can't wait to completely build the foundation before building any
> >>>> implementations. To me, the existing implementations are almost
> >>>> exactly the basics I would choose. They cover the bases and will
> >>>> exercise the abstractions and structure. So that's also great IMHO.
> >>>
>


Re: Any plans for new clustering algorithms?

2014-04-21 Thread Sandy Ryza
I thought this might be a good thing to add to the wiki's "How to
contribute" 
page<https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark>,
as it's not tied to a release.


On Mon, Apr 21, 2014 at 6:09 PM, Xiangrui Meng  wrote:

> The markdown files are under spark/docs. You can submit a PR for
> changes. -Xiangrui
>
> On Mon, Apr 21, 2014 at 6:01 PM, Sandy Ryza 
> wrote:
> > How do I get permissions to edit the wiki?
> >
> >
> > On Mon, Apr 21, 2014 at 3:19 PM, Xiangrui Meng  wrote:
> >
> >> Cannot agree more with your words. Could you add one section about
> >> "how and what to contribute" to MLlib's guide? -Xiangrui
> >>
> >> On Mon, Apr 21, 2014 at 1:41 PM, Nick Pentreath
> >>  wrote:
> >> > I'd say a section in the "how to contribute" page would be a good
> place
> >> to put this.
> >> >
> >> > In general I'd say that the criteria for inclusion of an algorithm is
> it
> >> should be high quality, widely known, used and accepted (citations and
> >> concrete use cases as examples of this), scalable and parallelizable,
> well
> >> documented and with reasonable expectation of dev support
> >> >
> >> > Sent from my iPhone
> >> >
> >> >> On 21 Apr 2014, at 19:59, Sandy Ryza 
> wrote:
> >> >>
> >> >> If it's not done already, would it make sense to codify this
> philosophy
> >> >> somewhere?  I imagine this won't be the first time this discussion
> comes
> >> >> up, and it would be nice to have a doc to point to.  I'd be happy to
> >> take a
> >> >> stab at this.
> >> >>
> >> >>
> >> >>> On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng 
> >> wrote:
> >> >>>
> >> >>> +1 on Sean's comment. MLlib covers the basic algorithms but we
> >> >>> definitely need to spend more time on how to make the design
> scalable.
> >> >>> For example, think about current "ProblemWithAlgorithm" naming
> scheme.
> >> >>> That being said, new algorithms are welcomed. I wish they are
> >> >>> well-established and well-understood by users. They shouldn't be
> >> >>> research algorithms tuned to work well with a particular dataset but
> >> >>> not tested widely. You see the change log from Mahout:
> >> >>>
> >> >>> ===
> >> >>> The following algorithms that were marked deprecated in 0.8 have
> been
> >> >>> removed in 0.9:
> >> >>>
> >> >>> From Clustering:
> >> >>>  Switched LDA implementation from using Gibbs Sampling to Collapsed
> >> >>> Variational Bayes (CVB)
> >> >>> Meanshift
> >> >>> MinHash - removed due to poor performance, lack of support and lack
> of
> >> >>> usage
> >> >>>
> >> >>> From Classification (both are sequential implementations)
> >> >>> Winnow - lack of actual usage and support
> >> >>> Perceptron - lack of actual usage and support
> >> >>>
> >> >>> Collaborative Filtering
> >> >>>SlopeOne implementations in
> >> >>> org.apache.mahout.cf.taste.hadoop.slopeone and
> >> >>> org.apache.mahout.cf.taste.impl.recommender.slopeone
> >> >>>Distributed pseudo recommender in
> >> >>> org.apache.mahout.cf.taste.hadoop.pseudo
> >> >>>TreeClusteringRecommender in
> >> >>> org.apache.mahout.cf.taste.impl.recommender
> >> >>>
> >> >>> Mahout Math
> >> >>>Hadoop entropy stuff in org.apache.mahout.math.stats.entropy
> >> >>> ===
> >> >>>
> >> >>> In MLlib, we should include the algorithms users know how to use and
> >> >>> we can provide support rather than letting algorithms come and go.
> >> >>>
> >> >>> My $0.02,
> >> >>> Xiangrui
> >> >>>
> >> >>>> On Mon, Apr 21, 2014 at 10:23 AM, Sean Owen 
> >> wrote:
> >> >>>>> On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown 
> >> wrote:
> >> >>>>> - MLlib as Mahout.next would be a unfortunate.  There are some
> gems
> >> in
>

Re: Any plans for new clustering algorithms?

2014-04-22 Thread Sandy Ryza
Thanks Matei.  I added a section to the "How to contribute" page.


On Mon, Apr 21, 2014 at 7:25 PM, Matei Zaharia wrote:

> The wiki is actually maintained separately in
> https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage. We
> restricted editing of the wiki because bots would automatically add stuff.
> I've given you permissions now.
>
> Matei
>
> On Apr 21, 2014, at 6:22 PM, Nan Zhu  wrote:
>
> > I thought those are files of spark.apache.org?
> >
> > --
> > Nan Zhu
> >
> >
> > On Monday, April 21, 2014 at 9:09 PM, Xiangrui Meng wrote:
> >
> >> The markdown files are under spark/docs. You can submit a PR for
> >> changes. -Xiangrui
> >>
> >> On Mon, Apr 21, 2014 at 6:01 PM, Sandy Ryza  wrote:
> >>> How do I get permissions to edit the wiki?
> >>>
> >>>
> >>> On Mon, Apr 21, 2014 at 3:19 PM, Xiangrui Meng  wrote:
> >>>
> >>>> Cannot agree more with your words. Could you add one section about
> >>>> "how and what to contribute" to MLlib's guide? -Xiangrui
> >>>>
> >>>> On Mon, Apr 21, 2014 at 1:41 PM, Nick Pentreath
> >>>> wrote:
> >>>>> I'd say a section in the "how to contribute" page would be a good
> place
> >>>>
> >>>> to put this.
> >>>>>
> >>>>> In general I'd say that the criteria for inclusion of an algorithm
> is it
> >>>> should be high quality, widely known, used and accepted (citations and
> >>>> concrete use cases as examples of this), scalable and parallelizable,
> well
> >>>> documented and with reasonable expectation of dev support
> >>>>>
> >>>>> Sent from my iPhone
> >>>>>
> >>>>>> On 21 Apr 2014, at 19:59, Sandy Ryza  wrote:
> >>>>>>
> >>>>>> If it's not done already, would it make sense to codify this
> philosophy
> >>>>>> somewhere? I imagine this won't be the first time this discussion
> comes
> >>>>>> up, and it would be nice to have a doc to point to. I'd be happy to
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>> take a
> >>>>>> stab at this.
> >>>>>>
> >>>>>>
> >>>>>>> On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng  wrote:
> >>>>>>>
> >>>>>>> +1 on Sean's comment. MLlib covers the basic algorithms but we
> >>>>>>> definitely need to spend more time on how to make the design
> scalable.
> >>>>>>> For example, think about current "ProblemWithAlgorithm" naming
> scheme.
> >>>>>>> That being said, new algorithms are welcomed. I wish they are
> >>>>>>> well-established and well-understood by users. They shouldn't be
> >>>>>>> research algorithms tuned to work well with a particular dataset
> but
> >>>>>>> not tested widely. You see the change log from Mahout:
> >>>>>>>
> >>>>>>> ===
> >>>>>>> The following algorithms that were marked deprecated in 0.8 have
> been
> >>>>>>> removed in 0.9:
> >>>>>>>
> >>>>>>> From Clustering:
> >>>>>>> Switched LDA implementation from using Gibbs Sampling to Collapsed
> >>>>>>> Variational Bayes (CVB)
> >>>>>>> Meanshift
> >>>>>>> MinHash - removed due to poor performance, lack of support and
> lack of
> >>>>>>> usage
> >>>>>>>
> >>>>>>> From Classification (both are sequential implementations)
> >>>>>>> Winnow - lack of actual usage and support
> >>>>>>> Perceptron - lack of actual usage and support
> >>>>>>>
> >>>>>>> Collaborative Filtering
> >>>>>>> SlopeOne implementations in
> >>>>>>> org.apache.mahout.cf.taste.hadoop.slopeone and
> >>>>>>> org.apache.mahout.cf.taste.i

Re: Apache Spark running out of the spark shell

2014-05-03 Thread Sandy Ryza
Hi AJ,

You might find this helpful -
http://blog.cloudera.com/blog/2014/04/how-to-run-a-simple-apache-spark-app-in-cdh-5/
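If it helps, the packaging step boils down to a small sbt project. A minimal
sketch (the project name and version numbers below are only examples for the
0.9.x line; adjust them to whatever your cluster runs):

---
// build.sbt (sketch; versions are illustrative)
name := "simple-spark-app"

version := "0.1"

scalaVersion := "2.10.3"

libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.1"
---

Running "sbt package" then produces the project jar under target/scala-2.10/,
which is the jar the quick-start guide refers to; you can list it in your
SparkContext's jars argument (or ship it however you launch the job) instead of
typing the code into the shell.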

-Sandy


On Sat, May 3, 2014 at 8:42 AM, Ajay Nair  wrote:

> Hi,
>
> I have written a code that works just about fine in the spark shell on EC2.
> The ec2 script helped me configure my master and worker nodes. Now I want
> to
> run the scala-spark code out side the interactive shell. How do I go about
> doing it.
>
> I was referring to the instructions mentioned here:
> https://spark.apache.org/docs/0.9.1/quick-start.html
>
> But this is confusing because it mentions a simple project jar file
> which I am not sure how to generate. I only have the file that runs directly
> on my spark shell. Any easy instruction to get this quickly running as a
> job?
>
> Thanks
> AJ
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Apache-Spark-running-out-of-the-spark-shell-tp6459.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>


Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-15 Thread Sandy Ryza
+1 (non-binding)

* Built the release from source.
* Compiled Java and Scala apps that interact with HDFS against it.
* Ran them in local mode.
* Ran them against a pseudo-distributed YARN cluster in both yarn-client
mode and yarn-cluster mode.


On Tue, May 13, 2014 at 9:09 PM, witgo  wrote:

> You need to set:
> spark.akka.frameSize 5
> spark.default.parallelism1
>
>
>
>
>
> -- Original --
> From:  "Madhu";;
> Date:  Wed, May 14, 2014 09:15 AM
> To:  "dev";
>
> Subject:  Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
>
>
>
> I just built rc5 on Windows 7 and tried to reproduce the problem described
> in
>
> https://issues.apache.org/jira/browse/SPARK-1712
>
> It works on my machine:
>
> 14/05/13 21:06:47 INFO DAGScheduler: Stage 1 (sum at :17) finished
> in 4.548 s
> 14/05/13 21:06:47 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks
> have all completed, from pool
> 14/05/13 21:06:47 INFO SparkContext: Job finished: sum at :17,
> took
> 4.814991993 s
> res1: Double = 5.05E11
>
> I used all defaults, no config files were changed.
> Not sure if that makes a difference...
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-0-0-rc5-tp6542p6560.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
> .


Re: [VOTE] Release Apache Spark 1.0.0 (rc9)

2014-05-17 Thread Sandy Ryza
+1

Reran my tests from rc5:

* Built the release from source.
* Compiled Java and Scala apps that interact with HDFS against it.
* Ran them in local mode.
* Ran them against a pseudo-distributed YARN cluster in both yarn-client
mode and yarn-cluster mode.


On Sat, May 17, 2014 at 10:08 AM, Andrew Or  wrote:

> +1
>
>
> 2014-05-17 8:53 GMT-07:00 Mark Hamstra :
>
> > +1
> >
> >
> > On Sat, May 17, 2014 at 12:58 AM, Patrick Wendell  > >wrote:
> >
> > > I'll start the voting with a +1.
> > >
> > > On Sat, May 17, 2014 at 12:58 AM, Patrick Wendell 
> > > wrote:
> > > > Please vote on releasing the following candidate as Apache Spark
> > version
> > > 1.0.0!
> > > > This has one bug fix and one minor feature on top of rc8:
> > > > SPARK-1864: https://github.com/apache/spark/pull/808
> > > > SPARK-1808: https://github.com/apache/spark/pull/799
> > > >
> > > > The tag to be voted on is v1.0.0-rc9 (commit 920f947):
> > > >
> > >
> >
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=920f947eb5a22a679c0c3186cf69ee75f6041c75
> > > >
> > > > The release files, including signatures, digests, etc. can be found
> at:
> > > > http://people.apache.org/~pwendell/spark-1.0.0-rc9/
> > > >
> > > > Release artifacts are signed with the following key:
> > > > https://people.apache.org/keys/committer/pwendell.asc
> > > >
> > > > The staging repository for this release can be found at:
> > > >
> > https://repository.apache.org/content/repositories/orgapachespark-1017/
> > > >
> > > > The documentation corresponding to this release can be found at:
> > > > http://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/
> > > >
> > > > Please vote on releasing this package as Apache Spark 1.0.0!
> > > >
> > > > The vote is open until Tuesday, May 20, at 08:56 UTC and passes if
> > > > amajority of at least 3 +1 PMC votes are cast.
> > > >
> > > > [ ] +1 Release this package as Apache Spark 1.0.0
> > > > [ ] -1 Do not release this package because ...
> > > >
> > > > To learn more about Apache Spark, please see
> > > > http://spark.apache.org/
> > > >
> > > > == API Changes ==
> > > > We welcome users to compile Spark applications against 1.0. There are
> > > > a few API changes in this release. Here are links to the associated
> > > > upgrade guides - user facing changes have been kept as small as
> > > > possible.
> > > >
> > > > changes to ML vector specification:
> > > >
> > >
> >
> http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/mllib-guide.html#from-09-to-10
> > > >
> > > > changes to the Java API:
> > > >
> > >
> >
> http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
> > > >
> > > > changes to the streaming API:
> > > >
> > >
> >
> http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
> > > >
> > > > changes to the GraphX API:
> > > >
> > >
> >
> http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091
> > > >
> > > > coGroup and related functions now return Iterable[T] instead of
> Seq[T]
> > > > ==> Call toSeq on the result to restore the old behavior
> > > >
> > > > SparkContext.jarOfClass returns Option[String] instead of Seq[String]
> > > > ==> Call toSeq on the result to restore old behavior
> > >
> >
>


Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-18 Thread Sandy Ryza
I spoke with DB offline about this a little while ago and he confirmed that
he was able to access the jar from the driver.

The issue appears to be a general Java issue: you can't directly
instantiate a class from a dynamically loaded jar.

I reproduced it locally outside of Spark with:
---
URLClassLoader urlClassLoader = new URLClassLoader(new URL[] { new
File("myotherjar.jar").toURI().toURL() }, null);
Thread.currentThread().setContextClassLoader(urlClassLoader);
MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar();
---

I was able to load the class with reflection.
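For comparison, the reflection path that does work looks roughly like this (a
sketch in Scala, reusing the same made-up jar and class names as above):

---
import java.io.File
import java.net.URLClassLoader

// Same setup as above: a loader that only knows about the dynamically added jar.
val urlClassLoader =
  new URLClassLoader(Array(new File("myotherjar.jar").toURI.toURL), null)

// Looking the class up by name through that loader succeeds, because the
// request goes to the loader that actually contains the class rather than
// the loader that loaded the calling code.
val clazz = urlClassLoader.loadClass("MyClassFromMyOtherJar")
val obj = clazz.newInstance()
---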



On Sun, May 18, 2014 at 11:58 AM, Patrick Wendell wrote:

> @db - it's possible that you aren't including the jar in the classpath
> of your driver program (I think this is what mridul was suggesting).
> It would be helpful to see the stack trace of the CNFE.
>
> - Patrick
>
> On Sun, May 18, 2014 at 11:54 AM, Patrick Wendell 
> wrote:
> > @xiangrui - we don't expect these to be present on the system
> > classpath, because they get dynamically added by Spark (e.g. your
> > application can call sc.addJar well after the JVM's have started).
> >
> > @db - I'm pretty surprised to see that behavior. It's definitely not
> > intended that users need reflection to instantiate their classes -
> > something odd is going on in your case. If you could create an
> > isolated example and post it to the JIRA, that would be great.
> >
> > On Sun, May 18, 2014 at 9:58 AM, Xiangrui Meng  wrote:
> >> I created a JIRA: https://issues.apache.org/jira/browse/SPARK-1870
> >>
> >> DB, could you add more info to that JIRA? Thanks!
> >>
> >> -Xiangrui
> >>
> >> On Sun, May 18, 2014 at 9:46 AM, Xiangrui Meng 
> wrote:
> >>> Btw, I tried
> >>>
> >>> rdd.map { i =>
> >>>   System.getProperty("java.class.path")
> >>> }.collect()
> >>>
> >>> but didn't see the jars added via "--jars" on the executor classpath.
> >>>
> >>> -Xiangrui
> >>>
> >>> On Sat, May 17, 2014 at 11:26 PM, Xiangrui Meng 
> wrote:
>  I can re-produce the error with Spark 1.0-RC and YARN (CDH-5). The
>  reflection approach mentioned by DB didn't work either. I checked the
>  distributed cache on a worker node and found the jar there. It is also
>  in the Environment tab of the WebUI. The workaround is making an
>  assembly jar.
> 
>  DB, could you create a JIRA and describe what you have found so far?
> Thanks!
> 
>  Best,
>  Xiangrui
> 
>  On Sat, May 17, 2014 at 1:29 AM, Mridul Muralidharan <
> mri...@gmail.com> wrote:
> > Can you try moving your mapPartitions to another class/object which
> is
> > referenced only after sc.addJar ?
> >
> > I would suspect CNFEx is coming while loading the class containing
> > mapPartitions before addJars is executed.
> >
> > In general though, dynamic loading of classes means you use
> reflection to
> > instantiate it since expectation is you don't know which
> implementation
> > provides the interface ... If you statically know it apriori, you
> bundle it
> > in your classpath.
> >
> > Regards
> > Mridul
> > On 17-May-2014 7:28 am, "DB Tsai"  wrote:
> >
> >> Finally found a way out of the ClassLoader maze! It took me some time to
> >> understand how it works; I think it's worth documenting it in a separate
> >> thread.
> >>
> >> We're trying to add external utility.jar which contains
> CSVRecordParser,
> >> and we added the jar to executors through sc.addJar APIs.
> >>
> >> If the instance of CSVRecordParser is created without reflection, it
> >> raises *ClassNotFoundException*.
> >>
> >> data.mapPartitions(lines => {
> >> val csvParser = new CSVRecordParser(delimiter.charAt(0))
> >> lines.foreach(line => {
> >>   val lineElems = csvParser.parseLine(line)
> >> })
> >> ...
> >> ...
> >>  )
> >>
> >>
> >> If the instance of CSVRecordParser is created through reflection,
> it works.
> >>
> >> data.mapPartitions(lines => {
> >> val loader = Thread.currentThread.getContextClassLoader
> >> val CSVRecordParser =
> >> loader.loadClass("com.alpine.hadoop.ext.CSVRecordParser")
> >>
> >> val csvParser = CSVRecordParser.getConstructor(Character.TYPE)
> >> .newInstance(delimiter.charAt(0).asInstanceOf[Character])
> >>
> >> val parseLine = CSVRecordParser
> >> .getDeclaredMethod("parseLine", classOf[String])
> >>
> >> lines.foreach(line => {
> >>val lineElems = parseLine.invoke(csvParser,
> >> line).asInstanceOf[Array[String]]
> >> })
> >> ...
> >> ...
> >>  )
> >>
> >>
> >> This is identical to this question,
> >>
> >>
> http://stackoverflow.com/questions/7452411/thread-currentthread-setcontextclassloader-without-using-reflection
> >>
> >> It's not intuitive for users to 

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-18 Thread Sandy Ryza
Hey Xiangrui,

If the jars are placed in the distributed cache and loaded statically, as
the primary app jar is in YARN, then it shouldn't be an issue.  Other jars,
however, including additional jars that are sc.addJar'd and jars specified
with the spark-submit --jars argument, are loaded dynamically by executors
with a URLClassLoader.  These jars aren't next to the executors when they
start - the executors fetch them from the driver's HTTP server.


On Sun, May 18, 2014 at 4:05 PM, Xiangrui Meng  wrote:

> Hi Sandy,
>
> It is hard to imagine that a user needs to create an object in that
> way. Since the jars are already in distributed cache before the
> executor starts, is there any reason we cannot add the locally cached
> jars to classpath directly?
>
> Best,
> Xiangrui
>
> On Sun, May 18, 2014 at 4:00 PM, Sandy Ryza 
> wrote:
> > I spoke with DB offline about this a little while ago and he confirmed
> that
> > he was able to access the jar from the driver.
> >
> > The issue appears to be a general Java issue: you can't directly
> > instantiate a class from a dynamically loaded jar.
> >
> > I reproduced it locally outside of Spark with:
> > ---
> > URLClassLoader urlClassLoader = new URLClassLoader(new URL[] { new
> > File("myotherjar.jar").toURI().toURL() }, null);
> > Thread.currentThread().setContextClassLoader(urlClassLoader);
> > MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar();
> > ---
> >
> > I was able to load the class with reflection.
> >
> >
> >
> > On Sun, May 18, 2014 at 11:58 AM, Patrick Wendell  >wrote:
> >
> >> @db - it's possible that you aren't including the jar in the classpath
> >> of your driver program (I think this is what mridul was suggesting).
> >> It would be helpful to see the stack trace of the CNFE.
> >>
> >> - Patrick
> >>
> >> On Sun, May 18, 2014 at 11:54 AM, Patrick Wendell 
> >> wrote:
> >> > @xiangrui - we don't expect these to be present on the system
> >> > classpath, because they get dynamically added by Spark (e.g. your
> >> > application can call sc.addJar well after the JVM's have started).
> >> >
> >> > @db - I'm pretty surprised to see that behavior. It's definitely not
> >> > intended that users need reflection to instantiate their classes -
> >> > something odd is going on in your case. If you could create an
> >> > isolated example and post it to the JIRA, that would be great.
> >> >
> >> > On Sun, May 18, 2014 at 9:58 AM, Xiangrui Meng 
> wrote:
> >> >> I created a JIRA: https://issues.apache.org/jira/browse/SPARK-1870
> >> >>
> >> >> DB, could you add more info to that JIRA? Thanks!
> >> >>
> >> >> -Xiangrui
> >> >>
> >> >> On Sun, May 18, 2014 at 9:46 AM, Xiangrui Meng 
> >> wrote:
> >> >>> Btw, I tried
> >> >>>
> >> >>> rdd.map { i =>
> >> >>>   System.getProperty("java.class.path")
> >> >>> }.collect()
> >> >>>
> >> >>> but didn't see the jars added via "--jars" on the executor
> classpath.
> >> >>>
> >> >>> -Xiangrui
> >> >>>
> >> >>> On Sat, May 17, 2014 at 11:26 PM, Xiangrui Meng 
> >> wrote:
> >> >>>> I can re-produce the error with Spark 1.0-RC and YARN (CDH-5). The
> >> >>>> reflection approach mentioned by DB didn't work either. I checked
> the
> >> >>>> distributed cache on a worker node and found the jar there. It is
> also
> >> >>>> in the Environment tab of the WebUI. The workaround is making an
> >> >>>> assembly jar.
> >> >>>>
> >> >>>> DB, could you create a JIRA and describe what you have found so
> far?
> >> Thanks!
> >> >>>>
> >> >>>> Best,
> >> >>>> Xiangrui
> >> >>>>
> >> >>>> On Sat, May 17, 2014 at 1:29 AM, Mridul Muralidharan <
> >> mri...@gmail.com> wrote:
> >> >>>>> Can you try moving your mapPartitions to another class/object
> which
> >> is
> >> >>>>> referenced only after sc.addJar ?
> >> >>>>>
> >> >>>>> I would suspect CNFEx

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-19 Thread Sandy Ryza
ely,
> >>>
> >>> DB Tsai
> >>> ---
> >>> My Blog: https://www.dbtsai.com
> >>> LinkedIn: https://www.linkedin.com/in/dbtsai
> >>>
> >>>
> >>> On Sun, May 18, 2014 at 11:57 PM, Sean Owen 
> wrote:
> >>>
> >>> > I might be stating the obvious for everyone, but the issue here is
> not
> >>> > reflection or the source of the JAR, but the ClassLoader. The basic
> >>> > rules are this.
> >>> >
> >>> > "new Foo" will use the ClassLoader that defines Foo. This is usually
> >>> > the ClassLoader that loaded whatever it is that first referenced Foo
> >>> > and caused it to be loaded -- usually the ClassLoader holding your
> >>> > other app classes.
> >>> >
> >>> > ClassLoaders can have a parent-child relationship. ClassLoaders
> always
> >>> > look in their parent before themselves.
> >>> >
> >>> > (Careful then -- in contexts like Hadoop or Tomcat where your app is
> >>> > loaded in a child ClassLoader, and you reference a class that Hadoop
> >>> > or Tomcat also has (like a lib class) you will get the container's
> >>> > version!)
> >>> >
> >>> > When you load an external JAR it has a separate ClassLoader which
> does
> >>> > not necessarily bear any relation to the one containing your app
> >>> > classes, so yeah it is not generally going to make "new Foo" work.
> >>> >
> >>> > Reflection lets you pick the ClassLoader, yes.
> >>> >
> >>> > I would not call setContextClassLoader.
> >>> >
> >>> > On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza <
> sandy.r...@cloudera.com>
> >>> > wrote:
> >>> > > I spoke with DB offline about this a little while ago and he
> confirmed
> >>> > that
> >>> > > he was able to access the jar from the driver.
> >>> > >
> >>> > > The issue appears to be a general Java issue: you can't directly
> >>> > > instantiate a class from a dynamically loaded jar.
> >>> > >
> >>> > > I reproduced it locally outside of Spark with:
> >>> > > ---
> >>> > > URLClassLoader urlClassLoader = new URLClassLoader(new URL[] {
> new
> >>> > > File("myotherjar.jar").toURI().toURL() }, null);
> >>> > > Thread.currentThread().setContextClassLoader(urlClassLoader);
> >>> > > MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar();
> >>> > > ---
> >>> > >
> >>> > > I was able to load the class with reflection.
> >>> >
> >>>
>


Re: Sorting partitions in Java

2014-05-20 Thread Sandy Ryza
sortByKey currently requires partitions to fit in memory, but there are
plans to add external sort.
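For reference, the purely in-memory per-partition sort that works today amounts
to something like this (a sketch; the object name is made up, and it falls over
exactly when a partition doesn't fit in the heap, which is the case discussed
below):

---
import org.apache.spark.{SparkConf, SparkContext}

object PartitionSortSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("sketch"))
    val rdd = sc.parallelize(Seq(3 -> "c", 1 -> "a", 2 -> "b", 4 -> "d"), numSlices = 2)

    // Sort each partition independently by pulling it entirely into memory.
    // An external (disk-backed) sort would remove that in-memory requirement.
    val sortedWithinPartitions = rdd.mapPartitions(
      iter => iter.toArray.sortBy(_._1).iterator,
      preservesPartitioning = true)

    println(sortedWithinPartitions.glom().collect().map(_.toList).toList)
    sc.stop()
  }
}
---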


On Tue, May 20, 2014 at 10:10 AM, Madhu  wrote:

> Thanks Sean, I had seen that post you mentioned.
>
> What you suggest looks like an in-memory sort, which is fine if each partition
> is small enough to fit in memory. Is it true that rdd.sortByKey(...) requires
> partitions to fit in memory? I wasn't sure if there was some magic behind
> the scenes that supports arbitrarily large sorts.
>
> None of this is a show stopper, it just might require a little more code on
> the part of the developer. If there's a requirement for Spark partitions to
> fit in memory, developers will have to be aware of that and plan
> accordingly. One nice feature of Hadoop MR is the ability to sort very
> large
> sets without thinking about data size.
>
> In the case that a developer repartitions an RDD such that some partitions
> don't fit in memory, sorting those partitions requires more work. For these
> cases, I think there is value in having a robust partition sorting method
> that deals with it efficiently and reliably.
>
> Is there another solution for sorting arbitrarily large partitions? If not,
> I don't mind developing and contributing a solution.
>
>
>
>
> -
> --
> Madhu
> https://www.linkedin.com/in/msiddalingaiah
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Sorting-partitions-in-Java-tp6715p6719.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>


Re: Sorting partitions in Java

2014-05-20 Thread Sandy Ryza
There is: SPARK-545


On Tue, May 20, 2014 at 10:16 AM, Andrew Ash  wrote:

> Sandy, is there a Jira ticket for that?
>
>
> On Tue, May 20, 2014 at 10:12 AM, Sandy Ryza  >wrote:
>
> > sortByKey currently requires partitions to fit in memory, but there are
> > plans to add external sort
> >
> >
> > On Tue, May 20, 2014 at 10:10 AM, Madhu  wrote:
> >
> > > Thanks Sean, I had seen that post you mentioned.
> > >
> > > What you suggest looks an in-memory sort, which is fine if each
> partition
> > > is
> > > small enough to fit in memory. Is it true that rdd.sortByKey(...)
> > requires
> > > partitions to fit in memory? I wasn't sure if there was some magic
> behind
> > > the scenes that supports arbitrarily large sorts.
> > >
> > > None of this is a show stopper, it just might require a little more
> code
> > on
> > > the part of the developer. If there's a requirement for Spark
> partitions
> > to
> > > fit in memory, developers will have to be aware of that and plan
> > > accordingly. One nice feature of Hadoop MR is the ability to sort very
> > > large
> > > sets without thinking about data size.
> > >
> > > In the case that a developer repartitions an RDD such that some
> > partitions
> > > don't fit in memory, sorting those partitions requires more work. For
> > these
> > > cases, I think there is value in having a robust partition sorting
> method
> > > that deals with it efficiently and reliably.
> > >
> > > Is there another solution for sorting arbitrarily large partitions? If
> > not,
> > > I don't mind developing and contributing a solution.
> > >
> > >
> > >
> > >
> > > -
> > > --
> > > Madhu
> > > https://www.linkedin.com/in/msiddalingaiah
> > > --
> > > View this message in context:
> > >
> >
> http://apache-spark-developers-list.1001551.n3.nabble.com/Sorting-partitions-in-Java-tp6715p6719.html
> > > Sent from the Apache Spark Developers List mailing list archive at
> > > Nabble.com.
> > >
> >
>


Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-20 Thread Sandy Ryza
+1


On Tue, May 20, 2014 at 5:26 PM, Andrew Or  wrote:

> +1
>
>
> 2014-05-20 13:13 GMT-07:00 Tathagata Das :
>
> > Please vote on releasing the following candidate as Apache Spark version
> > 1.0.0!
> >
> > This has a few bug fixes on top of rc9:
> > SPARK-1875: https://github.com/apache/spark/pull/824
> > SPARK-1876: https://github.com/apache/spark/pull/819
> > SPARK-1878: https://github.com/apache/spark/pull/822
> > SPARK-1879: https://github.com/apache/spark/pull/823
> >
> > The tag to be voted on is v1.0.0-rc10 (commit d8070234):
> >
> >
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d807023479ce10aec28ef3c1ab646ddefc2e663c
> >
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~tdas/spark-1.0.0-rc10/
> >
> > The release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/tdas.asc
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1018/
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/
> >
> > The full list of changes in this release can be found at:
> >
> >
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=blob;f=CHANGES.txt;h=d21f0ace6326e099360975002797eb7cba9d5273;hb=d807023479ce10aec28ef3c1ab646ddefc2e663c
> >
> > Please vote on releasing this package as Apache Spark 1.0.0!
> >
> > The vote is open until Friday, May 23, at 20:00 UTC and passes if
> > amajority of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 1.0.0
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see
> > http://spark.apache.org/
> >
> > == API Changes ==
> > We welcome users to compile Spark applications against 1.0. There are
> > a few API changes in this release. Here are links to the associated
> > upgrade guides - user facing changes have been kept as small as
> > possible.
> >
> > Changes to ML vector specification:
> >
> >
> http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/mllib-guide.html#from-09-to-10
> >
> > Changes to the Java API:
> >
> >
> http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
> >
> > Changes to the streaming API:
> >
> >
> http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
> >
> > Changes to the GraphX API:
> >
> >
> http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091
> >
> > Other changes:
> > coGroup and related functions now return Iterable[T] instead of Seq[T]
> > ==> Call toSeq on the result to restore the old behavior
> >
> > SparkContext.jarOfClass returns Option[String] instead of Seq[String]
> > ==> Call toSeq on the result to restore old behavior
> >
>


Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-21 Thread Sandy Ryza
This will solve the issue for jars added upon application submission, but,
on top of this, we need to make sure that anything dynamically added
through sc.addJar works as well.

To do so, we need to make sure that any jars retrieved via the driver's
HTTP server are loaded by the same classloader that loads the jars given on
app submission.  To achieve this, we need to either use the same
classloader for both system jars and user jars, or make sure that the user
jars given on app submission are under the same classloader used for
dynamically added jars.
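One way to picture the single-classloader option (purely a sketch of the idea,
with made-up names, not the executor's actual code) is a URL classloader whose
URL list both the submit-time jars and later sc.addJar'd jars get appended to:

---
import java.net.{URL, URLClassLoader}

// Expose addURL so jars fetched later (e.g. via sc.addJar) land in the same
// loader as the jars supplied at application submission.
class AppendableURLClassLoader(urls: Array[URL], parent: ClassLoader)
  extends URLClassLoader(urls, parent) {
  override def addURL(url: URL): Unit = super.addURL(url)
}

// Usage sketch (submitTimeJarUrls and fetchedJar are placeholders):
//   val loader = new AppendableURLClassLoader(submitTimeJarUrls, getClass.getClassLoader)
//   ... later, after fetching a jar from the driver's HTTP server ...
//   loader.addURL(fetchedJar.toURI.toURL)
---

With a single loader like this, classes in the primary jar and classes in
dynamically added jars resolve against the same class space, which is the
property the reflection workaround is currently papering over.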

On Tue, May 20, 2014 at 5:59 PM, Xiangrui Meng  wrote:

> Talked with Sandy and DB offline. I think the best solution is sending
> the secondary jars to the distributed cache of all containers rather
> than just the master, and set the classpath to include spark jar,
> primary app jar, and secondary jars before executor starts. In this
> way, user only needs to specify secondary jars via --jars instead of
> calling sc.addJar inside the code. It also solves the scalability
> problem of serving all the jars via http.
>
> If this solution sounds good, I can try to make a patch.
>
> Best,
> Xiangrui
>
> On Mon, May 19, 2014 at 10:04 PM, DB Tsai  wrote:
> > In 1.0, there is a new option for users to choose which classloader has
> > higher priority via spark.files.userClassPathFirst, I decided to submit
> the
> > PR for 0.9 first. We use this patch in our lab and we can use those jars
> > added by sc.addJar without reflection.
> >
> > https://github.com/apache/spark/pull/834
> >
> > Can anyone comment if it's a good approach?
> >
> > Thanks.
> >
> >
> > Sincerely,
> >
> > DB Tsai
> > ---
> > My Blog: https://www.dbtsai.com
> > LinkedIn: https://www.linkedin.com/in/dbtsai
> >
> >
> > On Mon, May 19, 2014 at 7:42 PM, DB Tsai  wrote:
> >
> >> Good summary! We fixed it in branch 0.9 since our production is still in
> >> 0.9. I'm porting it to 1.0 now, and hopefully will submit PR for 1.0
> >> tonight.
> >>
> >>
> >> Sincerely,
> >>
> >> DB Tsai
> >> ---
> >> My Blog: https://www.dbtsai.com
> >> LinkedIn: https://www.linkedin.com/in/dbtsai
> >>
> >>
> >> On Mon, May 19, 2014 at 7:38 PM, Sandy Ryza  >wrote:
> >>
> >>> It just hit me why this problem is showing up on YARN and not on
> >>> standalone.
> >>>
> >>> The relevant difference between YARN and standalone is that, on YARN,
> the
> >>> app jar is loaded by the system classloader instead of Spark's custom
> URL
> >>> classloader.
> >>>
> >>> On YARN, the system classloader knows about [the classes in the spark
> >>> jars,
> >>> the classes in the primary app jar].   The custom classloader knows
> about
> >>> [the classes in secondary app jars] and has the system classloader as
> its
> >>> parent.
> >>>
> >>> A few relevant facts (mostly redundant with what Sean pointed out):
> >>> * Every class has a classloader that loaded it.
> >>> * When an object of class B is instantiated inside of class A, the
> >>> classloader used for loading B is the classloader that was used for
> >>> loading
> >>> A.
> >>> * When a classloader fails to load a class, it lets its parent
> classloader
> >>> try.  If its parent succeeds, its parent becomes the "classloader that
> >>> loaded it".
> >>>
> >>> So suppose class B is in a secondary app jar and class A is in the
> primary
> >>> app jar:
> >>> 1. The custom classloader will try to load class A.
> >>> 2. It will fail, because it only knows about the secondary jars.
> >>> 3. It will delegate to its parent, the system classloader.
> >>> 4. The system classloader will succeed, because it knows about the
> primary
> >>> app jar.
> >>> 5. A's classloader will be the system classloader.
> >>> 6. A tries to instantiate an instance of class B.
> >>> 7. B will be loaded with A's classloader, which is the system
> classloader.
> >>> 8. Loading B will fail, because A's classloader, which is the system
> >>> classloader, doesn't know about the secondary app jars.
> >>>
> >>> In Spark standalone, A and B are both loaded by the custom
> classload

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-21 Thread Sandy Ryza
Is that an assumption we can make?  I think we'd run into an issue in this
situation:

*In primary jar:*
def makeDynamicObject(clazz: String) = Class.forName(clazz).newInstance()

*In app code:*
sc.addJar("dynamicjar.jar")
...
rdd.map(x => makeDynamicObject("some.class.from.DynamicJar"))

It might be fair to say that the user should make sure to use the context
classloader when instantiating dynamic classes, but I think it's weird that
this code would work on Spark standalone but not on YARN.
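For what it's worth, the context-classloader version of that hypothetical
helper would be something like this (a sketch; makeDynamicObject is the made-up
name from the example above):

---
// Resolve the class through the thread's context classloader, which on an
// executor is the loader that knows about dynamically added jars.
def makeDynamicObject(clazz: String): Any =
  Class.forName(clazz, true, Thread.currentThread().getContextClassLoader)
    .newInstance()
---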

-Sandy


On Wed, May 21, 2014 at 2:10 PM, Xiangrui Meng  wrote:

> I think adding jars dynamically should work as long as the primary jar
> and the secondary jars do not depend on dynamically added jars, which
> should be the correct logic. -Xiangrui
>
> On Wed, May 21, 2014 at 1:40 PM, DB Tsai  wrote:
> > This will be another separate story.
> >
> > Since in the yarn deployment, as Sandy said, the app.jar will be always
> in
> > the systemclassloader which means any object instantiated in app.jar will
> > have parent loader of systemclassloader instead of custom one. As a
> result,
> > the custom class loader in yarn will never work without specifically
> using
> > reflection.
> >
> > Solution will be not using system classloader in the classloader
> hierarchy,
> > and add all the resources in system one into custom one. This is the
> > approach of tomcat takes.
> >
> > Or we can directly overwirte the system class loader by calling the
> > protected method `addURL` which will not work and throw exception if the
> > code is wrapped in security manager.
> >
> >
> > Sincerely,
> >
> > DB Tsai
> > ---
> > My Blog: https://www.dbtsai.com
> > LinkedIn: https://www.linkedin.com/in/dbtsai
> >
> >
> > On Wed, May 21, 2014 at 1:13 PM, Sandy Ryza 
> wrote:
> >
> >> This will solve the issue for jars added upon application submission,
> but,
> >> on top of this, we need to make sure that anything dynamically added
> >> through sc.addJar works as well.
> >>
> >> To do so, we need to make sure that any jars retrieved via the driver's
> >> HTTP server are loaded by the same classloader that loads the jars
> given on
> >> app submission.  To achieve this, we need to either use the same
> >> classloader for both system jars and user jars, or make sure that the
> user
> >> jars given on app submission are under the same classloader used for
> >> dynamically added jars.
> >>
> >> On Tue, May 20, 2014 at 5:59 PM, Xiangrui Meng 
> wrote:
> >>
> >> > Talked with Sandy and DB offline. I think the best solution is sending
> >> > the secondary jars to the distributed cache of all containers rather
> >> > than just the master, and set the classpath to include spark jar,
> >> > primary app jar, and secondary jars before executor starts. In this
> >> > way, user only needs to specify secondary jars via --jars instead of
> >> > calling sc.addJar inside the code. It also solves the scalability
> >> > problem of serving all the jars via http.
> >> >
> >> > If this solution sounds good, I can try to make a patch.
> >> >
> >> > Best,
> >> > Xiangrui
> >> >
> >> > On Mon, May 19, 2014 at 10:04 PM, DB Tsai 
> wrote:
> >> > > In 1.0, there is a new option for users to choose which classloader
> has
> >> > > higher priority via spark.files.userClassPathFirst, I decided to
> submit
> >> > the
> >> > > PR for 0.9 first. We use this patch in our lab and we can use those
> >> jars
> >> > > added by sc.addJar without reflection.
> >> > >
> >> > > https://github.com/apache/spark/pull/834
> >> > >
> >> > > Can anyone comment if it's a good approach?
> >> > >
> >> > > Thanks.
> >> > >
> >> > >
> >> > > Sincerely,
> >> > >
> >> > > DB Tsai
> >> > > ---
> >> > > My Blog: https://www.dbtsai.com
> >> > > LinkedIn: https://www.linkedin.com/in/dbtsai
> >> > >
> >> > >
> >> > > On Mon, May 19, 2014 at 7:42 PM, DB Tsai 
> wrote:
> >> > >
> >> > >> Good summary! We fixed it in branch 0.9 since our production is
> still
> >> in
> >> > >>

Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-26 Thread Sandy Ryza
+1


On Mon, May 26, 2014 at 7:38 AM, Tathagata Das
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.0.0!
>
> This has a few important bug fixes on top of rc10:
> SPARK-1900 and SPARK-1918: https://github.com/apache/spark/pull/853
> SPARK-1870 :
> https://github.com/apache/spark/pull/848
> SPARK-1897: https://github.com/apache/spark/pull/849
>
> The tag to be voted on is v1.0.0-rc11 (commit c69d97cd):
>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=c69d97cdb42f809cb71113a1db4194c21372242a
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~tdas/spark-1.0.0-rc11/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/tdas.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1019/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/
>
> Please vote on releasing this package as Apache Spark 1.0.0!
>
> The vote is open until Thursday, May 29, at 16:00 UTC and passes if
> a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.0.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> == API Changes ==
> We welcome users to compile Spark applications against 1.0. There are
> a few API changes in this release. Here are links to the associated
> upgrade guides - user facing changes have been kept as small as
> possible.
>
> Changes to ML vector specification:
>
> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/mllib-guide.html#from-09-to-10
>
> Changes to the Java API:
>
> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
>
> Changes to the streaming API:
>
> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
>
> Changes to the GraphX API:
>
> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091
>
> Other changes:
> coGroup and related functions now return Iterable[T] instead of Seq[T]
> ==> Call toSeq on the result to restore the old behavior
>
> SparkContext.jarOfClass returns Option[String] instead of Seq[String]
> ==> Call toSeq on the result to restore old behavior
>


Re: Please change instruction about "Launching Applications Inside the Cluster"

2014-05-30 Thread Sandy Ryza
They should be - in the sense that the docs now recommend using
spark-submit and thus include entirely different invocations.


On Fri, May 30, 2014 at 12:46 AM, Reynold Xin  wrote:

> Can you take a look at the latest Spark 1.0 docs and see if they are fixed?
>
> https://github.com/apache/spark/tree/master/docs
>
> Thanks.
>
>
> On Thu, May 29, 2014 at 5:29 AM, Lizhengbing (bing, BIPA) <
> zhengbing...@huawei.com> wrote:
>
> > The instruction address is in
> >
> http://spark.apache.org/docs/0.9.0/spark-standalone.html#launching-applications-inside-the-cluster
> > or
> >
> http://spark.apache.org/docs/0.9.1/spark-standalone.html#launching-applications-inside-the-cluster
> >
> > Origin instruction is:
> > "./bin/spark-class org.apache.spark.deploy.Client launch
> >[client-options] \
> >   \
> >[application-options] "
> >
> > If I follow this instruction, I will not run my program deployed in a
> > spark standalone cluster properly.
> >
> > Based on source code, This instruction should be changed to
> > "./bin/spark-class org.apache.spark.deploy.Client [client-options]
> launch \
> >   \
> >[application-options] "
> >
> > That is to say: [client-options] must be put ahead of launch
> >
>


Re: Contributing to MLlib on GLM

2014-06-17 Thread Sandy Ryza
Hi Xiaokai,

I think MLlib is definitely interested in supporting additional GLMs.  I'm
not aware of anybody working on this at the moment.

-Sandy


On Tue, Jun 17, 2014 at 5:00 PM, Xiaokai Wei  wrote:

> Hi,
>
> I am an intern at PalantirTech and we are building some stuff on top of
> MLlib. In Particular, GLM is of great interest to us.  Though
> GeneralizedLinearModel in MLlib 1.0.0 has some important GLMs such as
> Logistic Regression, Linear Regression, some other important GLMs like
> Poisson Regression are still missing.
>
> I am curious that if anyone is already working on other GLMs (e.g.
> Poisson, Gamma). If not, we would like to contribute to MLlib on GLM. Is
> adding more GLMs on the roadmap of MLlib?
>
>
> Sincerely,
>
> Xiaokai
>


Re: Data Locality In Spark

2014-07-08 Thread Sandy Ryza
Hi Anish,

Spark, like MapReduce, makes an effort to schedule tasks on the same nodes
and racks that the input blocks reside on.
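One way to see this from the shell is to ask an RDD for the preferred locations
of its partitions (the HDFS path below is a placeholder):

  val rdd = sc.textFile("hdfs:///data/input")
  rdd.partitions.take(3).foreach { p =>
    // prints the hostnames of the nodes holding each partition's underlying block
    println(p.index + " -> " + rdd.preferredLocations(p))
  }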

-Sandy


On Tue, Jul 8, 2014 at 12:27 PM, anishs...@yahoo.co.in <
anishs...@yahoo.co.in> wrote:

> Hi All
>
> My apologies for very basic question, do we have full support of data
> locality in Spark MapReduce.
>
> Please suggest.
>
> --
> Anish Sneh
> "Experience is the best teacher."
> http://in.linkedin.com/in/anishsneh
>
>


Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Sandy Ryza
Having a common framework for clustering makes sense to me.  While we
should be careful about what algorithms we include, having solid
implementations of minibatch clustering and hierarchical clustering seems
like a worthwhile goal, and we should reuse as much code and APIs as
reasonable.
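As a rough illustration only (these traits and names are hypothetical, not an
existing MLlib API), a shared framework might look something like:

  import org.apache.spark.mllib.linalg.Vector
  import org.apache.spark.rdd.RDD

  // Hypothetical sketch of a common clustering API
  trait ClusteringModel extends Serializable {
    def predict(point: Vector): Int                 // index of the assigned cluster
  }

  trait ClusteringAlgorithm[M <: ClusteringModel] {
    def run(data: RDD[Vector]): M                   // shared entry point for KMeans, MiniBatch KMeans, etc.
  }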


On Tue, Jul 8, 2014 at 1:19 PM, RJ Nowling  wrote:

> Thanks, Hector! Your feedback is useful.
>
> On Tuesday, July 8, 2014, Hector Yee  wrote:
>
> > I would say for bigdata applications the most useful would be
> hierarchical
> > k-means with back tracking and the ability to support k nearest
> centroids.
> >
> >
> > On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling  > > wrote:
> >
> > > Hi all,
> > >
> > > MLlib currently has one clustering algorithm implementation, KMeans.
> > > It would benefit from having implementations of other clustering
> > > algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
> > > Clustering, and Affinity Propagation.
> > >
> > > I recently submitted a PR [1] for a MiniBatch KMeans implementation,
> > > and I saw an email on this list about interest in implementing Fuzzy
> > > C-Means.
> > >
> > > Based on Sean Owen's review of my MiniBatch KMeans code, it became
> > > apparent that before I implement more clustering algorithms, it would
> > > be useful to hammer out a framework to reduce code duplication and
> > > implement a consistent API.
> > >
> > > I'd like to gauge the interest and goals of the MLlib community:
> > >
> > > 1. Are you interested in having more clustering algorithms available?
> > >
> > > 2. Is the community interested in specifying a common framework?
> > >
> > > Thanks!
> > > RJ
> > >
> > > [1] - https://github.com/apache/spark/pull/1248
> > >
> > >
> > > --
> > > em rnowl...@gmail.com 
> > > c 954.496.2314
> > >
> >
> >
> >
> > --
> > Yee Yang Li Hector 
> > *google.com/+HectorYee *
> >
>
>
> --
> em rnowl...@gmail.com
> c 954.496.2314
>


Re: Changes to sbt build have been merged

2014-07-10 Thread Sandy Ryza
Woot!


On Thu, Jul 10, 2014 at 11:15 AM, Patrick Wendell 
wrote:

> Just a heads up, we merged Prashant's work on having the sbt build read all
> dependencies from Maven. Please report any issues you find on the dev list
> or on JIRA.
>
> One note here for developers, going forward the sbt build will use the same
> configuration style as the maven build (-D for options and -P for maven
> profiles). So this will be a change for developers:
>
> sbt/sbt -Dhadoop.version=2.2.0 -Pyarn assembly
>
> For now, we'll continue to support the old env-var options with a
> deprecation warning.
>
> - Patrick
>


Re: better compression codecs for shuffle blocks?

2014-07-14 Thread Sandy Ryza
Stephen,
Often the shuffle is bound by writes to disk, so even if disks have enough
space to store the uncompressed data, the shuffle can complete faster by
writing less data.

Reynold,
This isn't a big help in the short term, but if we switch to a sort-based
shuffle, we'll only need a single LZFOutputStream per map task.
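For anyone who wants to try a different codec in the meantime, the relevant
knobs as of 1.0 live in spark-defaults.conf; the values below are only an
example:

  spark.shuffle.compress        true
  spark.io.compression.codec    org.apache.spark.io.SnappyCompressionCodec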


On Mon, Jul 14, 2014 at 3:30 PM, Stephen Haberman <
stephen.haber...@gmail.com> wrote:

>
> Just a comment from the peanut gallery, but these buffers are a real
> PITA for us as well. Probably 75% of our non-user-error job failures
> are related to them.
>
> Just naively, what about not doing compression on the fly? E.g. during
> the shuffle just write straight to disk, uncompressed?
>
> For us, we always have plenty of disk space, and if you're concerned
> about network transmission, you could add a separate compress step
> after the blocks have been written to disk, but before being sent over
> the wire.
>
> Granted, IANAE, so perhaps this is a bad idea; either way, awesome to
> see work in this area!
>
> - Stephen
>
>


Re: Possible bug in ClientBase.scala?

2014-07-17 Thread Sandy Ryza
>
> /Users/chester/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:596:
> >>> not found: type AMRMClient
> >>>
> >>> [error]   amClient: AMRMClient[ContainerRequest],
> >>>
> >>> [error] ^
> >>>
> >>> [error]
> >>>
> /Users/chester/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:577:
> >>> not found: type AMRMClient
> >>>
> >>> [error]   amClient: AMRMClient[ContainerRequest],
> >>>
> >>> [error] ^
> >>>
> >>> [error]
> >>>
> /Users/chester/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:452:
> >>> value CONTAINER_ID is not a member of object
> >>> org.apache.hadoop.yarn.api.ApplicationConstants.Environment
> >>>
> >>> [error] val containerIdString = System.getenv(
> >>> ApplicationConstants.Environment.CONTAINER_ID.name())
> >>>
> >>> [error]
> >>>   ^
> >>>
> >>> [error]
> >>>
> /Users/chester/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:128:
> >>> value setTokens is not a member of
> >>> org.apache.hadoop.yarn.api.records.ContainerLaunchContext
> >>>
> >>> [error] amContainer.setTokens(ByteBuffer.wrap(dob.getData()))
> >>>
> >>> [error] ^
> >>>
> >>> [error]
> >>>
> /Users/chester/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ExecutorLauncher.scala:36:
> >>> object api is not a member of package org.apache.hadoop.yarn.client
> >>>
> >>> [error] import org.apache.hadoop.yarn.client.api.AMRMClient
> >>>
> >>> [error]  ^
> >>>
> >>> [error]
> >>>
> /Users/chester/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ExecutorLauncher.scala:37:
> >>> object api is not a member of package org.apache.hadoop.yarn.client
> >>>
> >>> [error] import
> >>> org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest
> >>>
> >>> [error]  ^
> >>>
> >>> [error]
> >>>
> /Users/chester/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ExecutorLauncher.scala:39:
> >>> object util is not a member of package org.apache.hadoop.yarn.webapp
> >>>
> >>> [error] import org.apache.hadoop.yarn.webapp.util.WebAppUtils
> >>>
> >>> [error]  ^
> >>>
> >>> [error]
> >>>
> /Users/chester/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ExecutorLauncher.scala:62:
> >>> not found: type AMRMClient
> >>>
> >>> [error]   private var amClient: AMRMClient[ContainerRequest] = _
> >>>
> >>> [error] ^
> >>>
> >>> [error]
> >>>
> /Users/chester/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ExecutorLauncher.scala:99:
> >>> not found: value AMRMClient
> >>>
> >>> [error] amClient = AMRMClient.createAMRMClient()
> >>>
> >>> [error]^
> >>>
> >>> [error]
> >>>
> /Users/chester/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ExecutorLauncher.scala:158:
> >>> not found: value WebAppUtils
> >>>
> >>> [error] val proxy = WebAppUtils.getProxyHostAndPort(conf)
> >>>
> >>> [error] ^
> >>>
> >>> [error]
> >>>
> /Users/chester/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala:31:
> >>> object ProtoUtils is not a member of package
> >>> org.apache.hadoop.yarn.api.records.impl.pb
> >>>
> >>> [error] import org.apache.hadoop.yarn.api.records.impl.pb.ProtoUtils
> >>>
> >>> [error]^
> >>>
> >>> [error]
> >>>
> /Users/chester/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala:33:
> >>> object api is not a member of package org.apache.hadoop.yarn.client
> >>>

Re: Examples have SparkContext improperly labeled?

2014-07-21 Thread Sandy Ryza
Hi RJ,

Spark Shell instantiates a SparkContext for you named "sc".  In other apps,
the user instantiates it themselves and can give the variable whatever name
they want, e.g. "spark".
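For example, in a standalone app (the app name is arbitrary):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf  = new SparkConf().setAppName("MyApp")
  val spark = new SparkContext(conf)   // the same object the shell pre-creates as "sc"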

-Sandy


On Mon, Jul 21, 2014 at 8:36 AM, RJ Nowling  wrote:

> Hi all,
>
> The examples listed here
>
> https://spark.apache.org/examples.html
>
> refer to the spark context as "spark" but when running Spark Shell
> uses "sc" for the SparkContext.
>
> Am I missing something?
>
> Thanks!
>
> RJ
>
>
> --
> em rnowl...@gmail.com
> c 954.496.2314
>


Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Sandy Ryza
It could make sense to add a skipHeader argument to SparkContext.textFile?
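Until something like that exists, a common workaround is to drop the header
from the first partition by hand; a sketch (the path is a placeholder, and this
drops only one header, not one per file):

  val lines = sc.textFile("data.csv")
  val noHeader = lines.mapPartitionsWithIndex { (i, iter) =>
    if (i == 0) iter.drop(1) else iter   // the file's header lands in partition 0
  }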


On Mon, Jul 21, 2014 at 10:37 PM, Reynold Xin  wrote:

> If the purpose is for dropping csv headers, perhaps we don't really need a
> common drop and only one that drops the first line in a file? I'd really
> try hard to avoid a common drop/dropWhile because they can be expensive to
> do.
>
> Note that I think we will be adding this functionality (ignoring headers)
> to the CsvRDD functionality in Spark SQL.
>  https://github.com/apache/spark/pull/1351
>
>
> On Mon, Jul 21, 2014 at 1:45 PM, Mark Hamstra 
> wrote:
>
> > You can find some of the prior, related discussion here:
> > https://issues.apache.org/jira/browse/SPARK-1021
> >
> >
> > On Mon, Jul 21, 2014 at 1:25 PM, Erik Erlandson  wrote:
> >
> > >
> > >
> > > - Original Message -
> > > > Rather than embrace non-lazy transformations and add more of them,
> I'd
> > > > rather we 1) try to fully characterize the needs that are driving
> their
> > > > creation/usage; and 2) design and implement new Spark abstractions
> that
> > > > will allow us to meet those needs and eliminate existing non-lazy
> > > > transformation.
> > >
> > >
> > > In the case of drop, obtaining the index of the boundary partition can
> be
> > > viewed as the action forcing compute -- one that happens to be invoked
> > > inside of a transform.  The concept of a "lazy action", that is only
> > > triggered if the result rdd has compute invoked on it, might be
> > sufficient
> > > to restore laziness to the drop transform.   For that matter, I might
> > find
> > > some way to make use of Scala lazy values directly and achieve the same
> > > goal for drop.
> > >
> > >
> > >
> > > > They really mess up things like creation of asynchronous
> > > > FutureActions, job cancellation and accounting of job resource usage,
> > > etc.,
> > > > so I'd rather we seek a way out of the existing hole rather than make
> > it
> > > > deeper.
> > > >
> > > >
> > > > On Mon, Jul 21, 2014 at 10:24 AM, Erik Erlandson 
> > wrote:
> > > >
> > > > >
> > > > >
> > > > > - Original Message -
> > > > > > Sure, drop() would be useful, but breaking the "transformations
> are
> > > lazy;
> > > > > > only actions launch jobs" model is abhorrent -- which is not to
> say
> > > that
> > > > > we
> > > > > > haven't already broken that model for useful operations (cf.
> > > > > > RangePartitioner, which is used for sorted RDDs), but rather that
> > > each
> > > > > such
> > > > > > exception to the model is a significant source of pain that can
> be
> > > hard
> > > > > to
> > > > > > work with or work around.
> > > > >
> > > > > A thought that comes to my mind here is that there are in fact
> > already
> > > two
> > > > > categories of transform: ones that are truly lazy, and ones that
> are
> > > not.
> > > > >  A possible option is to embrace that, and commit to documenting
> the
> > > two
> > > > > categories as such, with an obvious bias towards favoring lazy
> > > transforms
> > > > > (to paraphrase Churchill, we're down to haggling over the price).
> > > > >
> > > > >
> > > > > >
> > > > > > I really wouldn't like to see another such model-breaking
> > > transformation
> > > > > > added to the API.  On the other hand, being able to write
> > > transformations
> > > > > > with dependencies on these kind of "internal" jobs is sometimes
> > very
> > > > > > useful, so a significant reworking of Spark's Dependency model
> that
> > > would
> > > > > > allow for lazily running such internal jobs and making the
> results
> > > > > > available to subsequent stages may be something worth pursuing.
> > > > >
> > > > >
> > > > > This seems like a very interesting angle.   I don't have much feel
> > for
> > > > > what a solution would look like, but it sounds as if it would
> involve
> > > > > caching all operations embodied by RDD transform method code for
> > > > > provisional execution.  I believe that these levels of invocation
> are
> > > > > currently executed in the master, not executor nodes.
> > > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > On Mon, Jul 21, 2014 at 8:27 AM, Andrew Ash <
> and...@andrewash.com>
> > > > > wrote:
> > > > > >
> > > > > > > Personally I'd find the method useful -- I've often had a .csv
> > file
> > > > > with a
> > > > > > > header row that I want to drop so filter it out, which touches
> > all
> > > > > > > partitions anyway.  I don't have any comments on the
> > implementation
> > > > > quite
> > > > > > > yet though.
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Jul 21, 2014 at 8:24 AM, Erik Erlandson <
> e...@redhat.com>
> > > > > wrote:
> > > > > > >
> > > > > > > > A few weeks ago I submitted a PR for supporting rdd.drop(n),
> > > under
> > > > > > > > SPARK-2315:
> > > > > > > > https://issues.apache.org/jira/browse/SPARK-2315
> > > > > > > >
> > > > > > > > Supporting the drop method would make some operations
> > convenient,
> > > > > however
> > > > > > > > it forces computation

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Sandy Ryza
Yeah, the input format doesn't support this behavior.  But it does tell you
the byte position of each record in the file.
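A sketch of that approach: with TextInputFormat the key is the byte offset of
each record within its file, so offset 0 marks the first line of every file
(the path is a placeholder):

  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapred.TextInputFormat

  val raw = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/*.csv")
  val noHeaders = raw.filter { case (offset, _) => offset.get() != 0L }  // drop the header of each file
                     .map { case (_, line) => line.toString }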


On Mon, Jul 21, 2014 at 10:55 PM, Reynold Xin  wrote:

> Yes, that could work. But it is not as simple as just a binary flag.
>
> We might want to skip the first row for every file, or the header only for
> the first file. The former is not really supported out of the box by the
> input format I think?
>
>
> On Mon, Jul 21, 2014 at 10:50 PM, Sandy Ryza 
> wrote:
>
> > It could make sense to add a skipHeader argument to
> SparkContext.textFile?
> >
> >
> > On Mon, Jul 21, 2014 at 10:37 PM, Reynold Xin 
> wrote:
> >
> > > If the purpose is for dropping csv headers, perhaps we don't really
> need
> > a
> > > common drop and only one that drops the first line in a file? I'd
> really
> > > try hard to avoid a common drop/dropWhile because they can be expensive
> > to
> > > do.
> > >
> > > Note that I think we will be adding this functionality (ignoring
> headers)
> > > to the CsvRDD functionality in Spark SQL.
> > >  https://github.com/apache/spark/pull/1351
> > >
> > >
> > > On Mon, Jul 21, 2014 at 1:45 PM, Mark Hamstra  >
> > > wrote:
> > >
> > > > You can find some of the prior, related discussion here:
> > > > https://issues.apache.org/jira/browse/SPARK-1021
> > > >
> > > >
> > > > On Mon, Jul 21, 2014 at 1:25 PM, Erik Erlandson 
> > wrote:
> > > >
> > > > >
> > > > >
> > > > > - Original Message -
> > > > > > Rather than embrace non-lazy transformations and add more of
> them,
> > > I'd
> > > > > > rather we 1) try to fully characterize the needs that are driving
> > > their
> > > > > > creation/usage; and 2) design and implement new Spark
> abstractions
> > > that
> > > > > > will allow us to meet those needs and eliminate existing non-lazy
> > > > > > transformation.
> > > > >
> > > > >
> > > > > In the case of drop, obtaining the index of the boundary partition
> > can
> > > be
> > > > > viewed as the action forcing compute -- one that happens to be
> > invoked
> > > > > inside of a transform.  The concept of a "lazy action", that is
> only
> > > > > triggered if the result rdd has compute invoked on it, might be
> > > > sufficient
> > > > > to restore laziness to the drop transform.   For that matter, I
> might
> > > > find
> > > > > some way to make use of Scala lazy values directly and achieve the
> > same
> > > > > goal for drop.
> > > > >
> > > > >
> > > > >
> > > > > > They really mess up things like creation of asynchronous
> > > > > > FutureActions, job cancellation and accounting of job resource
> > usage,
> > > > > etc.,
> > > > > > so I'd rather we seek a way out of the existing hole rather than
> > make
> > > > it
> > > > > > deeper.
> > > > > >
> > > > > >
> > > > > > On Mon, Jul 21, 2014 at 10:24 AM, Erik Erlandson  >
> > > > wrote:
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > - Original Message -
> > > > > > > > Sure, drop() would be useful, but breaking the
> "transformations
> > > are
> > > > > lazy;
> > > > > > > > only actions launch jobs" model is abhorrent -- which is not
> to
> > > say
> > > > > that
> > > > > > > we
> > > > > > > > haven't already broken that model for useful operations (cf.
> > > > > > > > RangePartitioner, which is used for sorted RDDs), but rather
> > that
> > > > > each
> > > > > > > such
> > > > > > > > exception to the model is a significant source of pain that
> can
> > > be
> > > > > hard
> > > > > > > to
> > > > > > > > work with or work around.
> > > > > > >
> > > > > > > A thought that comes to my mind here is that there are in fact
> > > > already
> > > >

Re: setting inputMetrics in HadoopRDD#compute()

2014-07-26 Thread Sandy Ryza
I'm working on a patch that switches this stuff out with the Hadoop
FileSystem StatisticsData, which will both give an accurate count and allow
us to get metrics while the task is in progress.  A hitch is that it relies
on https://issues.apache.org/jira/browse/HADOOP-10688, so we still might
want a fallback for versions of Hadoop that don't have this API.
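A rough sketch of the thread-level read, assuming a Hadoop version that has the
HADOOP-10688 API (method names per that JIRA; older versions would still need
the split-length fallback):

  import scala.collection.JavaConverters._
  import org.apache.hadoop.fs.FileSystem

  // bytes read so far by the current task's thread, summed across FileSystem instances
  def threadBytesRead(): Long =
    FileSystem.getAllStatistics.asScala
      .map(_.getThreadStatistics.getBytesRead)
      .sum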


On Sat, Jul 26, 2014 at 10:47 AM, Reynold Xin  wrote:

> There is one piece of information that'd be useful to know, which is the
> source of the input. Even in the presence of an IOException, the input
> metrics still specifies the task is reading from Hadoop.
>
> However, I'm slightly confused by this -- I think usually we'd want to
> report the number of bytes read, rather than the total input size. For
> example, if there is a limit (only read the first 5 records), the actual
> number of bytes read is much smaller than the total split size.
>
> Kay, am I mis-interpreting this?
>
>
>
> On Sat, Jul 26, 2014 at 7:42 AM, Ted Yu  wrote:
>
> > Hi,
> > Starting at line 203:
> >   try {
> > /* bytesRead may not exactly equal the bytes read by a task:
> split
> > boundaries aren't
> >  * always at record boundaries, so tasks may need to read into
> > other splits to complete
> >  * a record. */
> > inputMetrics.bytesRead = split.inputSplit.value.getLength()
> >   } catch {
> > case e: java.io.IOException =>
> >   logWarning("Unable to get input size to set InputMetrics for
> > task", e)
> >   }
> >   context.taskMetrics.inputMetrics = Some(inputMetrics)
> >
> > If there is IOException, context.taskMetrics.inputMetrics is set by
> > wrapping inputMetrics - as if there wasn't any error.
> >
> > I wonder if the above code should distinguish the error condition.
> >
> > Cheers
> >
>


Re: Fraud management system implementation

2014-07-28 Thread Sandy Ryza
+user list
bcc: dev list

It's definitely possible to implement credit card fraud management using Spark.
 A good start would be using some of the supervised learning algorithms
that Spark provides in MLlib (logistic regression or linear SVMs).
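A minimal sketch of that supervised route with the 1.0 MLlib API (the
transactions RDD, its fields, and the iteration count are placeholders):

  import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
  import org.apache.spark.mllib.regression.LabeledPoint
  import org.apache.spark.mllib.linalg.Vectors

  // label 1.0 = fraudulent, 0.0 = legitimate; features come from your own transaction encoding
  val training = transactions.map { t =>
    LabeledPoint(if (t.isFraud) 1.0 else 0.0, Vectors.dense(t.features))
  }
  val model = LogisticRegressionWithSGD.train(training, 100)
  val prediction = model.predict(Vectors.dense(newTransaction.features))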

Spark doesn't have any HMM implementation right now.  Sean Owen has a great
talk on performing anomaly detection with KMeans clustering in Spark -
https://www.youtube.com/watch?v=TC5cKYBZAeI

-Sandy


On Mon, Jul 28, 2014 at 7:15 AM, jitendra shelar <
jitendra.shelar...@gmail.com> wrote:

> Hi,
>
> I am new to spark. I am learning spark and scala.
>
> I had some queries.
>
> 1) Can somebody please tell me if it is possible to implement credit
> card fraud management system using spark?
> 2) If yes, can somebody please guide me how to proceed.
> 3) Shall I prefer Scala or Java for this implementation?
>
> 4) Please suggest some pointers related to the Hidden Markov Model
> (HMM) and anomaly detection in data mining (using spark).
>
> Thanks,
> Jitendra
>


Re: Fine-Grained Scheduler on Yarn

2014-08-08 Thread Sandy Ryza
Hi Jun,

Spark currently doesn't have that feature, i.e. it aims for a fixed number
of executors per application regardless of resource usage, but it's
definitely worth considering.  We could start more executors when we have a
large backlog of tasks and shut some down when we're underutilized.

The fine-grained task scheduling is blocked on work from YARN that will
allow changing the CPU allocation of a YARN container dynamically.  The
relevant JIRA for this dependency is YARN-1197, though YARN-1488 might
serve this purpose as well if it comes first.

-Sandy


On Thu, Aug 7, 2014 at 10:56 PM, Jun Feng Liu  wrote:

> Thanks for echo on this. Possible to adjust resource based on container
> numbers? e.g to allocate more container when driver need more resources and
> return some resource by delete some container when parts of container
> already have enough cores/memory
>
> Best Regards
>
>
> *Jun Feng Liu*
>
> IBM China Systems & Technology Laboratory in Beijing
>
>   --
>  [image: 2D barcode - encoded with contact information]
> *Phone: *86-10-82452683
> * E-mail:* *liuj...@cn.ibm.com* 
> [image: IBM]
>
> BLD 28,ZGC Software Park
> No.8 Rd.Dong Bei Wang West, Dist.Haidian Beijing 100193
> China
>
>
>
>
>
>  *Patrick Wendell >*
>
> 2014/08/08 13:10
>   To
> Jun Feng Liu/China/IBM@IBMCN,
> cc
> "dev@spark.apache.org" 
> Subject
> Re: Fine-Grained Scheduler on Yarn
>
>
>
>
> Hey sorry about that - what I said was the opposite of what is true.
>
> The current YARN mode is equivalent to "coarse grained" mesos. There is no
> fine-grained scheduling on YARN at the moment. I'm not sure YARN supports
> scheduling in units other than containers. Fine-grained scheduling requires
> scheduling at the granularity of individual cores.
>
>
> On Thu, Aug 7, 2014 at 9:43 PM, Patrick Wendell <*pwend...@gmail.com*
> > wrote:
> The current YARN is equivalent to what is called "fine grained" mode in
> Mesos. The scheduling of tasks happens totally inside of the Spark driver.
>
>
> On Thu, Aug 7, 2014 at 7:50 PM, Jun Feng Liu <*liuj...@cn.ibm.com*
> > wrote:
> Any one know the answer?
> Best Regards
>
>
> * Jun Feng Liu*
>
> IBM China Systems & Technology Laboratory in Beijing
>
>   --
>  *Phone: *86-10-82452683
> * E-mail:* *liuj...@cn.ibm.com* 
>
>
> BLD 28,ZGC Software Park
> No.8 Rd.Dong Bei Wang West, Dist.Haidian Beijing 100193
> China
>
>
>
>
>   *Jun Feng Liu/China/IBM*
>
> 2014/08/07 15:37
>
>   To
> *dev@spark.apache.org* ,
> cc
>   Subject
> Fine-Grained Scheduler on Yarn
>
>
>
>
>
> Hi, there
>
> Just aware right now Spark only support fine grained scheduler on Mesos
> with MesosSchedulerBackend. The Yarn schedule sounds like only works on
> coarse-grained model. Is there any plan to implement fine-grained scheduler
> for YARN? Or there is any technical issue block us to do that.
>
> Best Regards
>
>
> * Jun Feng Liu*
>
> IBM China Systems & Technology Laboratory in Beijing
>
>   --
>  *Phone: *86-10-82452683
> * E-mail:* *liuj...@cn.ibm.com* 
>
>
> BLD 28,ZGC Software Park
> No.8 Rd.Dong Bei Wang West, Dist.Haidian Beijing 100193
> China
>
>
>
>
>
>
>


Re: Fine-Grained Scheduler on Yarn

2014-08-08 Thread Sandy Ryza
I think that would be useful work.  I don't know the minute details of this
code, but in general TaskSchedulerImpl keeps track of pending tasks.  Tasks
are organized into TaskSets, each of which corresponds to a particular
stage.  Each TaskSet has a TaskSetManager, which directly tracks the
pending tasks for that stage.

-Sandy


On Fri, Aug 8, 2014 at 12:37 AM, Jun Feng Liu  wrote:

> Yes, I think we need both level resource control (container numbers and
> dynamically change container resources), which can make the resource
> utilization much more effective, especially when we have more types work
> load share the same infrastructure.
>
> Is there anyway I can observe the tasks backlog in schedulerbackend?
> Sounds like scheduler backend be triggered during new taskset submitted. I
> did not figured if there is a way to check the whole backlog tasks inside
> it. I am interesting to implement some policy in schedulerbackend and test
> to see how useful it is going to be.
>
> Best Regards
>
>
> *Jun Feng Liu*
> IBM China Systems & Technology Laboratory in Beijing
>
>   --
>  [image: 2D barcode - encoded with contact information] *Phone: 
> *86-10-82452683
>
> * E-mail:* *liuj...@cn.ibm.com* 
> [image: IBM]
>
> BLD 28,ZGC Software Park
> No.8 Rd.Dong Bei Wang West, Dist.Haidian Beijing 100193
> China
>
>
>
>
>
>  *Sandy Ryza >*
>
> 2014/08/08 15:14
>   To
> Jun Feng Liu/China/IBM@IBMCN,
> cc
> Patrick Wendell , "dev@spark.apache.org" <
> dev@spark.apache.org>
> Subject
> Re: Fine-Grained Scheduler on Yarn
>
>
>
>
> Hi Jun,
>
> Spark currently doesn't have that feature, i.e. it aims for a fixed number
> of executors per application regardless of resource usage, but it's
> definitely worth considering.  We could start more executors when we have a
> large backlog of tasks and shut some down when we're underutilized.
>
> The fine-grained task scheduling is blocked on work from YARN that will
> allow changing the CPU allocation of a YARN container dynamically.  The
> relevant JIRA for this dependency is YARN-1197, though YARN-1488 might
> serve this purpose as well if it comes first.
>
> -Sandy
>
>
> On Thu, Aug 7, 2014 at 10:56 PM, Jun Feng Liu  wrote:
>
> > Thanks for echo on this. Possible to adjust resource based on container
> > numbers? e.g to allocate more container when driver need more resources
> and
> > return some resource by delete some container when parts of container
> > already have enough cores/memory
> >
> > Best Regards
> >
> >
> > *Jun Feng Liu*
>
> >
> > IBM China Systems & Technology Laboratory in Beijing
> >
> >   --
>
> >  [image: 2D barcode - encoded with contact information]
> > *Phone: *86-10-82452683
> > * E-mail:* *liuj...@cn.ibm.com* 
>
> > [image: IBM]
> >
> > BLD 28,ZGC Software Park
> > No.8 Rd.Dong Bei Wang West, Dist.Haidian Beijing 100193
> > China
> >
> >
> >
> >
> >
> >  *Patrick Wendell >*
>
> >
> > 2014/08/08 13:10
> >   To
> > Jun Feng Liu/China/IBM@IBMCN,
> > cc
> > "dev@spark.apache.org" 
> > Subject
> > Re: Fine-Grained Scheduler on Yarn
> >
> >
> >
> >
> > Hey sorry about that - what I said was the opposite of what is true.
> >
> > The current YARN mode is equivalent to "coarse grained" mesos. There is
> no
> > fine-grained scheduling on YARN at the moment. I'm not sure YARN supports
> > scheduling in units other than containers. Fine-grained scheduling
> requires
> > scheduling at the granularity of individual cores.
> >
> >
> > On Thu, Aug 7, 2014 at 9:43 PM, Patrick Wendell <*pwend...@gmail.com*
>
> > > wrote:
> > The current YARN is equivalent to what is called "fine grained" mode in
> > Mesos. The scheduling of tasks happens totally inside of the Spark
> driver.
> >
> >
> > On Thu, Aug 7, 2014 at 7:50 PM, Jun Feng Liu <*liuj...@cn.ibm.com*
>
> > > wrote:
> > Any one know the answer?
> > Best Regards
> >
> >
> > * Jun Feng Liu*
>
> >
> > IBM China Systems & Technology Laboratory in Beijing
> >
> >   --
> >  *Phone: *86-10-82452683
> > * E-mail:* *liuj...@cn.ibm.com* 
>
> >
> >
> > BLD 28,ZGC Software Park
> > No.8 Rd.Dong Bei Wang West, Dist.Haidian Beijing 100193
> > China
> >
> >
> >
> >
> >   *Jun Feng

Re: spark-shell is broken! (bad option: '--master')

2014-08-08 Thread Sandy Ryza
Hi Chutium,

This is currently being addressed in
https://github.com/apache/spark/pull/1825

-Sandy


On Fri, Aug 8, 2014 at 2:26 PM, chutium  wrote:

> no one use spark-shell in master branch?
>
> i created a PR as follow up commit of SPARK-2678 and PR #1801:
>
> https://github.com/apache/spark/pull/1861
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/spark-shell-is-broken-bad-option-master-tp7778p7780.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Extra libs for bin/spark-shell - specifically for hbase

2014-08-16 Thread Sandy Ryza
Hi Stephen,

Have you tried the --jars option (with jars separated by commas)?  It
should make the given jars available both to the driver and the executors.
 I believe one caveat currently is that if you give it a folder it won't
pick up all the jars inside.
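One common trick, since --jars wants a flat comma-separated list rather than a
directory, is to expand the HBase lib directory on the command line (adjust the
paths to your layout):

  ./bin/spark-shell --jars $(echo $HBASE_HOME/lib/*.jar | tr ' ' ',')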

-Sandy


On Fri, Aug 15, 2014 at 4:07 PM, Stephen Boesch  wrote:

> Although this has been discussed a number of times here, I am still unclear
> how to add user jars to the spark-shell:
>
> a) for importing classes for use directly within the shell interpreter
>
> b) for  invoking SparkContext commands with closures referencing user
> supplied classes contained within jar's.
>
> Similarly to other posts, I have gone through:
>
>  updating bin/spark-env.sh
>  SPARK_CLASSPATH
>  SPARK_SUBMIT_OPTS
>   creating conf/spark-defaults.conf  and adding
>  spark.executor.extraClassPath
> --driver-class-path
>   etc
>
> Hopefully there would be something along the lines of  a single entry added
> to some claspath somewhere like this
>
>SPARK_CLASSPATH/driver-class-path/spark.executor.extraClassPath (or
> whatever is the correct option..)  =
> $HBASE_HOME/*:$HBASE_HOME/lib/*:$SPARK_CLASSPATH
>
> Any ideas here?
>
> thanks
>


Re: Lost executor on YARN ALS iterations

2014-08-20 Thread Sandy Ryza
Hi Debasish,

The fix is to raise spark.yarn.executor.memoryOverhead until this goes
away.  This controls the buffer between the JVM heap size and the amount of
memory requested from YARN (JVMs can take up memory beyond their heap
size). You should also make sure that, in the YARN NodeManager
configuration, yarn.nodemanager.vmem-check-enabled is set to false.
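Concretely, that looks something like the following; the 1024 MB overhead is
just a starting point to tune upward from:

  # spark-defaults.conf (value is in MB)
  spark.yarn.executor.memoryOverhead   1024

  <!-- yarn-site.xml, on the NodeManagers -->
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>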

-Sandy


On Wed, Aug 20, 2014 at 12:27 AM, Debasish Das 
wrote:

> I could reproduce the issue in both 1.0 and 1.1 using YARN...so this is
> definitely a YARN related problem...
>
> At least for me right now only deployment option possible is standalone...
>
>
>
> On Tue, Aug 19, 2014 at 11:29 PM, Xiangrui Meng  wrote:
>
>> Hi Deb,
>>
>> I think this may be the same issue as described in
>> https://issues.apache.org/jira/browse/SPARK-2121 . We know that the
>> container got killed by YARN because it used much more memory than it
>> requested. But we haven't figured out the root cause yet.
>>
>> +Sandy
>>
>> Best,
>> Xiangrui
>>
>> On Tue, Aug 19, 2014 at 8:51 PM, Debasish Das 
>> wrote:
>> > Hi,
>> >
>> > During the 4th ALS iteration, I am noticing that one of the executor
>> gets
>> > disconnected:
>> >
>> > 14/08/19 23:40:00 ERROR network.ConnectionManager: Corresponding
>> > SendingConnectionManagerId not found
>> >
>> > 14/08/19 23:40:00 INFO cluster.YarnClientSchedulerBackend: Executor 5
>> > disconnected, so removing it
>> >
>> > 14/08/19 23:40:00 ERROR cluster.YarnClientClusterScheduler: Lost
>> executor 5
>> > on tblpmidn42adv-hdp.tdc.vzwcorp.com: remote Akka client disassociated
>> >
>> > 14/08/19 23:40:00 INFO scheduler.DAGScheduler: Executor lost: 5 (epoch
>> 12)
>> > Any idea if this is a bug related to akka on YARN ?
>> >
>> > I am using master
>> >
>> > Thanks.
>> > Deb
>>
>
>


Re: about spark assembly jar

2014-09-02 Thread Sandy Ryza
This doesn't help for every dependency, but Spark provides an option to
build the assembly jar without Hadoop and its dependencies.  We make use of
this in CDH packaging.
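For reference, the Maven profile in question is (if memory serves)
hadoop-provided; a build using it might look like the following, with the
Hadoop version and other profiles just an example (check the building-with-Maven
docs for your branch):

  mvn -Pyarn -Phadoop-provided -Dhadoop.version=2.5.0 -DskipTests clean package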

-Sandy


On Tue, Sep 2, 2014 at 2:12 AM, scwf  wrote:

> Hi sean owen,
> here are some problems when i used assembly jar
> 1 i put spark-assembly-*.jar to the lib directory of my application, it
> throw compile error
>
> Error:scalac: Error: class scala.reflect.BeanInfo not found.
> scala.tools.nsc.MissingRequirementError: class scala.reflect.BeanInfo not
> found.
>
> at scala.tools.nsc.symtab.Definitions$definitions$.
> getModuleOrClass(Definitions.scala:655)
>
> at scala.tools.nsc.symtab.Definitions$definitions$.
> getClass(Definitions.scala:608)
>
> at scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator.<
> init>(GenJVM.scala:127)
>
> at scala.tools.nsc.backend.jvm.GenJVM$JvmPhase.run(GenJVM.
> scala:85)
>
> at scala.tools.nsc.Global$Run.compileSources(Global.scala:953)
>
> at scala.tools.nsc.Global$Run.compile(Global.scala:1041)
>
> at xsbt.CachedCompiler0.run(CompilerInterface.scala:126)
>
> at xsbt.CachedCompiler0.liftedTree1$1(CompilerInterface.scala:102)
>
> at xsbt.CachedCompiler0.run(CompilerInterface.scala:102)
>
> at xsbt.CompilerInterface.run(CompilerInterface.scala:27)
>
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> at sun.reflect.NativeMethodAccessorImpl.invoke(
> NativeMethodAccessorImpl.java:39)
>
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:25)
>
> at java.lang.reflect.Method.invoke(Method.java:597)
>
> at sbt.compiler.AnalyzingCompiler.call(
> AnalyzingCompiler.scala:102)
>
> at sbt.compiler.AnalyzingCompiler.compile(
> AnalyzingCompiler.scala:48)
>
> at sbt.compiler.AnalyzingCompiler.compile(
> AnalyzingCompiler.scala:41)
>
> at org.jetbrains.jps.incremental.scala.local.
> IdeaIncrementalCompiler.compile(IdeaIncrementalCompiler.scala:28)
>
> at org.jetbrains.jps.incremental.scala.local.LocalServer.
> compile(LocalServer.scala:25)
>
> at org.jetbrains.jps.incremental.scala.remote.Main$.make(Main.
> scala:58)
>
> at org.jetbrains.jps.incremental.scala.remote.Main$.nailMain(
> Main.scala:21)
>
> at org.jetbrains.jps.incremental.scala.remote.Main.nailMain(
> Main.scala)
>
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> at sun.reflect.NativeMethodAccessorImpl.invoke(
> NativeMethodAccessorImpl.java:39)
>
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:25)
>
> at java.lang.reflect.Method.invoke(Method.java:597)
>
> at com.martiansoftware.nailgun.NGSession.run(NGSession.java:319)
> 2 i test my branch which updated hive version to org.apache.hive 0.13.1
>   it run successfully when use a bag of 3rd jars as dependency but throw
> error using assembly jar, it seems assembly jar lead to conflict
>   ERROR DDLTask: java.lang.NoSuchFieldError: doubleTypeInfo
> at org.apache.hadoop.hive.ql.io.parquet.serde.
> ArrayWritableObjectInspector.getObjectInspector(
> ArrayWritableObjectInspector.java:66)
> at org.apache.hadoop.hive.ql.io.parquet.serde.
> ArrayWritableObjectInspector.(ArrayWritableObjectInspector.java:59)
> at org.apache.hadoop.hive.ql.io.parquet.serde.
> ParquetHiveSerDe.initialize(ParquetHiveSerDe.java:113)
> at org.apache.hadoop.hive.metastore.MetaStoreUtils.
> getDeserializer(MetaStoreUtils.java:339)
> at org.apache.hadoop.hive.ql.metadata.Table.
> getDeserializerFromMetaStore(Table.java:283)
> at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(
> Table.java:189)
> at org.apache.hadoop.hive.ql.metadata.Hive.createTable(
> Hive.java:597)
> at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(
> DDLTask.java:4194)
> at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.
> java:281)
> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
> at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(
> TaskRunner.java:85)
>
>
>
>
>
> On 2014/9/2 16:45, Sean Owen wrote:
>
>> Hm, are you suggesting that the Spark distribution be a bag of 100
>> JARs? It doesn't quite seem reasonable. It does not remove version
>> conflicts, just pushes them to run-time, which isn't good. The
>> assembly is also necessary because that's where shading happens. In
>> development, you want to run against exactly what will be used in a
>> real Spark distro.
>>
>> On Tue, Sep 2, 2014 at 9:39 AM, scwf  wrote:
>>
>>> hi, all
>>>I suggest spark not use assembly jar as default run-time
>>> dependency(spark-submit/spark-class depend on assembly jar),use a
>>> library of
>>> all 3rd dependency jar like hadoop/hive/hbase more reasonable.
>>>
>>>1 assembly jar packaged all 3rd jars into a big one, 

Re: Lost executor on YARN ALS iterations

2014-09-09 Thread Sandy Ryza
Hi Deb,

The current state of the art is to increase
spark.yarn.executor.memoryOverhead until the job stops failing.  We do have
plans to try to automatically scale this based on the amount of memory
requested, but it will still just be a heuristic.

-Sandy

On Tue, Sep 9, 2014 at 7:32 AM, Debasish Das 
wrote:

> Hi Sandy,
>
> Any resolution for YARN failures ? It's a blocker for running spark on top
> of YARN.
>
> Thanks.
> Deb
>
> On Tue, Aug 19, 2014 at 11:29 PM, Xiangrui Meng  wrote:
>
>> Hi Deb,
>>
>> I think this may be the same issue as described in
>> https://issues.apache.org/jira/browse/SPARK-2121 . We know that the
>> container got killed by YARN because it used much more memory than it
>> requested. But we haven't figured out the root cause yet.
>>
>> +Sandy
>>
>> Best,
>> Xiangrui
>>
>> On Tue, Aug 19, 2014 at 8:51 PM, Debasish Das 
>> wrote:
>> > Hi,
>> >
>> > During the 4th ALS iteration, I am noticing that one of the executor
>> gets
>> > disconnected:
>> >
>> > 14/08/19 23:40:00 ERROR network.ConnectionManager: Corresponding
>> > SendingConnectionManagerId not found
>> >
>> > 14/08/19 23:40:00 INFO cluster.YarnClientSchedulerBackend: Executor 5
>> > disconnected, so removing it
>> >
>> > 14/08/19 23:40:00 ERROR cluster.YarnClientClusterScheduler: Lost
>> executor 5
>> > on tblpmidn42adv-hdp.tdc.vzwcorp.com: remote Akka client disassociated
>> >
>> > 14/08/19 23:40:00 INFO scheduler.DAGScheduler: Executor lost: 5 (epoch
>> 12)
>> > Any idea if this is a bug related to akka on YARN ?
>> >
>> > I am using master
>> >
>> > Thanks.
>> > Deb
>>
>
>


Re: Lost executor on YARN ALS iterations

2014-09-09 Thread Sandy Ryza
I would expect 2 GB to be enough, or more than enough, for 16 GB executors
(unless ALS is using a bunch of off-heap memory?).  You mentioned earlier
in this thread that the property wasn't showing up in the Environment tab.
 Are you sure it's making it in?

-Sandy

On Tue, Sep 9, 2014 at 11:58 AM, Debasish Das 
wrote:

> Hmm...I did try it increase to few gb but did not get a successful run
> yet...
>
> Any idea if I am using say 40 executors, each running 16GB, what's the
> typical spark.yarn.executor.memoryOverhead for say 100M x 10 M large
> matrices with say few billion ratings...
>
> On Tue, Sep 9, 2014 at 10:49 AM, Sandy Ryza 
> wrote:
>
>> Hi Deb,
>>
>> The current state of the art is to increase
>> spark.yarn.executor.memoryOverhead until the job stops failing.  We do have
>> plans to try to automatically scale this based on the amount of memory
>> requested, but it will still just be a heuristic.
>>
>> -Sandy
>>
>> On Tue, Sep 9, 2014 at 7:32 AM, Debasish Das 
>> wrote:
>>
>>> Hi Sandy,
>>>
>>> Any resolution for YARN failures ? It's a blocker for running spark on
>>> top of YARN.
>>>
>>> Thanks.
>>> Deb
>>>
>>> On Tue, Aug 19, 2014 at 11:29 PM, Xiangrui Meng 
>>> wrote:
>>>
>>>> Hi Deb,
>>>>
>>>> I think this may be the same issue as described in
>>>> https://issues.apache.org/jira/browse/SPARK-2121 . We know that the
>>>> container got killed by YARN because it used much more memory than it
>>>> requested. But we haven't figured out the root cause yet.
>>>>
>>>> +Sandy
>>>>
>>>> Best,
>>>> Xiangrui
>>>>
>>>> On Tue, Aug 19, 2014 at 8:51 PM, Debasish Das 
>>>> wrote:
>>>> > Hi,
>>>> >
>>>> > During the 4th ALS iteration, I am noticing that one of the executor
>>>> gets
>>>> > disconnected:
>>>> >
>>>> > 14/08/19 23:40:00 ERROR network.ConnectionManager: Corresponding
>>>> > SendingConnectionManagerId not found
>>>> >
>>>> > 14/08/19 23:40:00 INFO cluster.YarnClientSchedulerBackend: Executor 5
>>>> > disconnected, so removing it
>>>> >
>>>> > 14/08/19 23:40:00 ERROR cluster.YarnClientClusterScheduler: Lost
>>>> executor 5
>>>> > on tblpmidn42adv-hdp.tdc.vzwcorp.com: remote Akka client
>>>> disassociated
>>>> >
>>>> > 14/08/19 23:40:00 INFO scheduler.DAGScheduler: Executor lost: 5
>>>> (epoch 12)
>>>> > Any idea if this is a bug related to akka on YARN ?
>>>> >
>>>> > I am using master
>>>> >
>>>> > Thanks.
>>>> > Deb
>>>>
>>>
>>>
>>
>


Re: Lost executor on YARN ALS iterations

2014-09-10 Thread Sandy Ryza
That's right

On Tue, Sep 9, 2014 at 2:04 PM, Debasish Das 
wrote:

> Last time it did not show up on environment tab but I will give it another
> shot...Expected behavior is that this env variable will show up right ?
>
> On Tue, Sep 9, 2014 at 12:15 PM, Sandy Ryza 
> wrote:
>
>> I would expect 2 GB would be enough or more than enough for 16 GB
>> executors (unless ALS is using a bunch of off-heap memory?).  You mentioned
>> earlier in this thread that the property wasn't showing up in the
>> Environment tab.  Are you sure it's making it in?
>>
>> -Sandy
>>
>> On Tue, Sep 9, 2014 at 11:58 AM, Debasish Das 
>> wrote:
>>
>>> Hmm...I did try it increase to few gb but did not get a successful run
>>> yet...
>>>
>>> Any idea if I am using say 40 executors, each running 16GB, what's the
>>> typical spark.yarn.executor.memoryOverhead for say 100M x 10 M large
>>> matrices with say few billion ratings...
>>>
>>> On Tue, Sep 9, 2014 at 10:49 AM, Sandy Ryza 
>>> wrote:
>>>
>>>> Hi Deb,
>>>>
>>>> The current state of the art is to increase
>>>> spark.yarn.executor.memoryOverhead until the job stops failing.  We do have
>>>> plans to try to automatically scale this based on the amount of memory
>>>> requested, but it will still just be a heuristic.
>>>>
>>>> -Sandy
>>>>
>>>> On Tue, Sep 9, 2014 at 7:32 AM, Debasish Das 
>>>> wrote:
>>>>
>>>>> Hi Sandy,
>>>>>
>>>>> Any resolution for YARN failures ? It's a blocker for running spark on
>>>>> top of YARN.
>>>>>
>>>>> Thanks.
>>>>> Deb
>>>>>
>>>>> On Tue, Aug 19, 2014 at 11:29 PM, Xiangrui Meng 
>>>>> wrote:
>>>>>
>>>>>> Hi Deb,
>>>>>>
>>>>>> I think this may be the same issue as described in
>>>>>> https://issues.apache.org/jira/browse/SPARK-2121 . We know that the
>>>>>> container got killed by YARN because it used much more memory than it
>>>>>> requested. But we haven't figured out the root cause yet.
>>>>>>
>>>>>> +Sandy
>>>>>>
>>>>>> Best,
>>>>>> Xiangrui
>>>>>>
>>>>>> On Tue, Aug 19, 2014 at 8:51 PM, Debasish Das <
>>>>>> debasish.da...@gmail.com> wrote:
>>>>>> > Hi,
>>>>>> >
>>>>>> > During the 4th ALS iteration, I am noticing that one of the
>>>>>> executor gets
>>>>>> > disconnected:
>>>>>> >
>>>>>> > 14/08/19 23:40:00 ERROR network.ConnectionManager: Corresponding
>>>>>> > SendingConnectionManagerId not found
>>>>>> >
>>>>>> > 14/08/19 23:40:00 INFO cluster.YarnClientSchedulerBackend: Executor
>>>>>> 5
>>>>>> > disconnected, so removing it
>>>>>> >
>>>>>> > 14/08/19 23:40:00 ERROR cluster.YarnClientClusterScheduler: Lost
>>>>>> executor 5
>>>>>> > on tblpmidn42adv-hdp.tdc.vzwcorp.com: remote Akka client
>>>>>> disassociated
>>>>>> >
>>>>>> > 14/08/19 23:40:00 INFO scheduler.DAGScheduler: Executor lost: 5
>>>>>> (epoch 12)
>>>>>> > Any idea if this is a bug related to akka on YARN ?
>>>>>> >
>>>>>> > I am using master
>>>>>> >
>>>>>> > Thanks.
>>>>>> > Deb
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


Reporting serialized task size after task broadcast change?

2014-09-11 Thread Sandy Ryza
After the change to broadcast all task data, is there any easy way to
discover the serialized size of the data getting sent down for a task?

thanks,
-Sandy


Re: Reporting serialized task size after task broadcast change?

2014-09-11 Thread Sandy Ryza
It used to be available on the UI, no?

On Thu, Sep 11, 2014 at 6:26 PM, Reynold Xin  wrote:

> I don't think so. We should probably add a line to log it.
>
>
> On Thursday, September 11, 2014, Sandy Ryza 
> wrote:
>
>> After the change to broadcast all task data, is there any easy way to
>> discover the serialized size of the data getting sent down for a task?
>>
>> thanks,
>> -Sandy
>>
>

