TorrentBroadcast aka Cornet?

2014-05-19 Thread Andrew Ash
Hi Spark devs,

Is the algorithm for TorrentBroadcast
(https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala)
the same as Cornet from the below paper?

http://www.mosharaf.com/wp-content/uploads/orchestra-sigcomm11.pdf

If so it would be nice to include a link to the paper in the Javadoc for
the class.

Thanks!
Andrew


Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-19 Thread DB Tsai
Hi Sean,

It's true that the issue here is the classloader: because of the classloader
delegation model, users have to use reflection in the executors to pick up
the right classloader in order to use the classes added by the sc.addJar API.
However, this is very inconvenient for users, and it isn't documented in Spark.

I'm working on a patch to solve it by calling the protected method addURL
in URLClassLoader to update the current default classloader, so no custom
classloader is needed anymore. I wonder if this is a good way to go.

import java.io.IOException
import java.lang.reflect.Method
import java.net.{URL, URLClassLoader}

// Reflectively invoke URLClassLoader's protected addURL so that jars added
// via sc.addJar become visible to an existing loader.
private def addURL(url: URL, loader: URLClassLoader) {
  try {
    val method: Method =
      classOf[URLClassLoader].getDeclaredMethod("addURL", classOf[URL])
    method.setAccessible(true)
    method.invoke(loader, url)
  } catch {
    case t: Throwable =>
      throw new IOException("Error, could not add URL to system classloader")
  }
}
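
For illustration, a hypothetical call site for the helper above -- it assumes
the executor's current context classloader really is a URLClassLoader, and the
jar path is made up:

---
import java.io.File
import java.net.URLClassLoader

Thread.currentThread().getContextClassLoader match {
  case ucl: URLClassLoader =>
    // Make a jar that arrived via sc.addJar visible to the loader the
    // executor threads already use, so plain "new Foo()" can see its classes.
    addURL(new File("/tmp/added-by-addJar.jar").toURI.toURL, ucl)
  case _ =>
    // Not a URLClassLoader; the patch would need a different strategy here.
}
---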



Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Sun, May 18, 2014 at 11:57 PM, Sean Owen  wrote:

> I might be stating the obvious for everyone, but the issue here is not
> reflection or the source of the JAR, but the ClassLoader. The basic
> rules are this.
>
> "new Foo" will use the ClassLoader that defines Foo. This is usually
> the ClassLoader that loaded whatever it is that first referenced Foo
> and caused it to be loaded -- usually the ClassLoader holding your
> other app classes.
>
> ClassLoaders can have a parent-child relationship. ClassLoaders always
> look in their parent before themselves.
>
> (Careful then -- in contexts like Hadoop or Tomcat where your app is
> loaded in a child ClassLoader, and you reference a class that Hadoop
> or Tomcat also has (like a lib class) you will get the container's
> version!)
>
> When you load an external JAR it has a separate ClassLoader which does
> not necessarily bear any relation to the one containing your app
> classes, so yeah it is not generally going to make "new Foo" work.
>
> Reflection lets you pick the ClassLoader, yes.
>
> I would not call setContextClassLoader.
>
> On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza 
> wrote:
> > I spoke with DB offline about this a little while ago and he confirmed
> that
> > he was able to access the jar from the driver.
> >
> > The issue appears to be a general Java issue: you can't directly
> > instantiate a class from a dynamically loaded jar.
> >
> > I reproduced it locally outside of Spark with:
> > ---
> > URLClassLoader urlClassLoader = new URLClassLoader(new URL[] { new
> > File("myotherjar.jar").toURI().toURL() }, null);
> > Thread.currentThread().setContextClassLoader(urlClassLoader);
> > MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar();
> > ---
> >
> > I was able to load the class with reflection.
>


Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-19 Thread Andrew Ash
Sounds like the problem is that classloaders always look in their parents
before themselves, and Spark users want executors to pick up classes from
their custom code before the ones in Spark plus its dependencies.

Would a custom classloader that delegates to the parent after first
checking itself fix this up?
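
For reference, a rough sketch (not actual Spark code) of the child-first
delegation being suggested here: a URLClassLoader that checks its own URLs
before falling back to the parent.

---
import java.net.{URL, URLClassLoader}

// Child-first variant: try findClass on our own URLs first, and only use the
// normal parent-first path as a fallback.
class ChildFirstClassLoader(urls: Array[URL], parent: ClassLoader)
  extends URLClassLoader(urls, parent) {

  override def loadClass(name: String, resolve: Boolean): Class[_] = {
    var c = findLoadedClass(name)
    if (c == null) {
      c = try {
        findClass(name)                  // our own jars win
      } catch {
        case _: ClassNotFoundException =>
          super.loadClass(name, resolve) // fall back to the parent chain
      }
    }
    if (resolve) resolveClass(c)
    c
  }
}
---

Something along these lines is essentially what the
spark.files.userClassPathFirst option mentioned later in this thread enables.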


On Mon, May 19, 2014 at 12:17 AM, DB Tsai  wrote:

> Hi Sean,
>
> It's true that the issue here is classloader, and due to the classloader
> delegation model, users have to use reflection in the executors to pick up
> the classloader in order to use those classes added by sc.addJars APIs.
> However, it's very inconvenience for users, and not documented in spark.
>
> I'm working on a patch to solve it by calling the protected method addURL
> in URLClassLoader to update the current default classloader, so no
> customClassLoader anymore. I wonder if this is an good way to go.
>
>   private def addURL(url: URL, loader: URLClassLoader){
> try {
>   val method: Method =
> classOf[URLClassLoader].getDeclaredMethod("addURL", classOf[URL])
>   method.setAccessible(true)
>   method.invoke(loader, url)
> }
> catch {
>   case t: Throwable => {
> throw new IOException("Error, could not add URL to system
> classloader")
>   }
> }
>   }
>
>
>
> Sincerely,
>
> DB Tsai
> ---
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Sun, May 18, 2014 at 11:57 PM, Sean Owen  wrote:
>
> > I might be stating the obvious for everyone, but the issue here is not
> > reflection or the source of the JAR, but the ClassLoader. The basic
> > rules are this.
> >
> > "new Foo" will use the ClassLoader that defines Foo. This is usually
> > the ClassLoader that loaded whatever it is that first referenced Foo
> > and caused it to be loaded -- usually the ClassLoader holding your
> > other app classes.
> >
> > ClassLoaders can have a parent-child relationship. ClassLoaders always
> > look in their parent before themselves.
> >
> > (Careful then -- in contexts like Hadoop or Tomcat where your app is
> > loaded in a child ClassLoader, and you reference a class that Hadoop
> > or Tomcat also has (like a lib class) you will get the container's
> > version!)
> >
> > When you load an external JAR it has a separate ClassLoader which does
> > not necessarily bear any relation to the one containing your app
> > classes, so yeah it is not generally going to make "new Foo" work.
> >
> > Reflection lets you pick the ClassLoader, yes.
> >
> > I would not call setContextClassLoader.
> >
> > On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza 
> > wrote:
> > > I spoke with DB offline about this a little while ago and he confirmed
> > that
> > > he was able to access the jar from the driver.
> > >
> > > The issue appears to be a general Java issue: you can't directly
> > > instantiate a class from a dynamically loaded jar.
> > >
> > > I reproduced it locally outside of Spark with:
> > > ---
> > > URLClassLoader urlClassLoader = new URLClassLoader(new URL[] { new
> > > File("myotherjar.jar").toURI().toURL() }, null);
> > > Thread.currentThread().setContextClassLoader(urlClassLoader);
> > > MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar();
> > > ---
> > >
> > > I was able to load the class with reflection.
> >
>


Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-19 Thread Sean Owen
I don't think a custom classloader is necessary.

Well, it occurs to me that this is no new problem. Hadoop, Tomcat, etc
all run custom user code that creates new user objects without
reflection. I should go see how that's done. Maybe it's totally valid
to set the thread's context classloader for just this purpose, and I
am not thinking clearly.

On Mon, May 19, 2014 at 8:26 AM, Andrew Ash  wrote:
> Sounds like the problem is that classloaders always look in their parents
> before themselves, and Spark users want executors to pick up classes from
> their custom code before the ones in Spark plus its dependencies.
>
> Would a custom classloader that delegates to the parent after first
> checking itself fix this up?
>
>
> On Mon, May 19, 2014 at 12:17 AM, DB Tsai  wrote:
>
>> Hi Sean,
>>
>> It's true that the issue here is classloader, and due to the classloader
>> delegation model, users have to use reflection in the executors to pick up
>> the classloader in order to use those classes added by sc.addJars APIs.
>> However, it's very inconvenience for users, and not documented in spark.
>>
>> I'm working on a patch to solve it by calling the protected method addURL
>> in URLClassLoader to update the current default classloader, so no
>> customClassLoader anymore. I wonder if this is an good way to go.
>>
>>   private def addURL(url: URL, loader: URLClassLoader){
>> try {
>>   val method: Method =
>> classOf[URLClassLoader].getDeclaredMethod("addURL", classOf[URL])
>>   method.setAccessible(true)
>>   method.invoke(loader, url)
>> }
>> catch {
>>   case t: Throwable => {
>> throw new IOException("Error, could not add URL to system
>> classloader")
>>   }
>> }
>>   }
>>
>>
>>
>> Sincerely,
>>
>> DB Tsai
>> ---
>> My Blog: https://www.dbtsai.com
>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>
>>
>> On Sun, May 18, 2014 at 11:57 PM, Sean Owen  wrote:
>>
>> > I might be stating the obvious for everyone, but the issue here is not
>> > reflection or the source of the JAR, but the ClassLoader. The basic
>> > rules are this.
>> >
>> > "new Foo" will use the ClassLoader that defines Foo. This is usually
>> > the ClassLoader that loaded whatever it is that first referenced Foo
>> > and caused it to be loaded -- usually the ClassLoader holding your
>> > other app classes.
>> >
>> > ClassLoaders can have a parent-child relationship. ClassLoaders always
>> > look in their parent before themselves.
>> >
>> > (Careful then -- in contexts like Hadoop or Tomcat where your app is
>> > loaded in a child ClassLoader, and you reference a class that Hadoop
>> > or Tomcat also has (like a lib class) you will get the container's
>> > version!)
>> >
>> > When you load an external JAR it has a separate ClassLoader which does
>> > not necessarily bear any relation to the one containing your app
>> > classes, so yeah it is not generally going to make "new Foo" work.
>> >
>> > Reflection lets you pick the ClassLoader, yes.
>> >
>> > I would not call setContextClassLoader.
>> >
>> > On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza 
>> > wrote:
>> > > I spoke with DB offline about this a little while ago and he confirmed
>> > that
>> > > he was able to access the jar from the driver.
>> > >
>> > > The issue appears to be a general Java issue: you can't directly
>> > > instantiate a class from a dynamically loaded jar.
>> > >
>> > > I reproduced it locally outside of Spark with:
>> > > ---
>> > > URLClassLoader urlClassLoader = new URLClassLoader(new URL[] { new
>> > > File("myotherjar.jar").toURI().toURL() }, null);
>> > > Thread.currentThread().setContextClassLoader(urlClassLoader);
>> > > MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar();
>> > > ---
>> > >
>> > > I was able to load the class with reflection.
>> >
>>


Re: TorrentBroadcast aka Cornet?

2014-05-19 Thread Matei Zaharia
TorrentBroadcast is actually slightly simpler, but it’s based on that. It has
similar performance. I’d like to make it the default in a future version; we
just haven’t had a ton of testing with it yet (kind of an oversight in this
release, unfortunately).
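
For anyone who wants to try it before it becomes the default, a minimal sketch
of opting in, assuming the 0.9/1.0-era spark.broadcast.factory property is the
switch that selects the broadcast implementation:

---
import org.apache.spark.{SparkConf, SparkContext}

// Assumed property name; selects the broadcast implementation for this job.
val conf = new SparkConf()
  .setAppName("torrent-broadcast-test")
  .set("spark.broadcast.factory",
       "org.apache.spark.broadcast.TorrentBroadcastFactory")
val sc = new SparkContext(conf)

// Broadcasts created from here on use the torrent-style distribution path.
val big = sc.broadcast(Array.fill(1 << 20)(1.0))
---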

Matei

On May 19, 2014, at 12:07 AM, Andrew Ash  wrote:

> Hi Spark devs,
> 
> Is the algorithm for TorrentBroadcast the same as Cornet from the below
> paper?
> 
> http://www.mosharaf.com/wp-content/uploads/orchestra-sigcomm11.pdf
> 
> If so it would be nice to include a link to the paper in the Javadoc for
> the class.
> 
> Thanks!
> Andrew



question about Spark repositories in GitHub

2014-05-19 Thread Gil Vernik
Hello,

I am new to the Spark community, so I apologize if I ask something 
obvious.
I am following the contribution document for Spark, which says that I need to
fork the https://github.com/apache/spark repository.

I got a little bit confused, since the repository
https://github.com/apache/spark contains various branches: branch-0.5 through
branch-1.0, master, scala-2.9, and streaming.

Which branch has the latest code for the next planned release? I guess it's
Spark 1.0, so should I work with branch-1.0 or master?
Another question: what are the scala-2.9 and streaming branches?

Thanking you in advance,
Gil Vernik.






Re: question about Spark repositories in GitHub

2014-05-19 Thread Matei Zaharia
“master” is where development happens, while branch-1.0, branch-0.9, etc are 
for maintenance releases in those versions. Most likely if you want to 
contribute you should use master. Some of the other named branches were for big 
features in the past, but none are actively used now.

Matei

On May 19, 2014, at 12:58 AM, Gil Vernik  wrote:

> Hello,
> 
> I am new to the Spark community, so I apologize if I ask something 
> obvious.
> I follow the document about contribution to Spark where it's written that 
> I need to fork the https://github.com/apache/spark repository.
> 
> I got a little bit confused since  the repository 
> https://github.com/apache/spark  contains various branches: brach-0.5 - 
> branch-1.0,  master, scala-2.9, streaming
> 
> What is the branch with the latest most updated code for the next planned 
> release? I guess it's Spark 1.0, so should I work with  branch-1.0 or 
> master?
> Another question, what is Scala-2.9 and Streaming branches? 
> 
> Thanking you in advance,
> Gil Vernik.
> 
> 
> 
> 



BUG: graph.triplets does not return proper values

2014-05-19 Thread GlennStrycker
graph.triplets does not work -- it returns incorrect results

I have a graph with the following edges:

orig_graph.edges.collect
=  Array(Edge(1,4,1), Edge(1,5,1), Edge(1,7,1), Edge(2,5,1), Edge(2,6,1),
Edge(3,5,1), Edge(3,6,1), Edge(3,7,1), Edge(4,1,1), Edge(5,1,1),
Edge(5,2,1), Edge(5,3,1), Edge(6,2,1), Edge(6,3,1), Edge(7,1,1),
Edge(7,3,1))

When I run triplets.collect, I only get the last edge repeated 16 times:

orig_graph.triplets.collect
= Array(((7,1),(3,1),1), ((7,1),(3,1),1), ((7,1),(3,1),1), ((7,1),(3,1),1),
((7,1),(3,1),1), ((7,1),(3,1),1), ((7,1),(3,1),1), ((7,1),(3,1),1),
((7,1),(3,1),1), ((7,1),(3,1),1), ((7,1),(3,1),1), ((7,1),(3,1),1),
((7,1),(3,1),1), ((7,1),(3,1),1), ((7,1),(3,1),1), ((7,1),(3,1),1))

I've also tried writing various map steps first before calling the triplet
function, but I get the same results as above.

Similarly, the example on the graphx programming guide page
(http://spark.apache.org/docs/0.9.0/graphx-programming-guide.html) is
incorrect.

val facts: RDD[String] =
  graph.triplets.map(triplet =>
triplet.srcAttr._1 + " is the " + triplet.attr + " of " +
triplet.dstAttr._1)

does not work, but

val facts: RDD[String] =
  graph.triplets.map(triplet =>
triplet.srcAttr + " is the " + triplet.attr + " of " + triplet.dstAttr)

does work, although the results are meaningless.  For my graph example, I
get the following line repeated 16 times:

1 is the 1 of 1



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/BUG-graph-triplets-does-not-return-proper-values-tp6693.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


Re: BUG: graph.triplets does not return proper values

2014-05-19 Thread Reynold Xin
This was an optimization that reuses a triplet object in GraphX, and when
you do a collect directly on triplets, the same object is returned.

It has been fixed in Spark 1.0 here:
https://issues.apache.org/jira/browse/SPARK-1188

To work around it in older versions of Spark, you can add a copy step,
e.g.

graph.triplets.map(_.copy()).collect()



On Mon, May 19, 2014 at 1:09 PM, GlennStrycker wrote:

> graph.triplets does not work -- it returns incorrect results
>
> I have a graph with the following edges:
>
> orig_graph.edges.collect
> =  Array(Edge(1,4,1), Edge(1,5,1), Edge(1,7,1), Edge(2,5,1), Edge(2,6,1),
> Edge(3,5,1), Edge(3,6,1), Edge(3,7,1), Edge(4,1,1), Edge(5,1,1),
> Edge(5,2,1), Edge(5,3,1), Edge(6,2,1), Edge(6,3,1), Edge(7,1,1),
> Edge(7,3,1))
>
> When I run triplets.collect, I only get the last edge repeated 16 times:
>
> orig_graph.triplets.collect
> = Array(((7,1),(3,1),1), ((7,1),(3,1),1), ((7,1),(3,1),1), ((7,1),(3,1),1),
> ((7,1),(3,1),1), ((7,1),(3,1),1), ((7,1),(3,1),1), ((7,1),(3,1),1),
> ((7,1),(3,1),1), ((7,1),(3,1),1), ((7,1),(3,1),1), ((7,1),(3,1),1),
> ((7,1),(3,1),1), ((7,1),(3,1),1), ((7,1),(3,1),1), ((7,1),(3,1),1))
>
> I've also tried writing various map steps first before calling the triplet
> function, but I get the same results as above.
>
> Similarly, the example on the graphx programming guide page
> (http://spark.apache.org/docs/0.9.0/graphx-programming-guide.html) is
> incorrect.
>
> val facts: RDD[String] =
>   graph.triplets.map(triplet =>
> triplet.srcAttr._1 + " is the " + triplet.attr + " of " +
> triplet.dstAttr._1)
>
> does not work, but
>
> val facts: RDD[String] =
>   graph.triplets.map(triplet =>
> triplet.srcAttr + " is the " + triplet.attr + " of " + triplet.dstAttr)
>
> does work, although the results are meaningless.  For my graph example, I
> get the following line repeated 16 times:
>
> 1 is the 1 of 1
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/BUG-graph-triplets-does-not-return-proper-values-tp6693.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>


Re: BUG: graph.triplets does not return proper values

2014-05-19 Thread GlennStrycker
Thanks, rxin, this worked!

I am having a similar problem with .reduce... do I need to insert .copy()
functions in that statement as well?

This part works:
orig_graph.edges.map(_.copy()).flatMap(edge => Seq(edge) ).map(edge =>
(Edge(edge.copy().srcId, edge.copy().dstId, edge.copy().attr), 1)).collect

=Array((Edge(1,4,1),1), (Edge(1,5,1),1), (Edge(1,7,1),1), (Edge(2,5,1),1),
(Edge(2,6,1),1), (Edge(3,5,1),1), (Edge(3,6,1),1), (Edge(3,7,1),1),
(Edge(4,1,1),1), (Edge(5,1,1),1), (Edge(5,2,1),1), (Edge(5,3,1),1),
(Edge(6,2,1),1), (Edge(6,3,1),1), (Edge(7,1,1),1), (Edge(7,3,1),1))

But when I try adding on a reduce statement, I only get one element, not 16:
orig_graph.edges.map(_.copy()).flatMap(edge => Seq(edge) ).map(edge =>
(Edge(edge.copy().srcId, edge.copy().dstId, edge.copy().attr), 1)).reduce(
(A,B) => { if (A._1.dstId == B._1.srcId) (Edge(A._1.srcId, B._1.dstId, 2),
1) else if (A._1.srcId == B._1.dstId) (Edge(B._1.srcId, A._1.dstId, 2), 1)
else (Edge(0, 0, 0), 0) } )

=(Edge(0,0,0),0)



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/BUG-graph-triplets-does-not-return-proper-values-tp6693p6695.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


Re: BUG: graph.triplets does not return proper values

2014-05-19 Thread Reynold Xin
Yea, unfortunately you need that as well. Once 1.0 is released, you won't
need to do that anymore.

BTW - you can also just check out the source code from GitHub to build 1.0.
The current branch-1.0 branch is already at release candidate status, so it
should be almost identical to the actual 1.0 release.

https://github.com/apache/spark/tree/branch-1.0


On Mon, May 19, 2014 at 3:16 PM, GlennStrycker wrote:

> Thanks, rxin, this worked!
>
> I am having a similar problem with .reduce... do I need to insert .copy()
> functions in that statement as well?
>
> This part works:
> orig_graph.edges.map(_.copy()).flatMap(edge => Seq(edge) ).map(edge =>
> (Edge(edge.copy().srcId, edge.copy().dstId, edge.copy().attr), 1)).collect
>
> =Array((Edge(1,4,1),1), (Edge(1,5,1),1), (Edge(1,7,1),1), (Edge(2,5,1),1),
> (Edge(2,6,1),1), (Edge(3,5,1),1), (Edge(3,6,1),1), (Edge(3,7,1),1),
> (Edge(4,1,1),1), (Edge(5,1,1),1), (Edge(5,2,1),1), (Edge(5,3,1),1),
> (Edge(6,2,1),1), (Edge(6,3,1),1), (Edge(7,1,1),1), (Edge(7,3,1),1))
>
> But when I try adding on a reduce statement, I only get one element, not
> 16:
> orig_graph.edges.map(_.copy()).flatMap(edge => Seq(edge) ).map(edge =>
> (Edge(edge.copy().srcId, edge.copy().dstId, edge.copy().attr), 1)).reduce(
> (A,B) => { if (A._1.dstId == B._1.srcId) (Edge(A._1.srcId, B._1.dstId, 2),
> 1) else if (A._1.srcId == B._1.dstId) (Edge(B._1.srcId, A._1.dstId, 2), 1)
> else (Edge(0, 0, 0), 0) } )
>
> =(Edge(0,0,0),0)
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/BUG-graph-triplets-does-not-return-proper-values-tp6693p6695.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>


Re: BUG: graph.triplets does not return proper values

2014-05-19 Thread GlennStrycker
I tried adding .copy() everywhere, but still only get one element returned,
not even an RDD object.

orig_graph.edges.map(_.copy()).flatMap(edge => Seq(edge) ).map(edge =>
(Edge(edge.copy().srcId, edge.copy().dstId, edge.copy().attr), 1)).reduce(
(A,B) => { if (A._1.copy().dstId == B._1.copy().srcId)
(Edge(A._1.copy().srcId, B._1.copy().dstId, 2), 1) else if
(A._1.copy().srcId == B._1.copy().dstId) (Edge(B._1.copy().srcId,
A._1.copy().dstId, 2), 1) else (Edge(0, 0, 3), 1) } )

= (Edge(0,0,3),1)

I'll try getting a fresh copy of the Spark 1.0 code and see if I can get it
to work.  Thanks for your help!!



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/BUG-graph-triplets-does-not-return-proper-values-tp6693p6697.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


spark 1.0 standalone application

2014-05-19 Thread nit
I am not very comfortable with sbt. I want to build a standalone application
using Spark 1.0 RC9. I can build an sbt assembly for my application with Spark
0.9.1, and I think in that case Spark is pulled from the Akka repository?

Now if I want to use 1.0 RC9 for my application, what is the process?
(FYI, I was able to build spark-1.0 via sbt/assembly and I can see the
assembly jar; I think I will have to copy my jar somewhere and update
build.sbt?)

PS: I am not sure if this is the right place for this question, but since
1.0 is still an RC, I felt this may be the appropriate forum.

Thanks!



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/spark-1-0-standalone-application-tp6698.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


spark and impala, which is a better fit for MPP

2014-05-19 Thread liuguodong
Hi all,

My question is: between Spark and Impala, which is a better fit for MPP?
The motivation is the following case:
1. Three big tables need to be joined (about 100 fields per table, more than
1 TB per table).
2. Besides the tables above, they will very likely also need to be joined
with n other small or medium tables.
3. All join operations will be ad hoc, and they need a very quick response.

Looking forward to your help!

Best Regards




liuguodong

Re: spark 1.0 standalone application

2014-05-19 Thread Nan Zhu
Well, you have to put spark-assembly-*.jar into the lib directory of your
application.

Best, 

-- 
Nan Zhu


On Monday, May 19, 2014 at 9:48 PM, nit wrote:

> I am not much comfortable with sbt. I want to build a standalone application
> using spark 1.0 RC9. I can build sbt assembly for my application with Spark
> 0.9.1, and I think in that case spark is pulled from Aka Repository?
> 
> Now if I want to use 1.0 RC9 for my application; what is the process ?
> (FYI, I was able to build spark-1.0 via sbt/assembly and I can see
> sbt-assembly jar; and I think I will have to copy my jar somewhere? and
> update build.sbt?)
> 
> PS: I am not sure if this is the right place for this question; but since
> 1.0 is still RC, I felt that this may be appropriate forum.
> 
> thank! 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/spark-1-0-standalone-application-tp6698.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com 
> (http://Nabble.com).
> 
> 




Re: [VOTE] Release Apache Spark 1.0.0 (rc9)

2014-05-19 Thread Nan Zhu
just reran my test on rc5

everything works

built applications with sbt and the spark-*.jar which is compiled with Hadoop
2.3

+1 

-- 
Nan Zhu


On Sunday, May 18, 2014 at 11:07 PM, witgo wrote:

> How to reproduce this bug?
> 
> 
> -- Original --
> From: "Patrick Wendell";mailto:pwend...@gmail.com)>;
> Date: Mon, May 19, 2014 10:08 AM
> To: "dev@spark.apache.org (mailto:dev@spark.apache.org)" (mailto:dev@spark.apache.org)>; 
> Cc: "Tom Graves"mailto:tgraves...@yahoo.com)>; 
> Subject: Re: [VOTE] Release Apache Spark 1.0.0 (rc9)
> 
> 
> 
> Hey Matei - the issue you found is not related to security. This patch
> a few days ago broke builds for Hadoop 1 with YARN support enabled.
> The patch directly altered the way we deal with commons-lang
> dependency, which is what is at the base of this stack trace.
> 
> https://github.com/apache/spark/pull/754
> 
> - Patrick
> 
> On Sun, May 18, 2014 at 5:28 PM, Matei Zaharia  (mailto:matei.zaha...@gmail.com)> wrote:
> > Alright, I've opened https://github.com/apache/spark/pull/819 with the 
> > Windows fixes. I also found one other likely bug, 
> > https://issues.apache.org/jira/browse/SPARK-1875, in the binary packages 
> > for Hadoop1 built in this RC. I think this is due to Hadoop 1's security 
> > code depending on a different version of org.apache.commons than Hadoop 2, 
> > but it needs investigation. Tom, any thoughts on this?
> > 
> > Matei
> > 
> > On May 18, 2014, at 12:33 PM, Matei Zaharia  > (mailto:matei.zaha...@gmail.com)> wrote:
> > 
> > > I took the always fun task of testing it on Windows, and unfortunately, I 
> > > found some small problems with the prebuilt packages due to recent 
> > > changes to the launch scripts: bin/spark-class2.cmd looks in ./jars 
> > > instead of ./lib for the assembly JAR, and bin/run-example2.cmd doesn't 
> > > quite match the master-setting behavior of the Unix based one. I'll send 
> > > a pull request to fix them soon.
> > > 
> > > Matei
> > > 
> > > 
> > > On May 17, 2014, at 11:32 AM, Sandy Ryza  > > (mailto:sandy.r...@cloudera.com)> wrote:
> > > 
> > > > +1
> > > > 
> > > > Reran my tests from rc5:
> > > > 
> > > > * Built the release from source.
> > > > * Compiled Java and Scala apps that interact with HDFS against it.
> > > > * Ran them in local mode.
> > > > * Ran them against a pseudo-distributed YARN cluster in both yarn-client
> > > > mode and yarn-cluster mode.
> > > > 
> > > > 
> > > > On Sat, May 17, 2014 at 10:08 AM, Andrew Or  > > > (mailto:and...@databricks.com)> wrote:
> > > > 
> > > > > +1
> > > > > 
> > > > > 
> > > > > 2014-05-17 8:53 GMT-07:00 Mark Hamstra  > > > > (mailto:m...@clearstorydata.com)>:
> > > > > 
> > > > > > +1
> > > > > > 
> > > > > > 
> > > > > > On Sat, May 17, 2014 at 12:58 AM, Patrick Wendell 
> > > > > > mailto:pwend...@gmail.com)
> > > > > > > wrote:
> > > > > > 
> > > > > > 
> > > > > > > I'll start the voting with a +1.
> > > > > > > 
> > > > > > > On Sat, May 17, 2014 at 12:58 AM, Patrick Wendell 
> > > > > > > mailto:pwend...@gmail.com)>
> > > > > > > wrote:
> > > > > > > > Please vote on releasing the following candidate as Apache Spark
> > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > > version
> > > > > > > 1.0.0!
> > > > > > > > This has one bug fix and one minor feature on top of rc8:
> > > > > > > > SPARK-1864: https://github.com/apache/spark/pull/808
> > > > > > > > SPARK-1808: https://github.com/apache/spark/pull/799
> > > > > > > > 
> > > > > > > > The tag to be voted on is v1.0.0-rc9 (commit 920f947):
> > > > > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=920f947eb5a22a679c0c3186cf69ee75f6041c75
> > > > > > > > 
> > > > > > > > The release files, including signatures, digests, etc. can be 
> > > > > > > > found
> > > > > at:
> > > > > > > > http://people.apache.org/~pwendell/spark-1.0.0-rc9/
> > > > > > > > 
> > > > > > > > Release artifacts are signed with the following key:
> > > > > > > > https://people.apache.org/keys/committer/pwendell.asc
> > > > > > > > 
> > > > > > > > The staging repository for this release can be found at:
> > > > > > https://repository.apache.org/content/repositories/orgapachespark-1017/
> > > > > > > > 
> > > > > > > > The documentation corresponding to this release can be found at:
> > > > > > > > http://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/
> > > > > > > > 
> > > > > > > > Please vote on releasing this package as Apache Spark 1.0.0!
> > > > > > > > 
> > > > > > > > The vote is open until Tuesday, May 20, at 08:56 UTC and passes 
> > > > > > > > if
> > > > > > > > a majority of at least 3 +1 PMC votes are cast.
> > > > > > > > 
> > > > > > > > [ ] +1 Release this package as Apache Spark 1.0.0
> > > > > > > > [ ] -1 Do not release this package because ...
> > > > > > > > 
> > > > > > > > To learn more about Apache Spark, please see
> > > > > > > > http://spark.apache.org/
> > > > > > > > 
> > > > > > > > == API Changes ==
> > > > > > > > We welcome users to compile Spa

Re: spark 1.0 standalone application

2014-05-19 Thread Mark Hamstra
That's the crude way to do it.  If you run `sbt/sbt publishLocal`, then you
can resolve the artifact from your local cache in the same way that you
would resolve it if it were deployed to a remote cache.  That's just the
build step.  Actually running the application will require the necessary
jars to be accessible by the cluster nodes.


On Mon, May 19, 2014 at 7:04 PM, Nan Zhu  wrote:

> en, you have to put spark-assembly-*.jar to the lib directory of your
> application
>
> Best,
>
> --
> Nan Zhu
>
>
> On Monday, May 19, 2014 at 9:48 PM, nit wrote:
>
> > I am not much comfortable with sbt. I want to build a standalone
> application
> > using spark 1.0 RC9. I can build sbt assembly for my application with
> Spark
> > 0.9.1, and I think in that case spark is pulled from Aka Repository?
> >
> > Now if I want to use 1.0 RC9 for my application; what is the process ?
> > (FYI, I was able to build spark-1.0 via sbt/assembly and I can see
> > sbt-assembly jar; and I think I will have to copy my jar somewhere? and
> > update build.sbt?)
> >
> > PS: I am not sure if this is the right place for this question; but since
> > 1.0 is still RC, I felt that this may be appropriate forum.
> >
> > thank!
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/spark-1-0-standalone-application-tp6698.html
> > Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com (http://Nabble.com).
> >
> >
>
>
>


Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-19 Thread Patrick Wendell
Having a user define a custom class inside of an added jar and
instantiate it directly inside of an executor is definitely supported
in Spark and has been for a really long time (several years). This is
something we do all the time in Spark.

DB - I'd hold off on a re-architecting of this until we identify
exactly what is causing the bug you are running into.

In a nutshell, when the bytecode "new Foo()" is run on the executor,
it will ask the driver for the class over HTTP using a custom
classloader. Something in that pipeline is breaking here, possibly
related to the YARN deployment stuff.
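
As a minimal illustration of that pipeline (not Spark's actual implementation),
a plain URLClassLoader can already resolve classes from a jar served over HTTP;
the host, port, jar name, and class name below are hypothetical.

---
import java.net.{URL, URLClassLoader}

// Hypothetical driver-hosted jar; Spark's real executor-side loader is more involved.
val remoteJar = new URL("http://driver-host:33000/jars/myapp.jar")
val loader = new URLClassLoader(Array(remoteJar), getClass.getClassLoader)

// Roughly what happens when executor bytecode hits "new Foo()": the class
// bytes are fetched over HTTP and defined by the loader above.
val fooClass = loader.loadClass("com.example.Foo")
val foo = fooClass.newInstance()
---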


On Mon, May 19, 2014 at 12:29 AM, Sean Owen  wrote:
> I don't think a customer classloader is necessary.
>
> Well, it occurs to me that this is no new problem. Hadoop, Tomcat, etc
> all run custom user code that creates new user objects without
> reflection. I should go see how that's done. Maybe it's totally valid
> to set the thread's context classloader for just this purpose, and I
> am not thinking clearly.
>
> On Mon, May 19, 2014 at 8:26 AM, Andrew Ash  wrote:
>> Sounds like the problem is that classloaders always look in their parents
>> before themselves, and Spark users want executors to pick up classes from
>> their custom code before the ones in Spark plus its dependencies.
>>
>> Would a custom classloader that delegates to the parent after first
>> checking itself fix this up?
>>
>>
>> On Mon, May 19, 2014 at 12:17 AM, DB Tsai  wrote:
>>
>>> Hi Sean,
>>>
>>> It's true that the issue here is classloader, and due to the classloader
>>> delegation model, users have to use reflection in the executors to pick up
>>> the classloader in order to use those classes added by sc.addJars APIs.
>>> However, it's very inconvenience for users, and not documented in spark.
>>>
>>> I'm working on a patch to solve it by calling the protected method addURL
>>> in URLClassLoader to update the current default classloader, so no
>>> customClassLoader anymore. I wonder if this is an good way to go.
>>>
>>>   private def addURL(url: URL, loader: URLClassLoader){
>>> try {
>>>   val method: Method =
>>> classOf[URLClassLoader].getDeclaredMethod("addURL", classOf[URL])
>>>   method.setAccessible(true)
>>>   method.invoke(loader, url)
>>> }
>>> catch {
>>>   case t: Throwable => {
>>> throw new IOException("Error, could not add URL to system
>>> classloader")
>>>   }
>>> }
>>>   }
>>>
>>>
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> ---
>>> My Blog: https://www.dbtsai.com
>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>
>>>
>>> On Sun, May 18, 2014 at 11:57 PM, Sean Owen  wrote:
>>>
>>> > I might be stating the obvious for everyone, but the issue here is not
>>> > reflection or the source of the JAR, but the ClassLoader. The basic
>>> > rules are this.
>>> >
>>> > "new Foo" will use the ClassLoader that defines Foo. This is usually
>>> > the ClassLoader that loaded whatever it is that first referenced Foo
>>> > and caused it to be loaded -- usually the ClassLoader holding your
>>> > other app classes.
>>> >
>>> > ClassLoaders can have a parent-child relationship. ClassLoaders always
>>> > look in their parent before themselves.
>>> >
>>> > (Careful then -- in contexts like Hadoop or Tomcat where your app is
>>> > loaded in a child ClassLoader, and you reference a class that Hadoop
>>> > or Tomcat also has (like a lib class) you will get the container's
>>> > version!)
>>> >
>>> > When you load an external JAR it has a separate ClassLoader which does
>>> > not necessarily bear any relation to the one containing your app
>>> > classes, so yeah it is not generally going to make "new Foo" work.
>>> >
>>> > Reflection lets you pick the ClassLoader, yes.
>>> >
>>> > I would not call setContextClassLoader.
>>> >
>>> > On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza 
>>> > wrote:
>>> > > I spoke with DB offline about this a little while ago and he confirmed
>>> > that
>>> > > he was able to access the jar from the driver.
>>> > >
>>> > > The issue appears to be a general Java issue: you can't directly
>>> > > instantiate a class from a dynamically loaded jar.
>>> > >
>>> > > I reproduced it locally outside of Spark with:
>>> > > ---
>>> > > URLClassLoader urlClassLoader = new URLClassLoader(new URL[] { new
>>> > > File("myotherjar.jar").toURI().toURL() }, null);
>>> > > Thread.currentThread().setContextClassLoader(urlClassLoader);
>>> > > MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar();
>>> > > ---
>>> > >
>>> > > I was able to load the class with reflection.
>>> >
>>>


Re: spark 1.0 standalone application

2014-05-19 Thread Patrick Wendell
Whenever we publish a release candidate, we create a temporary Maven
repository that hosts the artifacts. We do this precisely for the case
you are running into (where a user wants to build an application
against it to test).

You can build against the release candidate by just adding that
repository in your sbt build, then linking against "spark-core"
version "1.0.0". For rc9 the repository is in the vote e-mail:

http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-0-0-rc9-td6629.html
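
Concretely, a minimal build.sbt along the lines Patrick describes might look
like the sketch below; it uses the rc9 staging repository URL quoted later in
this digest, and the Scala version is an assumption.

---
// Minimal sketch of an application build against the 1.0.0 release candidate.
name := "my-spark-app"

scalaVersion := "2.10.4"  // assumed; Spark 1.0 is built against Scala 2.10

// Temporary staging repository for the rc9 artifacts (from the vote thread).
resolvers += "Apache Spark 1.0.0 rc9 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1017/"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"
---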

On Mon, May 19, 2014 at 7:03 PM, Mark Hamstra  wrote:
> That's the crude way to do it.  If you run `sbt/sbt publishLocal`, then you
> can resolve the artifact from your local cache in the same way that you
> would resolve it if it were deployed to a remote cache.  That's just the
> build step.  Actually running the application will require the necessary
> jars to be accessible by the cluster nodes.
>
>
> On Mon, May 19, 2014 at 7:04 PM, Nan Zhu  wrote:
>
>> en, you have to put spark-assembly-*.jar to the lib directory of your
>> application
>>
>> Best,
>>
>> --
>> Nan Zhu
>>
>>
>> On Monday, May 19, 2014 at 9:48 PM, nit wrote:
>>
>> > I am not much comfortable with sbt. I want to build a standalone
>> application
>> > using spark 1.0 RC9. I can build sbt assembly for my application with
>> Spark
>> > 0.9.1, and I think in that case spark is pulled from Aka Repository?
>> >
>> > Now if I want to use 1.0 RC9 for my application; what is the process ?
>> > (FYI, I was able to build spark-1.0 via sbt/assembly and I can see
>> > sbt-assembly jar; and I think I will have to copy my jar somewhere? and
>> > update build.sbt?)
>> >
>> > PS: I am not sure if this is the right place for this question; but since
>> > 1.0 is still RC, I felt that this may be appropriate forum.
>> >
>> > thank!
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/spark-1-0-standalone-application-tp6698.html
>> > Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com (http://Nabble.com).
>> >
>> >
>>
>>
>>


Re: spark 1.0 standalone application

2014-05-19 Thread Sujeet Varakhedi
Threads like these are great candidates to be part of the "Contributors
guide". I will create a JIRA to update the guide with material from past
threads like these.

Sujeet


On Mon, May 19, 2014 at 7:10 PM, Patrick Wendell  wrote:

> Whenever we publish a release candidate, we create a temporary maven
> repository that host the artifacts. We do this precisely for the case
> you are running into (where a user wants to build an application
> against it to test).
>
> You can build against the release candidate by just adding that
> repository in your sbt build, then linking against "spark-core"
> version "1.0.0". For rc9 the repository is in the vote e-mail:
>
>
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-0-0-rc9-td6629.html
>
> On Mon, May 19, 2014 at 7:03 PM, Mark Hamstra 
> wrote:
> > That's the crude way to do it.  If you run `sbt/sbt publishLocal`, then
> you
> > can resolve the artifact from your local cache in the same way that you
> > would resolve it if it were deployed to a remote cache.  That's just the
> > build step.  Actually running the application will require the necessary
> > jars to be accessible by the cluster nodes.
> >
> >
> > On Mon, May 19, 2014 at 7:04 PM, Nan Zhu  wrote:
> >
> >> en, you have to put spark-assembly-*.jar to the lib directory of your
> >> application
> >>
> >> Best,
> >>
> >> --
> >> Nan Zhu
> >>
> >>
> >> On Monday, May 19, 2014 at 9:48 PM, nit wrote:
> >>
> >> > I am not much comfortable with sbt. I want to build a standalone
> >> application
> >> > using spark 1.0 RC9. I can build sbt assembly for my application with
> >> Spark
> >> > 0.9.1, and I think in that case spark is pulled from Aka Repository?
> >> >
> >> > Now if I want to use 1.0 RC9 for my application; what is the process ?
> >> > (FYI, I was able to build spark-1.0 via sbt/assembly and I can see
> >> > sbt-assembly jar; and I think I will have to copy my jar somewhere?
> and
> >> > update build.sbt?)
> >> >
> >> > PS: I am not sure if this is the right place for this question; but
> since
> >> > 1.0 is still RC, I felt that this may be appropriate forum.
> >> >
> >> > thank!
> >> >
> >> >
> >> >
> >> > --
> >> > View this message in context:
> >>
> http://apache-spark-developers-list.1001551.n3.nabble.com/spark-1-0-standalone-application-tp6698.html
> >> > Sent from the Apache Spark Developers List mailing list archive at
> >> Nabble.com (http://Nabble.com).
> >> >
> >> >
> >>
> >>
> >>
>


Re: BUG: graph.triplets does not return proper values

2014-05-19 Thread Reynold Xin
reduce always returns a single element -- maybe you are misunderstanding what
the reduce function in collections does.
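
For a concrete picture, a tiny sketch unrelated to GraphX (it assumes a
SparkContext named sc): reduce folds the whole RDD down to one value with a
function that must be associative and commutative, so the result is always a
single element, never an RDD.

---
val nums = sc.parallelize(Seq(1, 2, 3, 4))

// The function is applied pairwise across all elements of the RDD.
val total = nums.reduce(_ + _)   // 10, a single Int

// A function that is not associative/commutative (like the Edge-matching one
// above) has no well-defined reduce result, which is likely why the fallback
// Edge(0, 0, 0) ends up dominating.
---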


On Mon, May 19, 2014 at 3:32 PM, GlennStrycker wrote:

> I tried adding .copy() everywhere, but still only get one element returned,
> not even an RDD object.
>
> orig_graph.edges.map(_.copy()).flatMap(edge => Seq(edge) ).map(edge =>
> (Edge(edge.copy().srcId, edge.copy().dstId, edge.copy().attr), 1)).reduce(
> (A,B) => { if (A._1.copy().dstId == B._1.copy().srcId)
> (Edge(A._1.copy().srcId, B._1.copy().dstId, 2), 1) else if
> (A._1.copy().srcId == B._1.copy().dstId) (Edge(B._1.copy().srcId,
> A._1.copy().dstId, 2), 1) else (Edge(0, 0, 3), 1) } )
>
> = (Edge(0,0,3),1)
>
> I'll try getting a fresh copy of the Spark 1.0 code and see if I can get it
> to work.  Thanks for your help!!
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/BUG-graph-triplets-does-not-return-proper-values-tp6693p6697.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>


Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-19 Thread Sandy Ryza
It just hit me why this problem is showing up on YARN and not on standalone.

The relevant difference between YARN and standalone is that, on YARN, the
app jar is loaded by the system classloader instead of Spark's custom URL
classloader.

On YARN, the system classloader knows about [the classes in the spark jars,
the classes in the primary app jar].   The custom classloader knows about
[the classes in secondary app jars] and has the system classloader as its
parent.

A few relevant facts (mostly redundant with what Sean pointed out):
* Every class has a classloader that loaded it.
* When an object of class B is instantiated inside of class A, the
classloader used for loading B is the classloader that was used for loading
A.
* When a classloader fails to load a class, it lets its parent classloader
try.  If its parent succeeds, its parent becomes the "classloader that
loaded it".

So suppose class B is in a secondary app jar and class A is in the primary
app jar:
1. The custom classloader will try to load class A.
2. It will fail, because it only knows about the secondary jars.
3. It will delegate to its parent, the system classloader.
4. The system classloader will succeed, because it knows about the primary
app jar.
5. A's classloader will be the system classloader.
6. A tries to instantiate an instance of class B.
7. B will be loaded with A's classloader, which is the system classloader.
8. Loading B will fail, because A's classloader, which is the system
classloader, doesn't know about the secondary app jars.

In Spark standalone, A and B are both loaded by the custom classloader, so
this issue doesn't come up.
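
A self-contained sketch of that failure mode, using two toy URLClassLoaders in
place of the YARN setup; the jar paths and class names are hypothetical.

---
import java.io.File
import java.net.URLClassLoader

// Parent ("system-like") loader knows only the primary app jar, which contains A.
val systemLike = new URLClassLoader(
  Array(new File("primary.jar").toURI.toURL), null)

// Child ("custom") loader knows only the secondary jar, which contains B.
val custom = new URLClassLoader(
  Array(new File("secondary.jar").toURI.toURL), systemLike)

// A ends up defined by the parent loader, because only the parent can find it.
val a = custom.loadClass("com.example.A")

// Inside A, `new B()` resolves B through A's defining loader (the parent),
// which cannot see secondary.jar, so this fails with NoClassDefFoundError.
a.newInstance()
---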

-Sandy

On Mon, May 19, 2014 at 7:07 PM, Patrick Wendell  wrote:

> Having a user add define a custom class inside of an added jar and
> instantiate it directly inside of an executor is definitely supported
> in Spark and has been for a really long time (several years). This is
> something we do all the time in Spark.
>
> DB - I'd hold off on a re-architecting of this until we identify
> exactly what is causing the bug you are running into.
>
> In a nutshell, when the bytecode "new Foo()" is run on the executor,
> it will ask the driver for the class over HTTP using a custom
> classloader. Something in that pipeline is breaking here, possibly
> related to the YARN deployment stuff.
>
>
> On Mon, May 19, 2014 at 12:29 AM, Sean Owen  wrote:
> > I don't think a customer classloader is necessary.
> >
> > Well, it occurs to me that this is no new problem. Hadoop, Tomcat, etc
> > all run custom user code that creates new user objects without
> > reflection. I should go see how that's done. Maybe it's totally valid
> > to set the thread's context classloader for just this purpose, and I
> > am not thinking clearly.
> >
> > On Mon, May 19, 2014 at 8:26 AM, Andrew Ash 
> wrote:
> >> Sounds like the problem is that classloaders always look in their
> parents
> >> before themselves, and Spark users want executors to pick up classes
> from
> >> their custom code before the ones in Spark plus its dependencies.
> >>
> >> Would a custom classloader that delegates to the parent after first
> >> checking itself fix this up?
> >>
> >>
> >> On Mon, May 19, 2014 at 12:17 AM, DB Tsai  wrote:
> >>
> >>> Hi Sean,
> >>>
> >>> It's true that the issue here is classloader, and due to the
> classloader
> >>> delegation model, users have to use reflection in the executors to
> pick up
> >>> the classloader in order to use those classes added by sc.addJars APIs.
> >>> However, it's very inconvenience for users, and not documented in
> spark.
> >>>
> >>> I'm working on a patch to solve it by calling the protected method
> addURL
> >>> in URLClassLoader to update the current default classloader, so no
> >>> customClassLoader anymore. I wonder if this is an good way to go.
> >>>
> >>>   private def addURL(url: URL, loader: URLClassLoader){
> >>> try {
> >>>   val method: Method =
> >>> classOf[URLClassLoader].getDeclaredMethod("addURL", classOf[URL])
> >>>   method.setAccessible(true)
> >>>   method.invoke(loader, url)
> >>> }
> >>> catch {
> >>>   case t: Throwable => {
> >>> throw new IOException("Error, could not add URL to system
> >>> classloader")
> >>>   }
> >>> }
> >>>   }
> >>>
> >>>
> >>>
> >>> Sincerely,
> >>>
> >>> DB Tsai
> >>> ---
> >>> My Blog: https://www.dbtsai.com
> >>> LinkedIn: https://www.linkedin.com/in/dbtsai
> >>>
> >>>
> >>> On Sun, May 18, 2014 at 11:57 PM, Sean Owen 
> wrote:
> >>>
> >>> > I might be stating the obvious for everyone, but the issue here is
> not
> >>> > reflection or the source of the JAR, but the ClassLoader. The basic
> >>> > rules are this.
> >>> >
> >>> > "new Foo" will use the ClassLoader that defines Foo. This is usually
> >>> > the ClassLoader that loaded whatever it is that first referenced Foo
> >>> > and caused it to be loaded -- usually the ClassLoader holding your
> >>> > other

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-19 Thread DB Tsai
Good summary! We fixed it in branch 0.9 since our production is still in
0.9. I'm porting it to 1.0 now, and hopefully will submit PR for 1.0
tonight.


Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Mon, May 19, 2014 at 7:38 PM, Sandy Ryza  wrote:

> It just hit me why this problem is showing up on YARN and not on
> standalone.
>
> The relevant difference between YARN and standalone is that, on YARN, the
> app jar is loaded by the system classloader instead of Spark's custom URL
> classloader.
>
> On YARN, the system classloader knows about [the classes in the spark jars,
> the classes in the primary app jar].   The custom classloader knows about
> [the classes in secondary app jars] and has the system classloader as its
> parent.
>
> A few relevant facts (mostly redundant with what Sean pointed out):
> * Every class has a classloader that loaded it.
> * When an object of class B is instantiated inside of class A, the
> classloader used for loading B is the classloader that was used for loading
> A.
> * When a classloader fails to load a class, it lets its parent classloader
> try.  If its parent succeeds, its parent becomes the "classloader that
> loaded it".
>
> So suppose class B is in a secondary app jar and class A is in the primary
> app jar:
> 1. The custom classloader will try to load class A.
> 2. It will fail, because it only knows about the secondary jars.
> 3. It will delegate to its parent, the system classloader.
> 4. The system classloader will succeed, because it knows about the primary
> app jar.
> 5. A's classloader will be the system classloader.
> 6. A tries to instantiate an instance of class B.
> 7. B will be loaded with A's classloader, which is the system classloader.
> 8. Loading B will fail, because A's classloader, which is the system
> classloader, doesn't know about the secondary app jars.
>
> In Spark standalone, A and B are both loaded by the custom classloader, so
> this issue doesn't come up.
>
> -Sandy
>
> On Mon, May 19, 2014 at 7:07 PM, Patrick Wendell 
> wrote:
>
> > Having a user add define a custom class inside of an added jar and
> > instantiate it directly inside of an executor is definitely supported
> > in Spark and has been for a really long time (several years). This is
> > something we do all the time in Spark.
> >
> > DB - I'd hold off on a re-architecting of this until we identify
> > exactly what is causing the bug you are running into.
> >
> > In a nutshell, when the bytecode "new Foo()" is run on the executor,
> > it will ask the driver for the class over HTTP using a custom
> > classloader. Something in that pipeline is breaking here, possibly
> > related to the YARN deployment stuff.
> >
> >
> > On Mon, May 19, 2014 at 12:29 AM, Sean Owen  wrote:
> > > I don't think a customer classloader is necessary.
> > >
> > > Well, it occurs to me that this is no new problem. Hadoop, Tomcat, etc
> > > all run custom user code that creates new user objects without
> > > reflection. I should go see how that's done. Maybe it's totally valid
> > > to set the thread's context classloader for just this purpose, and I
> > > am not thinking clearly.
> > >
> > > On Mon, May 19, 2014 at 8:26 AM, Andrew Ash 
> > wrote:
> > >> Sounds like the problem is that classloaders always look in their
> > parents
> > >> before themselves, and Spark users want executors to pick up classes
> > from
> > >> their custom code before the ones in Spark plus its dependencies.
> > >>
> > >> Would a custom classloader that delegates to the parent after first
> > >> checking itself fix this up?
> > >>
> > >>
> > >> On Mon, May 19, 2014 at 12:17 AM, DB Tsai 
> wrote:
> > >>
> > >>> Hi Sean,
> > >>>
> > >>> It's true that the issue here is classloader, and due to the
> > classloader
> > >>> delegation model, users have to use reflection in the executors to
> > pick up
> > >>> the classloader in order to use those classes added by sc.addJars
> APIs.
> > >>> However, it's very inconvenience for users, and not documented in
> > spark.
> > >>>
> > >>> I'm working on a patch to solve it by calling the protected method
> > addURL
> > >>> in URLClassLoader to update the current default classloader, so no
> > >>> customClassLoader anymore. I wonder if this is an good way to go.
> > >>>
> > >>>   private def addURL(url: URL, loader: URLClassLoader){
> > >>> try {
> > >>>   val method: Method =
> > >>> classOf[URLClassLoader].getDeclaredMethod("addURL", classOf[URL])
> > >>>   method.setAccessible(true)
> > >>>   method.invoke(loader, url)
> > >>> }
> > >>> catch {
> > >>>   case t: Throwable => {
> > >>> throw new IOException("Error, could not add URL to system
> > >>> classloader")
> > >>>   }
> > >>> }
> > >>>   }
> > >>>
> > >>>
> > >>>
> > >>> Sincerely,
> > >>>
> > >>> DB Tsai
> > >>> ---
> > >>> My 

Re: TorrentBroadcast aka Cornet?

2014-05-19 Thread Andrew Ash
Thanks for the info Matei.

Andrew


On Mon, May 19, 2014 at 12:38 AM, Matei Zaharia wrote:

> TorrentBroadcast is actually slightly simpler, but it’s based on that. It
> has similar performance. I’d like to make it the default in a future
> version, we just haven’t had a ton of testing with it yet (kind of an
> oversight in this release unfortunately).
>
> Matei
>
> On May 19, 2014, at 12:07 AM, Andrew Ash  wrote:
>
> > Hi Spark devs,
> >
> > Is the algorithm for TorrentBroadcast
> > (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala)
> > the same as Cornet from the below paper?
> >
> > http://www.mosharaf.com/wp-content/uploads/orchestra-sigcomm11.pdf
> >
> > If so it would be nice to include a link to the paper in the Javadoc for
> > the class.
> >
> > Thanks!
> > Andrew
>
>


Re: spark 1.0 standalone application

2014-05-19 Thread nit
Thanks everyone. I followed Patrick's suggestion and it worked like a charm.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/spark-1-0-standalone-application-tp6698p6710.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-19 Thread DB Tsai
In 1.0, there is a new option, spark.files.userClassPathFirst, that lets users
choose which classloader has higher priority, but I decided to submit the PR
for 0.9 first. We use this patch in our lab, and with it we can use the jars
added by sc.addJar without reflection.
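
For context, a minimal sketch of how an application could opt in on 1.0,
assuming the option is exposed as the boolean property
spark.files.userClassPathFirst:

---
import org.apache.spark.{SparkConf, SparkContext}

// Assumed property name; gives user-added jars priority over Spark's own classpath.
val conf = new SparkConf()
  .setAppName("user-classpath-first")
  .set("spark.files.userClassPathFirst", "true")
val sc = new SparkContext(conf)

// Classes from this jar should now be usable with plain `new`, no reflection.
sc.addJar("/path/to/my-extra.jar")  // hypothetical path
---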

https://github.com/apache/spark/pull/834

Can anyone comment if it's a good approach?

Thanks.


Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Mon, May 19, 2014 at 7:42 PM, DB Tsai  wrote:

> Good summary! We fixed it in branch 0.9 since our production is still in
> 0.9. I'm porting it to 1.0 now, and hopefully will submit PR for 1.0
> tonight.
>
>
> Sincerely,
>
> DB Tsai
> ---
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Mon, May 19, 2014 at 7:38 PM, Sandy Ryza wrote:
>
>> It just hit me why this problem is showing up on YARN and not on
>> standalone.
>>
>> The relevant difference between YARN and standalone is that, on YARN, the
>> app jar is loaded by the system classloader instead of Spark's custom URL
>> classloader.
>>
>> On YARN, the system classloader knows about [the classes in the spark
>> jars,
>> the classes in the primary app jar].   The custom classloader knows about
>> [the classes in secondary app jars] and has the system classloader as its
>> parent.
>>
>> A few relevant facts (mostly redundant with what Sean pointed out):
>> * Every class has a classloader that loaded it.
>> * When an object of class B is instantiated inside of class A, the
>> classloader used for loading B is the classloader that was used for
>> loading
>> A.
>> * When a classloader fails to load a class, it lets its parent classloader
>> try.  If its parent succeeds, its parent becomes the "classloader that
>> loaded it".
>>
>> So suppose class B is in a secondary app jar and class A is in the primary
>> app jar:
>> 1. The custom classloader will try to load class A.
>> 2. It will fail, because it only knows about the secondary jars.
>> 3. It will delegate to its parent, the system classloader.
>> 4. The system classloader will succeed, because it knows about the primary
>> app jar.
>> 5. A's classloader will be the system classloader.
>> 6. A tries to instantiate an instance of class B.
>> 7. B will be loaded with A's classloader, which is the system classloader.
>> 8. Loading B will fail, because A's classloader, which is the system
>> classloader, doesn't know about the secondary app jars.
>>
>> In Spark standalone, A and B are both loaded by the custom classloader, so
>> this issue doesn't come up.
>>
>> -Sandy
>>
>> On Mon, May 19, 2014 at 7:07 PM, Patrick Wendell 
>> wrote:
>>
>> > Having a user add define a custom class inside of an added jar and
>> > instantiate it directly inside of an executor is definitely supported
>> > in Spark and has been for a really long time (several years). This is
>> > something we do all the time in Spark.
>> >
>> > DB - I'd hold off on a re-architecting of this until we identify
>> > exactly what is causing the bug you are running into.
>> >
>> > In a nutshell, when the bytecode "new Foo()" is run on the executor,
>> > it will ask the driver for the class over HTTP using a custom
>> > classloader. Something in that pipeline is breaking here, possibly
>> > related to the YARN deployment stuff.
>> >
>> >
>> > On Mon, May 19, 2014 at 12:29 AM, Sean Owen  wrote:
>> > > I don't think a customer classloader is necessary.
>> > >
>> > > Well, it occurs to me that this is no new problem. Hadoop, Tomcat, etc
>> > > all run custom user code that creates new user objects without
>> > > reflection. I should go see how that's done. Maybe it's totally valid
>> > > to set the thread's context classloader for just this purpose, and I
>> > > am not thinking clearly.
>> > >
>> > > On Mon, May 19, 2014 at 8:26 AM, Andrew Ash 
>> > wrote:
>> > >> Sounds like the problem is that classloaders always look in their
>> > parents
>> > >> before themselves, and Spark users want executors to pick up classes
>> > from
>> > >> their custom code before the ones in Spark plus its dependencies.
>> > >>
>> > >> Would a custom classloader that delegates to the parent after first
>> > >> checking itself fix this up?
>> > >>
>> > >>
>> > >> On Mon, May 19, 2014 at 12:17 AM, DB Tsai 
>> wrote:
>> > >>
>> > >>> Hi Sean,
>> > >>>
>> > >>> It's true that the issue here is classloader, and due to the
>> > classloader
>> > >>> delegation model, users have to use reflection in the executors to
>> > pick up
>> > >>> the classloader in order to use those classes added by sc.addJars
>> APIs.
>> > >>> However, it's very inconvenience for users, and not documented in
>> > spark.
>> > >>>
>> > >>> I'm working on a patch to solve it by calling the protected method
>> > addURL
>> > >>> in URLClassLoader to update the current default classloader, so no
>> > >>> customClass

Re: [VOTE] Release Apache Spark 1.0.0 (rc9)

2014-05-19 Thread Patrick Wendell
We're cancelling this RC in favor of rc10. There were two blockers: an
issue with Windows run scripts and an issue with the packaging for
Hadoop 1 when hive support is bundled.

https://issues.apache.org/jira/browse/SPARK-1875
https://issues.apache.org/jira/browse/SPARK-1876

Thanks everyone for the testing. TD will be cutting rc10, since I'm
travelling this week (thanks TD!).

- Patrick

On Mon, May 19, 2014 at 7:06 PM, Nan Zhu  wrote:
> just rerun my test on rc5
>
> everything works
>
> build applications with sbt and the spark-*.jar which is compiled with Hadoop 
> 2.3
>
> +1
>
> --
> Nan Zhu
>
>
> On Sunday, May 18, 2014 at 11:07 PM, witgo wrote:
>
>> How to reproduce this bug?
>>
>>
>> -- Original --
>> From: "Patrick Wendell" <pwend...@gmail.com>
>> Date: Mon, May 19, 2014 10:08 AM
>> To: "dev@spark.apache.org" <dev@spark.apache.org>
>> Cc: "Tom Graves" <tgraves...@yahoo.com>
>> Subject: Re: [VOTE] Release Apache Spark 1.0.0 (rc9)
>>
>>
>>
>> Hey Matei - the issue you found is not related to security. This patch
>> from a few days ago broke builds for Hadoop 1 with YARN support enabled.
>> The patch directly altered the way we deal with commons-lang
>> dependency, which is what is at the base of this stack trace.
>>
>> https://github.com/apache/spark/pull/754
>>
>> - Patrick
>>
>> On Sun, May 18, 2014 at 5:28 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>> > Alright, I've opened https://github.com/apache/spark/pull/819 with the 
>> > Windows fixes. I also found one other likely bug, 
>> > https://issues.apache.org/jira/browse/SPARK-1875, in the binary packages 
>> > for Hadoop1 built in this RC. I think this is due to Hadoop 1's security 
>> > code depending on a different version of org.apache.commons than Hadoop 2, 
>> > but it needs investigation. Tom, any thoughts on this?
>> >
>> > Matei
>> >
>> > On May 18, 2014, at 12:33 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>> >
>> > > I took the always fun task of testing it on Windows, and unfortunately, 
>> > > I found some small problems with the prebuilt packages due to recent 
>> > > changes to the launch scripts: bin/spark-class2.cmd looks in ./jars 
>> > > instead of ./lib for the assembly JAR, and bin/run-example2.cmd doesn't 
>> > > quite match the master-setting behavior of the Unix based one. I'll send 
>> > > a pull request to fix them soon.
>> > >
>> > > Matei
>> > >
>> > >
>> > > On May 17, 2014, at 11:32 AM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>> > >
>> > > > +1
>> > > >
>> > > > Reran my tests from rc5:
>> > > >
>> > > > * Built the release from source.
>> > > > * Compiled Java and Scala apps that interact with HDFS against it.
>> > > > * Ran them in local mode.
>> > > > * Ran them against a pseudo-distributed YARN cluster in both 
>> > > > yarn-client
>> > > > mode and yarn-cluster mode.
>> > > >
>> > > >
>> > > > > On Sat, May 17, 2014 at 10:08 AM, Andrew Or <and...@databricks.com> wrote:
>> > > >
>> > > > > +1
>> > > > >
>> > > > >
>> > > > > > 2014-05-17 8:53 GMT-07:00 Mark Hamstra <m...@clearstorydata.com>:
>> > > > >
>> > > > > > +1
>> > > > > >
>> > > > > >
>> > > > > > On Sat, May 17, 2014 at 12:58 AM, Patrick Wendell <pwend...@gmail.com> wrote:
>> > > > > >
>> > > > > >
>> > > > > > > I'll start the voting with a +1.
>> > > > > > >
>> > > > > > > On Sat, May 17, 2014 at 12:58 AM, Patrick Wendell <pwend...@gmail.com> wrote:
>> > > > > > > > Please vote on releasing the following candidate as Apache 
>> > > > > > > > Spark
>> > > > > > >
>> > > > > > >
>> > > > > >
>> > > > > > version
>> > > > > > > 1.0.0!
>> > > > > > > > This has one bug fix and one minor feature on top of rc8:
>> > > > > > > > SPARK-1864: https://github.com/apache/spark/pull/808
>> > > > > > > > SPARK-1808: https://github.com/apache/spark/pull/799
>> > > > > > > >
>> > > > > > > > The tag to be voted on is v1.0.0-rc9 (commit 920f947):
>> > > > > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=920f947eb5a22a679c0c3186cf69ee75f6041c75
>> > > > > > > >
>> > > > > > > > The release files, including signatures, digests, etc. can be 
>> > > > > > > > found
>> > > > > at:
>> > > > > > > > http://people.apache.org/~pwendell/spark-1.0.0-rc9/
>> > > > > > > >
>> > > > > > > > Release artifacts are signed with the following key:
>> > > > > > > > https://people.apache.org/keys/committer/pwendell.asc
>> > > > > > > >
>> > > > > > > > The staging repository for this release can be found at:
>> > > > > > https://repository.apache.org/content/repositories/orgapachespark-1017/
>> > > > > > > >
>> > > > > > > > The documentation corresponding to this release can be found 
>> > > > > > > > at:
>> > > > > > > > http://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/
>> > > > > > > >
>> > > > > > > > Please vote on releasi

Re: spark 1.0 standalone application

2014-05-19 Thread Nan Zhu
This is the first time I've heard of a temporary Maven repository…….

--  
Nan Zhu


On Monday, May 19, 2014 at 10:10 PM, Patrick Wendell wrote:

> Whenever we publish a release candidate, we create a temporary maven
> repository that hosts the artifacts. We do this precisely for the case
> you are running into (where a user wants to build an application
> against it to test).
>  
> You can build against the release candidate by just adding that
> repository in your sbt build, then linking against "spark-core"
> version "1.0.0". For rc9 the repository is in the vote e-mail:
>  
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-0-0-rc9-td6629.html
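For example, a minimal build.sbt along these lines should work (the resolver URL below is the rc9 staging repository from the vote thread; swap it for whichever RC's repository you are testing):

// Sketch of a build.sbt for compiling against the 1.0.0 rc9 artifacts.
scalaVersion := "2.10.4"

resolvers += "Spark 1.0.0 RC staging" at "https://repository.apache.org/content/repositories/orgapachespark-1017/"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"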
>  
> On Mon, May 19, 2014 at 7:03 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
> > That's the crude way to do it. If you run `sbt/sbt publishLocal`, then you
> > can resolve the artifact from your local cache in the same way that you
> > would resolve it if it were deployed to a remote cache. That's just the
> > build step. Actually running the application will require the necessary
> > jars to be accessible by the cluster nodes.
> >  
> >  
> > > On Mon, May 19, 2014 at 7:04 PM, Nan Zhu <zhunanmcg...@gmail.com> wrote:
> >  
> > > en, you have to put spark-assembly-*.jar to the lib directory of your
> > > application
> > >  
> > > Best,
> > >  
> > > --
> > > Nan Zhu
> > >  
> > >  
> > > On Monday, May 19, 2014 at 9:48 PM, nit wrote:
> > >  
> > > > I am not very comfortable with sbt. I want to build a standalone
> > > application
> > > > using spark 1.0 RC9. I can build sbt assembly for my application with
> > >  
> > > Spark
> > > > 0.9.1, and I think in that case spark is pulled from the Akka Repository?
> > > >  
> > > > Now if I want to use 1.0 RC9 for my application; what is the process ?
> > > > (FYI, I was able to build spark-1.0 via sbt/assembly and I can see
> > > > sbt-assembly jar; and I think I will have to copy my jar somewhere? and
> > > > update build.sbt?)
> > > >  
> > > > PS: I am not sure if this is the right place for this question; but 
> > > > since
> > > > 1.0 is still RC, I felt that this may be appropriate forum.
> > > >  
> > > > thanks!
> > > >  
> > > >  
> > > >  
> > > > --
> > > > View this message in context:
> > > >  
> > >  
> > > http://apache-spark-developers-list.1001551.n3.nabble.com/spark-1-0-standalone-application-tp6698.html
> > > > Sent from the Apache Spark Developers List mailing list archive at
> > >  
> > > Nabble.com (http://Nabble.com).
> > > >  
> > >  
> > >  
> >  
> >  
>  
>  
>  




Re: spark 1.0 standalone application

2014-05-19 Thread Shivaram Venkataraman
On a related note, there is also a staging Apache repository where the
latest RC gets pushed to:
https://repository.apache.org/content/repositories/staging/org/apache/spark/spark-core_2.10/

The artifact here is just named "1.0.0" (similar to the rc specific
repository that Patrick mentioned). So if you just want to build your app
against the latest staging RC, you can add
"https://repository.apache.org/content/repositories/staging" to your
resolvers in SBT / Maven.
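In sbt, for example, that would be something like the following sketch (Maven users would add the same URL as a repository in pom.xml):

// Sketch: pull the staged Spark RC artifacts from the Apache staging repository.
resolvers += "Apache staging" at "https://repository.apache.org/content/repositories/staging/"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"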

Thanks
Shivaram


On Mon, May 19, 2014 at 10:23 PM, Nan Zhu  wrote:

> This is the first time I've heard of a temporary Maven repository…….
>
> --
> Nan Zhu
>
>
> On Monday, May 19, 2014 at 10:10 PM, Patrick Wendell wrote:
>
> > Whenever we publish a release candidate, we create a temporary maven
> > repository that hosts the artifacts. We do this precisely for the case
> > you are running into (where a user wants to build an application
> > against it to test).
> >
> > You can build against the release candidate by just adding that
> > repository in your sbt build, then linking against "spark-core"
> > version "1.0.0". For rc9 the repository is in the vote e-mail:
> >
> >
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-0-0-rc9-td6629.html
> >
> > On Mon, May 19, 2014 at 7:03 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
> > > That's the crude way to do it. If you run `sbt/sbt publishLocal`, then
> you
> > > can resolve the artifact from your local cache in the same way that you
> > > would resolve it if it were deployed to a remote cache. That's just the
> > > build step. Actually running the application will require the necessary
> > > jars to be accessible by the cluster nodes.
> > >
> > >
> > > > On Mon, May 19, 2014 at 7:04 PM, Nan Zhu <zhunanmcg...@gmail.com> wrote:
> > >
> > > > en, you have to put spark-assembly-*.jar to the lib directory of your
> > > > application
> > > >
> > > > Best,
> > > >
> > > > --
> > > > Nan Zhu
> > > >
> > > >
> > > > On Monday, May 19, 2014 at 9:48 PM, nit wrote:
> > > >
> > > > > I am not very comfortable with sbt. I want to build a standalone
> > > > application
> > > > > using spark 1.0 RC9. I can build sbt assembly for my application
> with
> > > >
> > > > Spark
> > > > > 0.9.1, and I think in that case spark is pulled from the Akka
> Repository?
> > > > >
> > > > > Now if I want to use 1.0 RC9 for my application; what is the
> process ?
> > > > > (FYI, I was able to build spark-1.0 via sbt/assembly and I can see
> > > > > sbt-assembly jar; and I think I will have to copy my jar
> somewhere? and
> > > > > update build.sbt?)
> > > > >
> > > > > PS: I am not sure if this is the right place for this question;
> but since
> > > > > 1.0 is still RC, I felt that this may be appropriate forum.
> > > > >
> > > > > thanks!
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > View this message in context:
> > > > >
> > > >
> > > >
> http://apache-spark-developers-list.1001551.n3.nabble.com/spark-1-0-standalone-application-tp6698.html
> > > > > Sent from the Apache Spark Developers List mailing list archive at
> > > >
> > > > Nabble.com (http://Nabble.com).
> > > > >
> > > >
> > > >
> > >
> > >
> >
> >
> >
>
>
>