I think adding jars dynamically should work as long as the primary jar and the secondary jars do not depend on dynamically added jars, which should be the correct logic. -Xiangrui
On Wed, May 21, 2014 at 1:40 PM, DB Tsai <dbt...@stanford.edu> wrote:
> This is a separate story.
>
> In the YARN deployment, as Sandy said, app.jar is always on the system
> classloader, which means any object instantiated by code in app.jar will be
> loaded through the system classloader instead of the custom one. As a result,
> the custom classloader on YARN will never work without explicitly using
> reflection.
>
> One solution is to leave the system classloader out of the classloader
> hierarchy and add all of its resources to the custom one. This is the
> approach Tomcat takes.
>
> Or we can directly modify the system classloader by calling the
> protected method `addURL`, which will fail and throw an exception if the
> code runs under a security manager.
>
> Sincerely,
>
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
> On Wed, May 21, 2014 at 1:13 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>
>> This will solve the issue for jars added upon application submission, but,
>> on top of this, we need to make sure that anything dynamically added
>> through sc.addJar works as well.
>>
>> To do so, we need to make sure that any jars retrieved via the driver's
>> HTTP server are loaded by the same classloader that loads the jars given on
>> app submission. To achieve this, we need to either use the same
>> classloader for both system jars and user jars, or make sure that the user
>> jars given on app submission are under the same classloader used for
>> dynamically added jars.
>>
>> On Tue, May 20, 2014 at 5:59 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>
>> > Talked with Sandy and DB offline.
>> > I think the best solution is sending
>> > the secondary jars to the distributed cache of all containers rather
>> > than just the master, and setting the classpath to include the Spark jar,
>> > the primary app jar, and the secondary jars before the executor starts.
>> > This way, users only need to specify secondary jars via --jars instead of
>> > calling sc.addJar inside the code. It also solves the scalability
>> > problem of serving all the jars via HTTP.
>> >
>> > If this solution sounds good, I can try to make a patch.
>> >
>> > Best,
>> > Xiangrui
>> >
>> > On Mon, May 19, 2014 at 10:04 PM, DB Tsai <dbt...@stanford.edu> wrote:
>> > > In 1.0, there is a new option for users to choose which classloader has
>> > > higher priority via spark.files.userClassPathFirst. I decided to submit the
>> > > PR for 0.9 first. We use this patch in our lab and we can use those jars
>> > > added by sc.addJar without reflection.
>> > >
>> > > https://github.com/apache/spark/pull/834
>> > >
>> > > Can anyone comment on whether it's a good approach?
>> > >
>> > > Thanks.
>> > >
>> > > Sincerely,
>> > >
>> > > DB Tsai
>> > > -------------------------------------------------------
>> > > My Blog: https://www.dbtsai.com
>> > > LinkedIn: https://www.linkedin.com/in/dbtsai
>> > >
>> > > On Mon, May 19, 2014 at 7:42 PM, DB Tsai <dbt...@stanford.edu> wrote:
>> > >
>> > >> Good summary! We fixed it in branch 0.9 since our production is still on
>> > >> 0.9. I'm porting it to 1.0 now, and hopefully will submit a PR for 1.0
>> > >> tonight.
>> > >>
>> > >> Sincerely,
>> > >>
>> > >> DB Tsai
>> > >> -------------------------------------------------------
>> > >> My Blog: https://www.dbtsai.com
>> > >> LinkedIn: https://www.linkedin.com/in/dbtsai
>> > >>
>> > >> On Mon, May 19, 2014 at 7:38 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>> > >>
>> > >>> It just hit me why this problem is showing up on YARN and not on
>> > >>> standalone.
>> > >>>
>> > >>> The relevant difference between YARN and standalone is that, on YARN, the
>> > >>> app jar is loaded by the system classloader instead of Spark's custom URL
>> > >>> classloader.
>> > >>>
>> > >>> On YARN, the system classloader knows about [the classes in the spark
>> > >>> jars, the classes in the primary app jar]. The custom classloader knows about
>> > >>> [the classes in secondary app jars] and has the system classloader as its
>> > >>> parent.
>> > >>>
>> > >>> A few relevant facts (mostly redundant with what Sean pointed out):
>> > >>> * Every class has a classloader that loaded it.
>> > >>> * When an object of class B is instantiated inside of class A, the
>> > >>>   classloader used for loading B is the classloader that was used for
>> > >>>   loading A.
>> > >>> * When a classloader fails to load a class, it lets its parent classloader
>> > >>>   try. If its parent succeeds, its parent becomes the "classloader that
>> > >>>   loaded it".
>> > >>>
>> > >>> So suppose class B is in a secondary app jar and class A is in the primary
>> > >>> app jar:
>> > >>> 1. The custom classloader will try to load class A.
>> > >>> 2. It will fail, because it only knows about the secondary jars.
>> > >>> 3. It will delegate to its parent, the system classloader.
>> > >>> 4. The system classloader will succeed, because it knows about the primary
>> > >>>    app jar.
>> > >>> 5. A's classloader will be the system classloader.
>> > >>> 6. A tries to instantiate an instance of class B.
>> > >>> 7. B will be loaded with A's classloader, which is the system classloader.
>> > >>> 8. Loading B will fail, because A's classloader, which is the system
>> > >>>    classloader, doesn't know about the secondary app jars.
>> > >>>
>> > >>> In Spark standalone, A and B are both loaded by the custom classloader, so
>> > >>> this issue doesn't come up.
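Inline note: steps 1-5 above can be reproduced in a few lines of plain Java. This is only a sketch under stated assumptions, not Spark code. `DelegationDemo` is a hypothetical class that plays the role of A, its own loader stands in for the system classloader, and an empty `URLClassLoader` stands in for the custom loader. The stock `ClassLoader.loadClass` actually consults the parent before looking in its own URLs, but the outcome is the same as in step 5: the class is defined by the parent, not by the loader it was requested through.

```java
import java.net.URL;
import java.net.URLClassLoader;

class DelegationDemo {
    /** Requests this class through a child loader and returns the loader that defined it. */
    static ClassLoader definingLoaderViaChild() throws Exception {
        // Stand-in for the system classloader that already knows the "primary app jar".
        ClassLoader parent = DelegationDemo.class.getClassLoader();
        // Stand-in for the custom loader: no URLs of its own, so it must delegate.
        try (URLClassLoader child = new URLClassLoader(new URL[0], parent)) {
            Class<?> a = Class.forName(DelegationDemo.class.getName(), false, child);
            // The defining loader is the parent, even though we asked the child.
            return a.getClassLoader();
        }
    }

    public static void main(String[] args) throws Exception {
        ClassLoader defining = definingLoaderViaChild();
        System.out.println("defined by parent: "
                + (defining == DelegationDemo.class.getClassLoader()));
    }
}
```

Any class that A then references with a plain `new` is resolved through that defining loader, which is why the secondary jars known only to the child are never consulted.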
>> > >>>
>> > >>> -Sandy
>> > >>>
>> > >>> On Mon, May 19, 2014 at 7:07 PM, Patrick Wendell <pwend...@gmail.com> wrote:
>> > >>>
>> > >>> > Having a user define a custom class inside of an added jar and
>> > >>> > instantiate it directly inside of an executor is definitely supported
>> > >>> > in Spark and has been for a really long time (several years). This is
>> > >>> > something we do all the time in Spark.
>> > >>> >
>> > >>> > DB - I'd hold off on re-architecting this until we identify
>> > >>> > exactly what is causing the bug you are running into.
>> > >>> >
>> > >>> > In a nutshell, when the bytecode "new Foo()" is run on the executor,
>> > >>> > it will ask the driver for the class over HTTP using a custom
>> > >>> > classloader. Something in that pipeline is breaking here, possibly
>> > >>> > related to the YARN deployment stuff.
>> > >>> >
>> > >>> > On Mon, May 19, 2014 at 12:29 AM, Sean Owen <so...@cloudera.com> wrote:
>> > >>> > > I don't think a custom classloader is necessary.
>> > >>> > >
>> > >>> > > Well, it occurs to me that this is no new problem. Hadoop, Tomcat, etc.
>> > >>> > > all run custom user code that creates new user objects without
>> > >>> > > reflection. I should go see how that's done. Maybe it's totally valid
>> > >>> > > to set the thread's context classloader for just this purpose, and I
>> > >>> > > am not thinking clearly.
>> > >>> > >
>> > >>> > > On Mon, May 19, 2014 at 8:26 AM, Andrew Ash <and...@andrewash.com> wrote:
>> > >>> > >> Sounds like the problem is that classloaders always look in their parents
>> > >>> > >> before themselves, and Spark users want executors to pick up classes from
>> > >>> > >> their custom code before the ones in Spark plus its dependencies.
>> > >>> > >>
>> > >>> > >> Would a custom classloader that delegates to the parent after first
>> > >>> > >> checking itself fix this up?
>> > >>> > >>
>> > >>> > >> On Mon, May 19, 2014 at 12:17 AM, DB Tsai <dbt...@stanford.edu> wrote:
>> > >>> > >>
>> > >>> > >>> Hi Sean,
>> > >>> > >>>
>> > >>> > >>> It's true that the issue here is the classloader, and due to the classloader
>> > >>> > >>> delegation model, users have to use reflection in the executors to pick up
>> > >>> > >>> the classloader in order to use the classes added through the sc.addJar API.
>> > >>> > >>> However, this is very inconvenient for users, and not documented in Spark.
>> > >>> > >>>
>> > >>> > >>> I'm working on a patch to solve it by calling the protected method addURL
>> > >>> > >>> in URLClassLoader to update the current default classloader, so no
>> > >>> > >>> customClassLoader is needed anymore. I wonder if this is a good way to go.
>> > >>> > >>>
>> > >>> > >>> import java.io.IOException
>> > >>> > >>> import java.lang.reflect.Method
>> > >>> > >>> import java.net.{URL, URLClassLoader}
>> > >>> > >>>
>> > >>> > >>> // Reflectively call the protected URLClassLoader.addURL on `loader`.
>> > >>> > >>> private def addURL(url: URL, loader: URLClassLoader): Unit = {
>> > >>> > >>>   try {
>> > >>> > >>>     val method: Method =
>> > >>> > >>>       classOf[URLClassLoader].getDeclaredMethod("addURL", classOf[URL])
>> > >>> > >>>     method.setAccessible(true)
>> > >>> > >>>     method.invoke(loader, url)
>> > >>> > >>>   } catch {
>> > >>> > >>>     case t: Throwable =>
>> > >>> > >>>       // Keep the original failure as the cause instead of discarding it.
>> > >>> > >>>       throw new IOException("Error, could not add URL to system classloader", t)
>> > >>> > >>>   }
>> > >>> > >>> }
>> > >>> > >>>
>> > >>> > >>> Sincerely,
>> > >>> > >>>
>> > >>> > >>> DB Tsai
>> > >>> > >>> -------------------------------------------------------
>> > >>> > >>> My Blog: https://www.dbtsai.com
>> > >>> > >>> LinkedIn: https://www.linkedin.com/in/dbtsai
>> > >>> > >>>
>> > >>> > >>> On Sun, May 18, 2014 at 11:57 PM, Sean Owen <so...@cloudera.com> wrote:
>> > >>> > >>>
>> > >>> > >>> > I might be stating the obvious for everyone, but the issue here is not
>> > >>> > >>> > reflection or the source of the JAR, but the ClassLoader. The basic
>> > >>> > >>> > rules are these.
>> > >>> > >>> >
>> > >>> > >>> > "new Foo" will use the ClassLoader that defines Foo. This is usually
>> > >>> > >>> > the ClassLoader that loaded whatever it is that first referenced Foo
>> > >>> > >>> > and caused it to be loaded -- usually the ClassLoader holding your
>> > >>> > >>> > other app classes.
>> > >>> > >>> >
>> > >>> > >>> > ClassLoaders can have a parent-child relationship. ClassLoaders always
>> > >>> > >>> > look in their parent before themselves.
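Inline note: the parent-first rule is the default behaviour of `ClassLoader.loadClass`, not a hard constraint, and the child-first loader Andrew asked about can be written by overriding it. The class below is a hypothetical sketch (the name `ChildFirstClassLoader` is mine). It is not claimed to be how Spark implements `spark.files.userClassPathFirst`.

```java
import java.net.URL;
import java.net.URLClassLoader;

class ChildFirstClassLoader extends URLClassLoader {
    ChildFirstClassLoader(URL[] urls, ClassLoader parent) {
        super(urls, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            // Reuse the class if this loader has already defined it.
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                try {
                    // Look in this loader's own URLs before asking the parent.
                    c = findClass(name);
                } catch (ClassNotFoundException e) {
                    // Not in our URLs: fall back to the normal parent-first lookup.
                    return super.loadClass(name, resolve);
                }
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }
}
```

With this ordering a class present both in the loader's URLs and in the parent resolves to the copy in the URLs, which is the reverse of the container-wins behaviour described in the rules above. JDK classes such as `java.lang.String` are not in the URLs and still come from the parent chain.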
>> > >>> > >>> >
>> > >>> > >>> > (Careful then -- in contexts like Hadoop or Tomcat where your app is
>> > >>> > >>> > loaded in a child ClassLoader, and you reference a class that Hadoop
>> > >>> > >>> > or Tomcat also has (like a lib class) you will get the container's
>> > >>> > >>> > version!)
>> > >>> > >>> >
>> > >>> > >>> > When you load an external JAR it has a separate ClassLoader which does
>> > >>> > >>> > not necessarily bear any relation to the one containing your app
>> > >>> > >>> > classes, so yeah it is not generally going to make "new Foo" work.
>> > >>> > >>> >
>> > >>> > >>> > Reflection lets you pick the ClassLoader, yes.
>> > >>> > >>> >
>> > >>> > >>> > I would not call setContextClassLoader.
>> > >>> > >>> >
>> > >>> > >>> > On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>> > >>> > >>> > > I spoke with DB offline about this a little while ago and he confirmed that
>> > >>> > >>> > > he was able to access the jar from the driver.
>> > >>> > >>> > >
>> > >>> > >>> > > The issue appears to be a general Java issue: you can't directly
>> > >>> > >>> > > instantiate a class from a dynamically loaded jar.
>> > >>> > >>> > >
>> > >>> > >>> > > I reproduced it locally outside of Spark with:
>> > >>> > >>> > > ---
>> > >>> > >>> > >     URLClassLoader urlClassLoader = new URLClassLoader(new URL[] {
>> > >>> > >>> > >         new File("myotherjar.jar").toURI().toURL() }, null);
>> > >>> > >>> > >     Thread.currentThread().setContextClassLoader(urlClassLoader);
>> > >>> > >>> > >     MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar();
>> > >>> > >>> > > ---
>> > >>> > >>> > >
>> > >>> > >>> > > I was able to load the class with reflection.
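Inline note: the reflective path Sandy mentions at the end could look roughly like the sketch below. `ReflectiveLoadDemo` and `instantiate` are hypothetical names, and the class is assumed to have a no-arg constructor. To stay runnable without an extra jar, `main` uses a loader with no URLs and a JDK class. In the repro above, the URL array would hold the `myotherjar.jar` URL and the class name would be `MyClassFromMyOtherJar`.

```java
import java.net.URL;
import java.net.URLClassLoader;

class ReflectiveLoadDemo {
    /** Creates an instance of className via the given loader, using its no-arg constructor. */
    static Object instantiate(ClassLoader loader, String className) throws Exception {
        // Naming the loader explicitly bypasses the caller's own defining loader,
        // which is the one a plain `new` expression would have used.
        Class<?> cls = Class.forName(className, true, loader);
        return cls.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        // No URLs and a null parent: only bootstrap (JDK) classes are visible here.
        try (URLClassLoader loader = new URLClassLoader(new URL[0], null)) {
            Object obj = instantiate(loader, "java.lang.StringBuilder");
            System.out.println(obj.getClass().getName());
        }
    }
}
```

The result is typed as `Object` because the calling code cannot name the class at compile time. In practice it is usually cast to an interface that both loaders can see.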