You can try to define a wrapper class for your parser and create the parser instance in its companion object, so it acts as a singleton. That way, even if you create a new wrapper object inside mapPartitions every time, each JVM will still hold only a single instance of your parser.
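A minimal sketch of this pattern, assuming a hypothetical `Parser` class (standing in for the expensive, non-serializable third-party parser) and a `parse(line)` method; the names are illustrative, not from the thread:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical stand-in for the third-party parser: expensive to construct
// and not serializable.
class Parser {
  def parse(line: String): String = line.trim // placeholder logic
}

// Wrapper class that is cheap to create and serializable; the real parser
// lives in the companion object, so there is only one instance per JVM.
class ParserWrapper extends Serializable {
  def parse(line: String): String = ParserWrapper.parser.parse(line)
}

object ParserWrapper {
  // Initialized lazily, once per executor JVM, on first use; it is never
  // shipped from the driver because only the lightweight wrapper is serialized.
  lazy val parser: Parser = new Parser
}

object Example {
  def parseLines(lines: RDD[String]): RDD[String] =
    lines.mapPartitions { iter =>
      // Creating a new wrapper here is fine: all wrappers in the same
      // executor JVM delegate to the single shared parser instance.
      val wrapper = new ParserWrapper
      iter.map(line => wrapper.parse(line))
    }
}
```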
Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Mon, Aug 4, 2014 at 2:01 AM, Fengyun RAO <raofeng...@gmail.com> wrote:
> Thanks, Sean!
>
> It works, but as the link in 2, "Why Is My Spark Job so Slow and Only Using
> a Single Thread?", says, the "parser instance is now a singleton created in
> the scope of our driver program", which I thought was in the scope of the
> executor. Am I wrong, or why?
>
> I didn't want the equivalent of "setup()", since I want to share the
> "parser" among tasks on the same worker node. It takes tens of seconds to
> initialize a "parser". What's more, I want to know if the "parser" could
> have a field such as a ConcurrentHashMap which all tasks in the node may
> get() or put() items into.
>
>
> 2014-08-04 16:35 GMT+08:00 Sean Owen <so...@cloudera.com>:
>
>> The parser does not need to be serializable. In the line:
>>
>> lines.map(line => JSONParser.parse(line))
>>
>> ... the parser is called, but there is no parser object with state
>> that can be serialized. Are you sure it does not work?
>>
>> The error message alluded to originally refers to an object not shown
>> in the code, so I'm not 100% sure this was the original issue.
>>
>> If you want, the equivalent of "setup()" is really "writing some code
>> at the start of a call to mapPartitions()".
>>
>> On Mon, Aug 4, 2014 at 8:40 AM, Fengyun RAO <raofeng...@gmail.com> wrote:
>> > Thanks, Ron.
>> >
>> > The problem is that the "parser" is written in another package which
>> > is not serializable.
>> >
>> > In MapReduce, I could create the "parser" in the map setup() method.
>> >
>> > Now in Spark, I want to create it for each worker and share it among
>> > all the tasks on the same worker node.
>> >
>> > I know different workers run on different machines, but it doesn't
>> > have to communicate between workers.
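For completeness, a hedged sketch (reusing the hypothetical `Parser`/`ParserWrapper` names from the sketch above) of the two points raised in the quoted thread: code at the start of a mapPartitions() call as the "setup()" equivalent, and a thread-safe ConcurrentHashMap on a JVM-level singleton that all tasks in the same executor can get()/put() against:

```scala
import java.util.concurrent.ConcurrentHashMap
import org.apache.spark.rdd.RDD

// Shared by every task running in the same executor JVM; ConcurrentHashMap
// keeps concurrent get()/put() calls from different tasks thread-safe.
object SharedState {
  val cache = new ConcurrentHashMap[String, String]()
}

object SetupExample {
  def parseLines(lines: RDD[String]): RDD[String] =
    lines.mapPartitions { iter =>
      // The "setup()" equivalent: code here runs once per partition,
      // before the first record of the partition is processed.
      val cache = SharedState.cache
      iter.map { line =>
        val parsed = ParserWrapper.parser.parse(line)
        cache.putIfAbsent(line, parsed) // visible to other tasks in this JVM
        parsed
      }
    }
}
```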