You can try defining a wrapper class for your parser and creating the
parser instance in the wrapper's companion object as a singleton. That way,
even if you create a new wrapper object in mapPartitions every time, each
JVM will still have only a single instance of the parser.
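
For example, a rough sketch along these lines (Parser and its parse()
method below are placeholders for your actual third-party parser class):

  // Wrapper whose companion object holds the parser as a JVM-wide singleton.
  class ParserWrapper {
    def parse(line: String) = ParserWrapper.parser.parse(line)
  }

  object ParserWrapper {
    // lazy val: the expensive parser is built on first access, once per
    // executor JVM, and is then reused by every task running in that JVM.
    lazy val parser = new Parser()
  }

You can then use it from mapPartitions (or a plain map), e.g.

  lines.mapPartitions { iter =>
    val wrapper = new ParserWrapper()  // cheap; the parser itself is shared
    iter.map(line => wrapper.parse(line))
  }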

Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Mon, Aug 4, 2014 at 2:01 AM, Fengyun RAO <raofeng...@gmail.com> wrote:
> Thanks, Sean!
>
> It works, but as the link in 2 - Why Is My Spark Job so Slow and Only Using
> a Single Thread? says, the "parser instance is now a singleton created in the
> scope of our driver program", which I thought would be in the scope of the
> executor. Am I wrong, and if so, why?
>
> I didn't want the equivalent of a "setup()" method, since I want to share the
> "parser" among tasks on the same worker node. It takes tens of seconds to
> initialize a "parser". What's more, I want to know whether the "parser" could
> have a field such as a ConcurrentHashMap on which all tasks in the node may
> get() or put() items.
>
>
>
>
> 2014-08-04 16:35 GMT+08:00 Sean Owen <so...@cloudera.com>:
>
>> The parser does not need to be serializable. In the line:
>>
>> lines.map(line => JSONParser.parse(line))
>>
>> ... the parser is called but there is no parser object with state
>> that can be serialized. Are you sure it does not work?
>>
>> The error message alluded to originally refers to an object not shown
>> in the code, so I'm not 100% sure this was the original issue.
>>
>> If you want, the equivalent of "setup()" is really "writing some code
>> at the start of a call to mapPartitions()".
>>
>> On Mon, Aug 4, 2014 at 8:40 AM, Fengyun RAO <raofeng...@gmail.com> wrote:
>> > Thanks, Ron.
>> >
>> > The problem is that the "parser" is written in another package and is not
>> > serializable.
>> >
>> > In MapReduce, I could create the "parser" in the map setup() method.
>> >
>> > Now in Spark, I want to create it for each worker, and share it among all
>> > the tasks on the same worker node.
>> >
>> > I know different workers run on different machines, but there is no need
>> > to communicate between workers.
>
>
