Re: Spark streaming and ThreadLocal

N B Mon, 01 Feb 2016 23:28:20 -0800

Is each partition guaranteed to execute in a single thread in a worker?

Thanks
N B



On Fri, Jan 29, 2016 at 6:53 PM, Shixiong(Ryan) Zhu <[email protected]
> wrote:

> I see. Then you should use `mapPartitions` rather than using ThreadLocal.
> E.g.,
>
> dstream.mapPartitions( iter ->
>     val d = new SomeClass();
>     return iter.map { p =>
>        somefunc(p, d.get())
>     };
> }; );
>
>
> On Fri, Jan 29, 2016 at 5:29 PM, N B <[email protected]> wrote:
>
>> Well won't the code in lambda execute inside multiple threads in the
>> worker because it has to process many records? I would just want to have a
>> single copy of SomeClass instantiated per thread rather than once per each
>> record being processed. That was what triggered this thought anyways.
>>
>> Thanks
>> NB
>>
>>
>> On Fri, Jan 29, 2016 at 5:09 PM, Shixiong(Ryan) Zhu <
>> [email protected]> wrote:
>>
>>> It looks weird. Why don't you just pass "new SomeClass()" to
>>> "somefunc"? You don't need to use ThreadLocal if there are no multiple
>>> threads in your codes.
>>>
>>> On Fri, Jan 29, 2016 at 4:39 PM, N B <[email protected]> wrote:
>>>
>>>> Fixed a typo in the code to avoid any confusion.... Please comment on
>>>> the code below...
>>>>
>>>> dstream.map( p -> { ThreadLocal<SomeClass> d = new ThreadLocal<>() {
>>>>          public SomeClass initialValue() { return new SomeClass(); }
>>>>     };
>>>>     somefunc(p, d.get());
>>>>     d.remove();
>>>>     return p;
>>>> }; );
>>>>
>>>> On Fri, Jan 29, 2016 at 4:32 PM, N B <[email protected]> wrote:
>>>>
>>>>> So this use of ThreadLocal will be inside the code of a function
>>>>> executing on the workers i.e. within a call from one of the lambdas. Would
>>>>> it just look like this then:
>>>>>
>>>>> dstream.map( p -> { ThreadLocal<Data> d = new ThreadLocal<>() {
>>>>>          public SomeClass initialValue() { return new SomeClass(); }
>>>>>     };
>>>>>     somefunc(p, d.get());
>>>>>     d.remove();
>>>>>     return p;
>>>>> }; );
>>>>>
>>>>> Will this make sure that all threads inside the worker clean up the
>>>>> ThreadLocal once they are done with processing this task?
>>>>>
>>>>> Thanks
>>>>> NB
>>>>>
>>>>>
>>>>> On Fri, Jan 29, 2016 at 1:00 PM, Shixiong(Ryan) Zhu <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Spark Streaming uses threadpools so you need to remove ThreadLocal
>>>>>> when it's not used.
>>>>>>
>>>>>> On Fri, Jan 29, 2016 at 12:55 PM, N B <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks for the response Ryan. So I would say that it is in fact the
>>>>>>> purpose of a ThreadLocal i.e. to have a copy of the variable as long as 
>>>>>>> the
>>>>>>> thread lives. I guess my concern is around usage of threadpools and 
>>>>>>> whether
>>>>>>> Spark streaming will internally create many threads that rotate between
>>>>>>> tasks on purpose thereby holding onto ThreadLocals that may actually 
>>>>>>> never
>>>>>>> be used again.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> On Fri, Jan 29, 2016 at 12:12 PM, Shixiong(Ryan) Zhu <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Of cause. If you use a ThreadLocal in a long living thread and
>>>>>>>> forget to remove it, it's definitely a memory leak.
>>>>>>>>
>>>>>>>> On Thu, Jan 28, 2016 at 9:31 PM, N B <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> Does anyone know if there are any potential pitfalls associated
>>>>>>>>> with using ThreadLocal variables in a Spark streaming application? One
>>>>>>>>> things I have seen mentioned in the context of app servers that use 
>>>>>>>>> thread
>>>>>>>>> pools is that ThreadLocals can leak memory. Could this happen in Spark
>>>>>>>>> streaming also?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Nikunj
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Spark streaming and ThreadLocal

Reply via email to