Is each partition guaranteed to execute in a single thread in a worker? Thanks N B
On Fri, Jan 29, 2016 at 6:53 PM, Shixiong(Ryan) Zhu <[email protected] > wrote: > I see. Then you should use `mapPartitions` rather than using ThreadLocal. > E.g., > > dstream.mapPartitions( iter -> > val d = new SomeClass(); > return iter.map { p => > somefunc(p, d.get()) > }; > }; ); > > > On Fri, Jan 29, 2016 at 5:29 PM, N B <[email protected]> wrote: > >> Well won't the code in lambda execute inside multiple threads in the >> worker because it has to process many records? I would just want to have a >> single copy of SomeClass instantiated per thread rather than once per each >> record being processed. That was what triggered this thought anyways. >> >> Thanks >> NB >> >> >> On Fri, Jan 29, 2016 at 5:09 PM, Shixiong(Ryan) Zhu < >> [email protected]> wrote: >> >>> It looks weird. Why don't you just pass "new SomeClass()" to >>> "somefunc"? You don't need to use ThreadLocal if there are no multiple >>> threads in your codes. >>> >>> On Fri, Jan 29, 2016 at 4:39 PM, N B <[email protected]> wrote: >>> >>>> Fixed a typo in the code to avoid any confusion.... Please comment on >>>> the code below... >>>> >>>> dstream.map( p -> { ThreadLocal<SomeClass> d = new ThreadLocal<>() { >>>> public SomeClass initialValue() { return new SomeClass(); } >>>> }; >>>> somefunc(p, d.get()); >>>> d.remove(); >>>> return p; >>>> }; ); >>>> >>>> On Fri, Jan 29, 2016 at 4:32 PM, N B <[email protected]> wrote: >>>> >>>>> So this use of ThreadLocal will be inside the code of a function >>>>> executing on the workers i.e. within a call from one of the lambdas. Would >>>>> it just look like this then: >>>>> >>>>> dstream.map( p -> { ThreadLocal<Data> d = new ThreadLocal<>() { >>>>> public SomeClass initialValue() { return new SomeClass(); } >>>>> }; >>>>> somefunc(p, d.get()); >>>>> d.remove(); >>>>> return p; >>>>> }; ); >>>>> >>>>> Will this make sure that all threads inside the worker clean up the >>>>> ThreadLocal once they are done with processing this task? >>>>> >>>>> Thanks >>>>> NB >>>>> >>>>> >>>>> On Fri, Jan 29, 2016 at 1:00 PM, Shixiong(Ryan) Zhu < >>>>> [email protected]> wrote: >>>>> >>>>>> Spark Streaming uses threadpools so you need to remove ThreadLocal >>>>>> when it's not used. >>>>>> >>>>>> On Fri, Jan 29, 2016 at 12:55 PM, N B <[email protected]> wrote: >>>>>> >>>>>>> Thanks for the response Ryan. So I would say that it is in fact the >>>>>>> purpose of a ThreadLocal i.e. to have a copy of the variable as long as >>>>>>> the >>>>>>> thread lives. I guess my concern is around usage of threadpools and >>>>>>> whether >>>>>>> Spark streaming will internally create many threads that rotate between >>>>>>> tasks on purpose thereby holding onto ThreadLocals that may actually >>>>>>> never >>>>>>> be used again. >>>>>>> >>>>>>> Thanks >>>>>>> >>>>>>> On Fri, Jan 29, 2016 at 12:12 PM, Shixiong(Ryan) Zhu < >>>>>>> [email protected]> wrote: >>>>>>> >>>>>>>> Of cause. If you use a ThreadLocal in a long living thread and >>>>>>>> forget to remove it, it's definitely a memory leak. >>>>>>>> >>>>>>>> On Thu, Jan 28, 2016 at 9:31 PM, N B <[email protected]> wrote: >>>>>>>> >>>>>>>>> Hello, >>>>>>>>> >>>>>>>>> Does anyone know if there are any potential pitfalls associated >>>>>>>>> with using ThreadLocal variables in a Spark streaming application? One >>>>>>>>> things I have seen mentioned in the context of app servers that use >>>>>>>>> thread >>>>>>>>> pools is that ThreadLocals can leak memory. Could this happen in Spark >>>>>>>>> streaming also? >>>>>>>>> >>>>>>>>> Thanks >>>>>>>>> Nikunj >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> >
