Re: Isolate 1 partition and perform computations

2018-04-14 Thread Matthias Boehm
you might wanna have a look into using a PartitionPruningRDD to select a subset of partitions by ID. This approach worked very well for multi-key lookups for us [1]. A major advantage compared to scan-based operations is that, if your source RDD has an existing partitioner, only relevant partition
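
The pruning idea in this message can be illustrated without Spark. The following is a minimal plain-Java sketch, not the PartitionPruningRDD API: the class `PartitionPruningSketch` and all of its methods are hypothetical names, and it assumes the data was laid out by a hash partitioner that uses the non-negative modulo of the key's hash code. For a multi-key lookup it first computes the IDs of the partitions that can hold the keys, then reads only those partitions instead of scanning all of them.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Plain-Java illustration only; none of these names are Spark API.
class PartitionPruningSketch {

    // Assumed partitioning scheme: non-negative modulo of the key's hash code.
    static int partitionOf(Object key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Distribute key-value pairs into numPartitions in-memory partitions.
    static List<Map<Integer, String>> partition(Map<Integer, String> data, int numPartitions) {
        List<Map<Integer, String>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new HashMap<>());
        }
        for (Map.Entry<Integer, String> e : data.entrySet()) {
            parts.get(partitionOf(e.getKey(), numPartitions)).put(e.getKey(), e.getValue());
        }
        return parts;
    }

    // IDs of the partitions that can contain any of the requested keys.
    static TreeSet<Integer> relevantPartitions(List<Integer> keys, int numPartitions) {
        TreeSet<Integer> ids = new TreeSet<>();
        for (Integer k : keys) {
            ids.add(partitionOf(k, numPartitions));
        }
        return ids;
    }

    // Multi-key lookup that touches only the relevant partitions.
    static Map<Integer, String> lookup(List<Map<Integer, String>> parts, List<Integer> keys) {
        Map<Integer, String> result = new HashMap<>();
        for (int id : relevantPartitions(keys, parts.size())) {
            Map<Integer, String> part = parts.get(id);
            for (Integer k : keys) {
                if (part.containsKey(k)) {
                    result.put(k, part.get(k));
                }
            }
        }
        return result;
    }
}
```

With 16 integer keys spread over 4 partitions, a lookup of keys 1, 5 and 9 maps to a single partition under this scheme, so the other three partitions are never read.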

Re: Fair scheduler pool leak

2018-04-07 Thread Matthias Boehm
ically as Jobs execute? If so, > that is a very unusual thing to do. Scheduling pools are intended to be > statically configured -- initialized, living and dying with the Application. > > On Sat, Apr 7, 2018 at 12:33 AM, Matthias Boehm wrote: >> >> Thanks for the c

Re: Fair scheduler pool leak

2018-04-07 Thread Matthias Boehm
pool named "default" if you > don't define your own "default". > > On Sat, Apr 7, 2018 at 2:32 PM, Matthias Boehm wrote: >> >> No, these pools are not created per job but per parfor worker and >> thus, used to execute many jobs. For all scripts with a

Re: Fair scheduler pool leak

2018-04-07 Thread Matthias Boehm
the use case for creating pools this way? > > Also if I understand correctly, it doesn't even matter if the thread dies -- > that pool will still stay around, as the rootPool will retain a reference to > it (the pools aren't really actually tied to specific threads). >

Fair scheduler pool leak

2018-04-05 Thread Matthias Boehm
Hi all, for concurrent Spark jobs spawned from the driver, we use Spark's fair scheduler pools, which are set and unset in a thread-local manner by each worker thread. Typically (for rather long jobs), this works very well. Unfortunately, in an application with lots of very short parallel sections

Broadcast Memory Management

2017-09-20 Thread Matthias Boehm
Hi all, could someone please help me understand the broadcast life cycle in detail, especially with regard to memory management? After reading through the TorrentBroadcast implementation, it seems that for every broadcast object, the driver holds a strong reference to a shallow copy (in MEMORY_AN

Feedback on JavaPairRDD API

2017-04-15 Thread Matthias Boehm
After using Spark for many years, including SystemML's Spark backend, I'd like to give some feedback on potential PairRDD API extensions that I would find very useful: 1) MapToPair with preservesPartitioning flag: For many binary operations with broadcasts, we always need to use mapPartitionsToPai