you might want to have a look at using a PartitionPruningRDD to select
a subset of partitions by ID. This approach worked very well for
multi-key lookups for us [1].
A major advantage compared to scan-based operations is that, if your
source RDD has an existing partitioner, only the relevant partitions
need to be scanned.
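For illustration, here is a minimal sketch of that pattern, assuming
Spark's Scala API and the developer API PartitionPruningRDD.create
(the data, lookup keys, and partition count below are made up):

import org.apache.spark.rdd.{PartitionPruningRDD, RDD}
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PruningLookupSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pruning-lookup").setMaster("local[*]"))

    // Pair RDD with a known partitioner, so keys map deterministically to partition IDs.
    val partitioner = new HashPartitioner(16)
    val data: RDD[(Long, String)] = sc
      .parallelize((0L until 100000L).map(k => (k, s"value-$k")))
      .partitionBy(partitioner)

    // Derive the partition IDs that can contain the requested keys.
    val lookupKeys = Set(3L, 42L, 97731L)
    val wanted = lookupKeys.map(k => partitioner.getPartition(k))

    // PartitionPruningRDD only launches tasks for the selected partitions;
    // all other partitions are never computed.
    val pruned = PartitionPruningRDD.create(data, pid => wanted.contains(pid))
    val result = pruned.filter { case (k, _) => lookupKeys.contains(k) }.collect()

    result.foreach(println)
    sc.stop()
  }
}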
> Are you creating pools dynamically as Jobs execute? If so,
> that is a very unusual thing to do. Scheduling pools are intended to be
> statically configured -- initialized, living and dying with the Application.
>
> On Sat, Apr 7, 2018 at 12:33 AM, Matthias Boehm wrote:
>>
>> Thanks for the c
> pool named "default" if you
> don't define your own "default".
>
> On Sat, Apr 7, 2018 at 2:32 PM, Matthias Boehm wrote:
>>
>> No, these pools are not created per job but per parfor worker and
>> are thus used to execute many jobs. For all scripts with a
> the use case for creating pools this way?
>
> Also if I understand correctly, it doesn't even matter if the thread dies --
> that pool will still stay around, as the rootPool will retain a reference to
> it (the pools aren't actually tied to specific threads).
>
Hi all,
for concurrent Spark jobs spawned from the driver, we use Spark's fair
scheduler pools, which are set and unset in a thread-local manner by
each worker thread. Typically (for rather long jobs), this works very
well. Unfortunately, in an application with lots of very short
parallel sections
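To make the setup concrete, here is a minimal sketch of how each worker
thread sets and unsets its pool (Scala 2.12+; the pool names, thread
count, job body, and the optional allocation-file path are made up):

import org.apache.spark.{SparkConf, SparkContext}

object FairPoolPerWorkerSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("fair-pools")
      .setMaster("local[*]")
      .set("spark.scheduler.mode", "FAIR")
      // Pools can also be declared statically up front:
      // .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
    val sc = new SparkContext(conf)

    val workers = (0 until 4).map { workerId =>
      new Thread(() => {
        // spark.scheduler.pool is a thread-local property: every job submitted
        // from this thread is scheduled in the named pool.
        sc.setLocalProperty("spark.scheduler.pool", s"parfor-pool-$workerId")
        try {
          val sum = sc.parallelize(1 to 1000000, 8).map(_.toLong).reduce(_ + _)
          println(s"worker $workerId -> $sum")
        } finally {
          // Unset the property so the thread falls back to the default pool.
          sc.setLocalProperty("spark.scheduler.pool", null)
        }
      })
    }
    workers.foreach(_.start())
    workers.foreach(_.join())
    sc.stop()
  }
}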
Hi all,
could someone please help me understand the broadcast life cycle in detail,
especially with regard to memory management?
After reading through the TorrentBroadcast implementation, it seems that
for every broadcast object, the driver holds a strong reference to a
shallow copy (stored at MEMORY_AND_DISK).
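For context, here is a minimal sketch of the lifecycle as we currently
use it (the data and sizes are made up); the cleanup semantics of
unpersist() vs. destroy() are the part I'd like to confirm:

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastLifecycleSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("bcast").setMaster("local[*]"))

    // Driver side: the value is chunked into blocks and a copy is kept in the
    // driver's block manager.
    val lookup = sc.broadcast((0 until 100000).map(i => i -> s"v$i").toMap)

    // Executor side: blocks are fetched lazily on the first access to .value.
    val matched = sc.parallelize(0 until 1000000, 16)
      .filter(i => lookup.value.contains(i % 200000))
      .count()
    println(s"matched = $matched")

    // Explicit cleanup: unpersist() drops the cached copies on the executors
    // (the broadcast can still be re-sent later); destroy() removes all data
    // and metadata, after which any further use of the broadcast fails.
    lookup.unpersist(blocking = true)
    lookup.destroy()

    sc.stop()
  }
}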
After using Spark for many years, including SystemML's Spark backend, I'd
like to give some feedback on potential PairRDD API extensions that I would
find very useful:
1) MapToPair with preservesPartitioning flag: For many binary operations
with broadcasts, we always need to use mapPartitionsToPair just to be
able to set preservesPartitioning, even though a record-level mapToPair
would be the natural fit.
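To illustrate with the Scala API (where mapPartitions with
preservesPartitioning=true is the equivalent of the Java-API
mapPartitionsToPair workaround; the data and names below are made up):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PreservePartitioningSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("preserve").setMaster("local[*]"))

    val keyed = sc.parallelize((0 until 1000).map(i => (i, i.toDouble)))
      .partitionBy(new HashPartitioner(8))
    val factor = sc.broadcast(2.0)

    // A record-level map (mapToPair in the Java API) drops the partitioner,
    // even though the keys are left untouched:
    val lost = keyed.map { case (k, v) => (k, v * factor.value) }
    assert(lost.partitioner.isEmpty)

    // Current workaround: rewrite the function partition-at-a-time, only to be
    // able to pass preservesPartitioning (mapPartitionsToPair in the Java API):
    val kept = keyed.mapPartitions(
      iter => iter.map { case (k, v) => (k, v * factor.value) },
      preservesPartitioning = true)
    assert(kept.partitioner.isDefined)

    println(kept.values.sum())
    sc.stop()
  }
}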