Re: question on make multiple external calls within each partition

Tathagata Das Mon, 05 Oct 2015 18:16:03 -0700

You could create a threadpool on demand within the foreachPartitoin
function, then handoff the REST calls to that threadpool, get back the
futures and wait for them to finish. Should be pretty straightforward. Make
sure that your foreachPartition function cleans up the threadpool before
finishing. Alternatively, you can create an on-demand singleton threadpool
that is reused across batches, will reduce the cost of creating threadpools
everytime.


On Mon, Oct 5, 2015 at 6:07 PM, Ashish Soni <asoni.le...@gmail.com> wrote:

> Need more details but you might want to filter the data first ( create
> multiple RDD) and then process.
>
>
> > On Oct 5, 2015, at 8:35 PM, Chen Song <chen.song...@gmail.com> wrote:
> >
> > We have a use case with the following design in Spark Streaming.
> >
> > Within each batch,
> > * data is read and partitioned by some key
> > * forEachPartition is used to process the entire partition
> > * within each partition, there are several REST clients created to
> connect to different REST services
> > * for the list of records within each partition, it will call these
> services, each service call is independent of others; records are just
> pre-partitioned to make these calls more efficiently.
> >
> > I have a question
> > * Since each call is time taking and to prevent the calls to be executed
> sequentially, how can I parallelize the service calls within processing of
> each partition? Can I just use Scala future within forEachPartition(or
> mapPartitions)?
> >
> > Any suggestions greatly appreciated.
> >
> > Chen
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

Re: question on make multiple external calls within each partition

Reply via email to