You could create a thread pool on demand within the foreachPartition function, hand the REST calls off to that pool, collect the resulting futures, and wait for them to finish. That should be pretty straightforward. Make sure your foreachPartition function shuts the thread pool down before it returns. Alternatively, you can create an on-demand singleton thread pool that is reused across batches, which avoids the cost of creating a thread pool every time.
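Something along these lines might work as a rough sketch of the per-partition variant. Note that callService, the String record type, the pool size of 8, and the 10-minute timeout are all placeholders I made up for illustration; substitute your own REST clients and limits.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical stand-in for one blocking REST call; replace with a real client.
def callService(record: String): String = s"ok:$record"

// Processes one partition: every record's call runs on a local thread pool.
def processPartition(records: Iterator[String]): Seq[String] = {
  val pool = Executors.newFixedThreadPool(8) // pool size is an assumption
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
  try {
    // Submit every call first so they run concurrently, then wait for all.
    val futures = records.map(r => Future(callService(r))).toList
    Await.result(Future.sequence(futures), 10.minutes) // timeout is an assumption
  } finally {
    pool.shutdown() // clean up the pool before the partition function returns
  }
}
```

You would then call it from the streaming job with something like rdd.foreachPartition { it => processPartition(it); () }. For the singleton variant, the pool would instead live in a Scala object (initialized once per executor JVM) and would not be shut down at the end of each partition.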

On Mon, Oct 5, 2015 at 6:07 PM, Ashish Soni <asoni.le...@gmail.com> wrote:

> Need more details but you might want to filter the data first (create
> multiple RDD) and then process.
>
> > On Oct 5, 2015, at 8:35 PM, Chen Song <chen.song...@gmail.com> wrote:
> >
> > We have a use case with the following design in Spark Streaming.
> >
> > Within each batch,
> > * data is read and partitioned by some key
> > * forEachPartition is used to process the entire partition
> > * within each partition, there are several REST clients created to
> >   connect to different REST services
> > * for the list of records within each partition, it will call these
> >   services, each service call is independent of others; records are just
> >   pre-partitioned to make these calls more efficiently.
> >
> > I have a question
> > * Since each call is time taking and to prevent the calls to be executed
> >   sequentially, how can I parallelize the service calls within processing of
> >   each partition? Can I just use Scala future within forEachPartition (or
> >   mapPartitions)?
> >
> > Any suggestions greatly appreciated.
> >
> > Chen