Hello,
I am trying to run a Spark job that hits an external web service to get back
some information. The cluster is 1 master + 4 workers; each worker has 60 GB
RAM and 4 CPUs. The external web service is a standalone Solr server, and is
accessed using code similar to that shown below.
def getResults(keyValues: Iterator[(String, Array[String])]): Iterator[(String, String)] = {
  val solr = new HttpSolrClient()
  initializeSolrParameters(solr)
  keyValues.map(keyValue => (keyValue._1, process(solr, keyValue)))
}

myRDD.repartition(10)
  .mapPartitions(keyValues => getResults(keyValues))
The mapPartitions initializes a SolrJ client once per partition and then hits
Solr for each record in the partition via the getResults() call.
I repartitioned in the hope that this would result in 10 clients hitting
Solr simultaneously (I would like to go up to maybe 30-40 simultaneous
clients if I can). However, I counted the number of open connections by
running netstat -anp | grep ":8983.*ESTABLISHED" in a loop on the Solr box
and observed that Solr sees a constant 4 connections (i.e., equal to the
number of workers) over the lifetime of the run.
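
To get a better view from the Spark side, I was also thinking of logging the
partition id from inside the partition function, along the lines below (just
a sketch wrapping the getResults() above, not something I have run yet):

import org.apache.spark.TaskContext

def getResultsLogged(keyValues: Iterator[(String, Array[String])]): Iterator[(String, String)] = {
  // Log which partition this task is handling; if a worker really runs
  // several tasks in parallel, multiple partition ids should be active at
  // the same time in that worker's executor log.
  val tc = TaskContext.get()
  println(s"starting partition=${tc.partitionId()} at ${System.currentTimeMillis()}")
  getResults(keyValues)
}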
My observation leads me to believe that each worker processes a single
stream of work sequentially. However, from what I understand about how
Spark works, each worker should be able to process a number of tasks in
parallel, and repartition() is a hint for it to do so.
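
For what it is worth, this is the rough sanity check I had in mind, based on
that understanding (a sketch, assuming sc is my SparkContext):

val repartitioned = myRDD.repartition(10)
// I expect the mapPartitions stage to have 10 tasks, with up to 4 of them
// running concurrently on each worker (one per CPU).
println(s"partitions after repartition: ${repartitioned.partitions.length}")
println(s"default parallelism: ${sc.defaultParallelism}")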
Is there some SparkConf setting or environment variable I should set to
increase parallelism in these workers, or should I just configure a cluster
with multiple workers per machine? Or is there something I am doing wrong?
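
For example, I was wondering whether something along these lines is the
right direction (just a sketch with placeholder values; I am not sure these
are the relevant knobs for a standalone cluster):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("solr-lookup")           // placeholder app name
  .set("spark.executor.cores", "4")    // tasks an executor may run concurrently
  .set("spark.cores.max", "16")        // total cores the app may claim on the cluster
val sc = new SparkContext(conf)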
Thank you in advance for any pointers you can provide.
-sujit