Emlyn, Have you considered using pools? http://spark.apache.org/docs/latest/job-scheduling.html#fair-scheduler-pools
I haven't tried it myself, but the pool setting is applied per thread, so it should be possible to configure the fair scheduler so that more than one job runs at a time. Each of them would probably end up with fewer workers, though... Hope this helps; a rough sketch of what I mean is at the bottom of this message.

--
Be well!
Jean Morozov

On Thu, Jan 21, 2016 at 3:23 PM, emlyn <em...@swiftkey.com> wrote:

> Thanks for the responses (not sure why they aren't showing up on the list).
>
> Michael wrote:
> > The JDBC wrapper for Redshift should allow you to follow these
> > instructions. Let me know if you run into any more issues.
> > http://apache-spark-user-list.1001560.n3.nabble.com/best-practices-for-pushing-an-RDD-into-a-database-td2681.html
>
> I'm not sure that this solves my problem - if I understand it correctly,
> this is to split a database write over multiple concurrent connections
> (one from each partition), whereas what I want is to allow other tasks to
> continue running on the cluster while the write to Redshift is taking
> place.
> Also, I don't think it's good practice to load data into Redshift with
> INSERT statements over JDBC - it is recommended to use the bulk load
> commands, which can analyse the data and automatically set appropriate
> compression etc. on the table.
>
> Rajesh wrote:
> > Just a thought. Can we use Spark Job Server and trigger jobs through
> > REST APIs? In this case, all jobs will share the same context and run
> > in parallel.
> > If anyone has other thoughts please share.
>
> I'm not sure this would work in my case, as they are not completely
> separate jobs, just different outputs to Redshift that share intermediate
> results. Running them as completely separate jobs would mean recalculating
> the intermediate results for each output. I suppose it might be possible
> to persist the intermediate results somewhere and then delete them once
> all the jobs have run, but that starts to add a lot of complication which
> I'm not sure is justified.
>
> Maybe some pseudocode will help clarify things, so here is a very
> simplified view of our Spark application:
>
> // load and transform data, then cache the result
> df1 = transform1(sqlCtx.read().options(...).parquet("path/to/data"))
> df1.cache()
>
> // perform some further transforms of the cached data
> df2 = transform2(df1)
> df3 = transform3(df1)
>
> // write the final data out to Redshift
> df2.write().options(...).format("com.databricks.spark.redshift").save()
> df3.write().options(...).format("com.databricks.spark.redshift").save()
>
> When the application runs, the steps are executed in the following order:
> - scan the parquet folder
> - transform1 executes
> - df1 is stored in the cache
> - transform2 executes
> - df2 is written to Redshift (while the rest of the cluster sits idle)
> - transform3 executes
> - df3 is written to Redshift
>
> I would like transform3 to begin executing as soon as the cluster has
> capacity, without having to wait for df2 to be written to Redshift, so I
> tried rewriting the last two lines as (again pseudocode):
>
> f1 = future{df2.write().options(...).format("com.databricks.spark.redshift").save()}.execute()
> f2 = future{df3.write().options(...).format("com.databricks.spark.redshift").save()}.execute()
> f1.get()
> f2.get()
>
> in the hope that the first write would no longer block the following
> steps, but instead it fails with a TimeoutException (see stack trace in
> previous message). Is there a way to start the different writes
> concurrently, or is that not possible in Spark?
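Here is roughly what I had in mind - completely untested, written in Scala against the 1.x APIs, and the transform functions, pool names, table names and Redshift connection options are just placeholders standing in for your pseudocode:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object ConcurrentRedshiftWrites {
  def main(args: Array[String]): Unit = {
    // Use the fair scheduler so jobs submitted from different threads share
    // the cluster instead of queueing behind each other (FIFO is the default).
    val conf = new SparkConf()
      .setAppName("concurrent-redshift-writes")
      .set("spark.scheduler.mode", "FAIR")
    val sc = new SparkContext(conf)
    val sqlCtx = new SQLContext(sc)

    // Placeholders standing in for your transform1/transform2/transform3.
    def transform1(df: DataFrame): DataFrame = df
    def transform2(df: DataFrame): DataFrame = df
    def transform3(df: DataFrame): DataFrame = df

    // Load, transform and cache the shared intermediate result.
    val df1 = transform1(sqlCtx.read.parquet("path/to/data"))
    df1.cache()

    val df2 = transform2(df1)
    val df3 = transform3(df1)

    // Run each Redshift write in its own thread and assign it to its own pool.
    // setLocalProperty is per-thread, which is why each write gets a Future.
    def writeToRedshift(df: DataFrame, pool: String, table: String): Future[Unit] = Future {
      sc.setLocalProperty("spark.scheduler.pool", pool)
      df.write
        .format("com.databricks.spark.redshift")
        .option("url", "jdbc:redshift://...")        // placeholder connection options
        .option("dbtable", table)
        .option("tempdir", "s3n://some-bucket/tmp")  // placeholder temp dir
        .save()
    }

    // Both writes are submitted immediately from separate threads/pools, so
    // one job's tasks can run while the other is still copying to Redshift.
    val f1 = writeToRedshift(df2, "pool1", "table_for_df2")
    val f2 = writeToRedshift(df3, "pool2", "table_for_df3")

    // Wait for both writes to finish before shutting down the context.
    Await.result(f1, Duration.Inf)
    Await.result(f2, Duration.Inf)

    sc.stop()
  }
}

Pools that you reference with setLocalProperty but don't define in an allocation file are created with default settings; if you want different weights or minimum shares, you would also point spark.scheduler.allocation.file at a fairscheduler.xml as described on the page above. I can't say whether this also gets rid of your TimeoutException - that may be a separate issue.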