You can also call rdd.saveAsHadoopDataset and use the DBOutputFormat that Hadoop provides: http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapred/lib/db/DBOutputFormat.html
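For example, a rough (untested) sketch of that approach — the JDBC driver, URL, credentials, table name, and columns here are placeholders, and the RDD is assumed to hold (Int, String) pairs:

import java.sql.{PreparedStatement, ResultSet}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.mapred.lib.db.{DBConfiguration, DBOutputFormat, DBWritable}
import org.apache.spark.SparkContext._

// Record type that DBOutputFormat writes to the database via an INSERT statement.
class MyRecord(var id: Int, var value: String) extends DBWritable with Serializable {
  def this() = this(0, "")
  def write(stmt: PreparedStatement): Unit = {
    stmt.setInt(1, id)
    stmt.setString(2, value)
  }
  def readFields(rs: ResultSet): Unit = {
    id = rs.getInt(1)
    value = rs.getString(2)
  }
}

val jobConf = new JobConf()
DBConfiguration.configureDB(jobConf, "org.postgresql.Driver",
  "jdbc:postgresql://dbhost:5432/mydb", "user", "password")
// Sets DBOutputFormat as the output format and declares the target table/columns.
DBOutputFormat.setOutput(jobConf, "my_table", "id", "value")
jobConf.setOutputKeyClass(classOf[MyRecord])
jobConf.setOutputValueClass(classOf[NullWritable])

// DBOutputFormat writes the *key* of each pair; the value is ignored.
rdd.map { case (id, value) => (new MyRecord(id, value), NullWritable.get()) }
   .saveAsHadoopDataset(jobConf)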
On Thu, Mar 13, 2014 at 4:17 PM, Patrick Wendell <pwend...@gmail.com> wrote:
> Hey Nicholas,
>
> The best way to do this is to do rdd.mapPartitions() and pass a
> function that will open a JDBC connection to your database and write
> the range in each partition.
>
> On the input path there is something called JDBC-RDD that is relevant:
>
> http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.JdbcRDD
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala#L73
>
> - Patrick
>
> On Thu, Mar 13, 2014 at 2:05 PM, Nicholas Chammas
> <nicholas.cham...@gmail.com> wrote:
> > My fellow welders,
> >
> > (Can we make that a thing? Let's make that a thing. :)
> >
> > I'm trying to wedge Spark into an existing model where we process and
> > transform some data and then load it into an MPP database. I know that part
> > of the sell of Spark and Shark is that you shouldn't have to copy data
> > around like this, so please bear with me. :)
> >
> > Say I have an RDD of about 10GB in size that's cached in memory. What is the
> > best/fastest way to push that data into an MPP database like Redshift? Has
> > anyone done something like this?
> >
> > I'm assuming that pushing the data straight from memory into the database is
> > much faster than writing the RDD to HDFS and then COPY-ing it from there
> > into the database.
> >
> > Is there, for example, a way to perform a bulk load into the database that
> > runs on each partition of the in-memory RDD in parallel?
> >
> > Nick
> >
> > ________________________________
> > View this message in context: best practices for pushing an RDD into a database
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
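For reference, here is a minimal sketch of the per-partition JDBC write Patrick describes above. It uses foreachPartition rather than mapPartitions since nothing needs to be returned; the JDBC URL, credentials, table, and columns are placeholders, and the RDD is assumed to hold (Int, String) pairs:

import java.sql.DriverManager

rdd.foreachPartition { partition =>
  // One JDBC connection per partition, opened on the executor that owns it.
  val conn = DriverManager.getConnection(
    "jdbc:postgresql://dbhost:5432/mydb", "user", "password")
  conn.setAutoCommit(false)
  val stmt = conn.prepareStatement("INSERT INTO my_table (id, value) VALUES (?, ?)")
  try {
    partition.foreach { case (id: Int, value: String) =>
      stmt.setInt(1, id)
      stmt.setString(2, value)
      stmt.addBatch()
    }
    // Flush the whole partition as one batch and commit.
    stmt.executeBatch()
    conn.commit()
  } finally {
    stmt.close()
    conn.close()
  }
}

Whether row-at-a-time INSERTs like this are fast enough for an MPP database is another question; a bulk-load mechanism such as Redshift's COPY may still win even with the extra hop through storage.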