You can also call rdd.saveAsHadoopDataset and use the DBOutputFormat that
Hadoop provides:
http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapred/lib/db/DBOutputFormat.html
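
Roughly, the wiring looks like this (a sketch only -- the driver class,
connection URL, table name, and columns below are placeholders for your own):

  import java.sql.{PreparedStatement, ResultSet}

  import org.apache.hadoop.io.NullWritable
  import org.apache.hadoop.mapred.JobConf
  import org.apache.hadoop.mapred.lib.db.{DBConfiguration, DBOutputFormat, DBWritable}
  import org.apache.spark.SparkContext._

  // A record type implementing DBWritable so DBOutputFormat can bind it to a row.
  case class UserRow(id: Int, name: String) extends DBWritable {
    override def write(stmt: PreparedStatement): Unit = {
      stmt.setInt(1, id)
      stmt.setString(2, name)
    }
    override def readFields(rs: ResultSet): Unit = {} // only needed when reading
  }

  val jobConf = new JobConf()
  // Placeholder driver/URL/credentials -- substitute your database's values.
  DBConfiguration.configureDB(jobConf, "org.postgresql.Driver",
    "jdbc:postgresql://dbhost:5432/mydb", "user", "password")
  // Target table "users" with columns (id, name); this also sets DBOutputFormat
  // as the output format on the JobConf.
  DBOutputFormat.setOutput(jobConf, "users", "id", "name")

  // rdd is assumed to be an RDD[(Int, String)] of (id, name) pairs.
  // DBOutputFormat writes the key, so the DBWritable record goes in the key slot.
  rdd.map { case (id, name) => (UserRow(id, name), NullWritable.get()) }
     .saveAsHadoopDataset(jobConf)

Each task opens its own connection, so the partitions load into the database in
parallel.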


On Thu, Mar 13, 2014 at 4:17 PM, Patrick Wendell <pwend...@gmail.com> wrote:

> Hey Nicholas,
>
> The best way to do this is to do rdd.mapPartitions() and pass a
> function that opens a JDBC connection to your database and writes
> out the rows of each partition.
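>
> A minimal sketch of that pattern (foreachPartition here, since we only need
> the side effect; the connection URL, table, and columns are made up):
>
>   import java.sql.DriverManager
>
>   // rdd is assumed to be an RDD[(Int, String)] of (id, name) pairs.
>   rdd.foreachPartition { rows =>
>     // One connection and one prepared statement per partition, reused per row.
>     val conn = DriverManager.getConnection(
>       "jdbc:postgresql://dbhost:5432/mydb", "user", "password")
>     val stmt = conn.prepareStatement("INSERT INTO users (id, name) VALUES (?, ?)")
>     try {
>       rows.foreach { case (id, name) =>
>         stmt.setInt(1, id)
>         stmt.setString(2, name)
>         stmt.executeUpdate()
>       }
>     } finally {
>       stmt.close()
>       conn.close()
>     }
>   }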
>
> On the input path there is something called JdbcRDD that is relevant:
>
> http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.JdbcRDD
>
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala#L73
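>
> Constructing one looks roughly like this (the query, bounds, and connection
> details below are illustrative):
>
>   import java.sql.DriverManager
>   import org.apache.spark.rdd.JdbcRDD
>
>   // The SQL must contain two '?' placeholders, which JdbcRDD fills in with
>   // each partition's lower and upper bound on the key.
>   val rows = new JdbcRDD(
>     sc,
>     () => DriverManager.getConnection(
>       "jdbc:postgresql://dbhost:5432/mydb", "user", "password"),
>     "SELECT id, name FROM users WHERE id >= ? AND id <= ?",
>     lowerBound = 1, upperBound = 1000000, numPartitions = 10,
>     mapRow = rs => (rs.getInt("id"), rs.getString("name")))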
>
> - Patrick
>
> On Thu, Mar 13, 2014 at 2:05 PM, Nicholas Chammas
> <nicholas.cham...@gmail.com> wrote:
> > My fellow welders,
> >
> > (Can we make that a thing? Let's make that a thing. :)
> >
> > I'm trying to wedge Spark into an existing model where we process and
> > transform some data and then load it into an MPP database. I know that
> > part of the sell of Spark and Shark is that you shouldn't have to copy
> > data around like this, so please bear with me. :)
> >
> > Say I have an RDD of about 10GB in size that's cached in memory. What is
> > the best/fastest way to push that data into an MPP database like
> > Redshift? Has anyone done something like this?
> >
> > I'm assuming that pushing the data straight from memory into the
> > database is much faster than writing the RDD to HDFS and then COPY-ing
> > it from there into the database.
> >
> > Is there, for example, a way to perform a bulk load into the database
> > that runs on each partition of the in-memory RDD in parallel?
> >
> > Nick
> >
>
