Hi Spark gurus,

At Medium we're using Spark for an ETL job that scans DynamoDB tables and
loads the results into Redshift. Currently I use a parallel scanner
implementation that writes files to local disk, then have Spark read those
files as a DataFrame.

Ideally we could read the DynamoDB table directly as a DataFrame, so I
started putting together a data source at
https://github.com/traviscrawford/spark-dynamodb

A few questions:

* What's the best way to incrementally build the RDD[Row] returned by
"buildScan"? Currently I make an RDD[Row] from each page of results and
union them together (rough sketch after these questions). Does this
approach seem reasonable? Any suggestions for a better way?

* Currently my stand-alone scanner creates a separate thread for each scan
segment. I could use that same approach and create threads in the Spark
driver, though ideally each scan segment would run on an executor. Any tips
on how to get the segment scanners to run on Spark executors? The second
sketch below shows roughly what I have in mind.
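
For context on the first question, here's roughly what buildScan does today,
boiled down. The paging and item-to-Row conversion are simplified
placeholders here, not the actual code in the repo:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Simplified version of how buildScan assembles the RDD[Row] today:
// each page of scan results becomes a small RDD, unioned in as we go.
def buildScanSketch(sc: SparkContext,
                    columns: Seq[String],
                    pages: Iterator[Seq[Map[String, Any]]]): RDD[Row] = {
  var result: RDD[Row] = sc.emptyRDD[Row]
  for (page <- pages) {
    // Project each item onto the requested columns so the Row fields line up.
    val rows = page.map(item => Row.fromSeq(columns.map(col => item.get(col).orNull)))
    result = result.union(sc.parallelize(rows))
  }
  result
}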
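
And for the second question, here's the kind of thing I'm imagining:
parallelize the segment ids, then run the scan inside mapPartitions so each
segment is scanned by a task on an executor rather than a driver thread.
scanSegment is just a stand-in for the per-segment paging logic from my
stand-alone scanner, not something that exists in Spark or the AWS SDK:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// One Spark task per DynamoDB scan segment; the scan itself runs on executors.
// scanSegment must be serializable since it's shipped out with the tasks.
def scanOnExecutors(sc: SparkContext,
                    tableName: String,
                    totalSegments: Int)
                   (scanSegment: (String, Int, Int) => Iterator[Row]): RDD[Row] = {
  sc.parallelize(0 until totalSegments, numSlices = totalSegments)
    .mapPartitions { segmentIds =>
      // Each partition holds exactly one segment id, so each task pages
      // through one segment lazily on the executor where it runs.
      segmentIds.flatMap(segment => scanSegment(tableName, segment, totalSegments))
    }
}

Does that look like the right pattern, or is there a better way to hand the
segments to executors?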

Thanks,
Travis
