Hi Spark gurus,

At Medium we're using Spark for an ETL job that scans DynamoDB tables and loads the data into Redshift. Currently I use a parallel scanner implementation that writes files to local disk, then have Spark read them as a DataFrame.
Ideally we could read the DynamoDB table directly as a DataFrame, so I started putting together a data source at https://github.com/traviscrawford/spark-dynamodb

A few questions:

* What's the best way to incrementally build the RDD[Row] returned by "buildScan"? Currently I make an RDD[Row] from each page of results and union them together (first sketch below). Does this approach seem reasonable? Any suggestions for a better way?

* Currently my stand-alone scanner creates a separate thread for each scan segment. I could take the same approach and create threads in the Spark driver, though ideally each scan segment would run in an executor (second sketch below). Any tips on how to get the segment scanners to run on Spark executors?
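For reference, here's a minimal sketch of the union approach, with a hypothetical scanPage() helper standing in for the actual DynamoDB call (the real exclusive start key is a map of attribute values, not a String; I've simplified it to keep the sketch short):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Union-of-pages approach: fetch each page on the driver and union small RDDs.
// scanPage is hypothetical: given an optional start key it returns one page of
// items converted to Rows, plus the key to resume from (None when the scan ends).
def buildScanByUnion(
    sc: SparkContext,
    scanPage: Option[String] => (Seq[Row], Option[String])): RDD[Row] = {
  var result: RDD[Row] = sc.emptyRDD[Row]
  var startKey: Option[String] = None
  do {
    val (rows, nextKey) = scanPage(startKey)     // one DynamoDB page, fetched on the driver
    result = result.union(sc.parallelize(rows))  // append the page as a small RDD
    startKey = nextKey
  } while (startKey.isDefined)
  result
}

One thing I'm unsure about is that each union adds partitions and lengthens the lineage; collecting the per-page RDDs in a Seq and calling sc.union(...) once at the end might be cleaner, but either way every page still flows through the driver.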
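And here's roughly what I imagine the executor-side version would look like, again with a hypothetical scanSegment() helper that creates its own DynamoDB client inside the task, scans one segment, and converts the items to Rows:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Executor-side approach: parallelize the segment numbers, one task per segment.
// scanSegment is hypothetical: it builds its own client on the executor (so nothing
// non-serializable is shipped from the driver) and streams one segment's items as Rows.
def buildScanOnExecutors(
    sc: SparkContext,
    tableName: String,
    totalSegments: Int,
    scanSegment: (String, Int, Int) => Iterator[Row]): RDD[Row] = {
  sc.parallelize(0 until totalSegments, totalSegments)  // one partition per scan segment
    .flatMap(segment => scanSegment(tableName, segment, totalSegments))
}

I assume a custom RDD with one partition per segment (along the lines of JdbcRDD) would be the more idiomatic way to do this inside buildScan, but the parallelize version shows the idea.

Thanks,
Travis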