Hi Reynold,

Thanks for the tips. I made some changes based on your suggestion, and now the table scan happens on executors.

https://github.com/traviscrawford/spark-dynamodb/blob/master/src/main/scala/com/github/traviscrawford/spark/dynamodb/DynamoDBRelation.scala
sqlContext.sparkContext
  .parallelize(scanConfigs, scanConfigs.length)
  .flatMap(DynamoDBRelation.scan)

When scanning a DynamoDB table, you create one or more scan requests, which I'll call "segments." Each segment provides an iterator over pages, and each page contains a collection of items. The actual network transfer happens when fetching the next page; at that point you can iterate over its items in memory.

Based on your feedback, I now parallelize a collection of configs that describe each segment to scan, then in flatMap create the scanner and fetch all of its items. (There's a rough end-to-end sketch of this in the P.S. below the quoted thread.)

I'm pointed enough in the right direction to finish this up.

Thanks,
Travis

On Wed, Apr 13, 2016 at 10:40 AM Reynold Xin <r...@databricks.com> wrote:

> Responses inline
>
> On Wed, Apr 13, 2016 at 7:45 AM, Travis Crawford <traviscrawf...@gmail.com> wrote:
>
>> Hi Spark gurus,
>>
>> At Medium we're using Spark for an ETL job that scans DynamoDB tables
>> and loads into Redshift. Currently I use a parallel scanner implementation
>> that writes files to local disk, then have Spark read them as a DataFrame.
>>
>> Ideally we could read the DynamoDB table directly as a DataFrame, so I
>> started putting together a data source at
>> https://github.com/traviscrawford/spark-dynamodb
>>
>> A few questions:
>>
>> * What's the best way to incrementally build the RDD[Row] returned by
>> "buildScan"? Currently I make an RDD[Row] from each page of results and
>> union them together. Does this approach seem reasonable? Any suggestions
>> for a better way?
>
> If the number of pages can be high (e.g. > 100), it is best to avoid
> using union. The simpler way is ...
>
> val pages = ...
> sc.parallelize(pages, pages.size).flatMap { page =>
>   ...
> }
>
> The above creates a task per page.
>
> Looking at your code, you are relying on Spark's JSON inference to read
> the JSON data. You would need a different approach there in order to
> parallelize this. Right now you are bringing all the data into the driver
> and then sending it out.
>
>> * Currently my stand-alone scanner creates separate threads for each
>> scan segment. I could use that same approach and create threads in the
>> Spark driver, though ideally each scan segment would run in an executor.
>> Any tips on how to get the segment scanners to run on Spark executors?
>
> I'm not too familiar with Dynamo. Is a segment different from the page
> above?
>
>> Thanks,
>> Travis
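P.S. For anyone following along on the list, here's a rough, self-contained sketch of the per-segment pattern described above. It is not the code in the repo (that's at the link up top): SegmentScanner, SegmentConfig, the table name, and the segment count are all made up for illustration, and it assumes the AWS Java SDK's DynamoDB Document API with the default credential chain.

import scala.collection.JavaConverters._

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.amazonaws.services.dynamodbv2.document.DynamoDB
import com.amazonaws.services.dynamodbv2.document.spec.ScanSpec
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical, serializable description of a single scan segment.
case class SegmentConfig(tableName: String, segment: Int, totalSegments: Int)

object SegmentScanner {
  // Runs on an executor: scan one segment and return each item as a JSON
  // string. The item collection pages lazily, so the network transfer
  // happens as each page is fetched; items within a page are then
  // iterated in memory.
  def scan(config: SegmentConfig): Iterator[String] = {
    val client = AmazonDynamoDBClientBuilder.standard().build() // default credentials
    val table = new DynamoDB(client).getTable(config.tableName)

    val spec = new ScanSpec()
      .withTotalSegments(config.totalSegments)
      .withSegment(config.segment)

    table.scan(spec).pages().iterator().asScala
      .flatMap(_.iterator().asScala)
      .map(_.toJSON)
  }
}

object SegmentScanExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("dynamodb-segment-scan")
      .setIfMissing("spark.master", "local[*]")
    val sc = new SparkContext(conf)

    val totalSegments = 4 // arbitrary for the sketch
    val scanConfigs = (0 until totalSegments)
      .map(i => SegmentConfig("my_table", i, totalSegments))

    // One config per segment and one task per config, so each segment is
    // scanned entirely on the executor where its task runs.
    val items = sc
      .parallelize(scanConfigs, scanConfigs.length)
      .flatMap(SegmentScanner.scan)

    println(s"Scanned ${items.count()} items")
    sc.stop()
  }
}

Two notes on the shape of this: keeping the scan function in its own object (the way DynamoDBRelation.scan is used above) keeps the closure small and serializable, and constructing the DynamoDB client inside the function means it is created on the executor rather than shipped from the driver.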