I know it isn't exactly what you are asking for, but you could solve it like this:

The driver program queries Dynamo for the S3 file keys, calls sc.textFile on each of the file keys, and .unions the results together to make your RDD. You could wrap that up in a function, and it wouldn't be too painful to reuse.

I don't personally know about creating custom RDDs in Java.

On Mon, May 25, 2015 at 10:37 PM, Swaranga Sarma <[email protected]> wrote:

> My data is in S3 and is indexed in Dynamo. For example, if I want to load
> data given a time range, I will first need to query Dynamo for the S3 file
> keys for the corresponding time range and then load them in Spark. The
> files may not always be in the same S3 path prefix, hence
> sc.textFile("s3://directory_path/") won't work. I am looking for pointers
> on how to implement something analogous to HadoopRDD or JdbcRDD but in Java.
>
> I am looking to do something similar to what they have done here:
> https://github.com/lagerspetz/TimeSeriesSpark/blob/master/src/spark/timeseries/dynamodb/DynamoDbRDD.scala.
> This one reads data from Dynamo; my custom RDD would query DynamoDB for the
> S3 file keys, and then load them from S3.
>
> On Mon, May 25, 2015 at 8:19 PM, Alex Robbins <
> [email protected]> wrote:
>
>> If a Hadoop InputFormat already exists for your data source, you can load
>> it from there. Otherwise, maybe you can dump your data source out as text
>> and load it from there. Without more detail on what your data source is,
>> it'll be hard for anyone to help.
>>
>> On Mon, May 25, 2015 at 5:00 PM, swaranga <[email protected]>
>> wrote:
>>
>>> Hello,
>>>
>>> I have a custom data source and I want to load the data into Spark to
>>> perform some computations. For this I see that I might need to implement
>>> a new RDD for my data source.
>>>
>>> I am a complete Scala noob and I am hoping that I can implement the RDD
>>> in Java only. I looked around the internet and could not find any
>>> resources. Any pointers?
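For what it's worth, the "wrap that up in a function" idea from the reply at the top of this thread could be sketched roughly as below. This is only a sketch under assumptions: UnionByKey and loadAll are hypothetical names, not part of Spark, and the loader and union steps are passed in as functions so the example runs with plain lists standing in for sc.textFile and JavaRDD.union. The Dynamo query is not shown; the keys are hard-coded placeholders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Hypothetical helper (not a Spark class): loads every key and folds the results with a union.
final class UnionByKey {
    private UnionByKey() {}

    // R is the dataset type: JavaRDD<String> with Spark, List<String> in the stand-in below.
    static <R> R loadAll(List<String> keys, Function<String, R> load,
                         BinaryOperator<R> union, R empty) {
        R result = empty;
        for (String key : keys) {
            result = union.apply(result, load.apply(key));
        }
        return result;
    }

    public static void main(String[] args) {
        // Placeholder keys; in the real driver these would come from the Dynamo index query.
        List<String> keys = Arrays.asList("s3://bucket-a/part-1", "s3://bucket-b/part-7");

        // Stand-ins for sc.textFile and JavaRDD.union so the sketch runs without a cluster.
        Function<String, List<String>> load = key -> Arrays.asList("line from " + key);
        BinaryOperator<List<String>> union = (a, b) -> {
            List<String> merged = new ArrayList<>(a);
            merged.addAll(b);
            return merged;
        };

        System.out.println(loadAll(keys, load, union, new ArrayList<>()));
    }
}
```

With Spark on the classpath and sc being a JavaSparkContext, the call would presumably look something like UnionByKey.<JavaRDD<String>>loadAll(keys, sc::textFile, JavaRDD::union, sc.emptyRDD()), though I have not run that form against a cluster. Since textFile also accepts a comma-separated list of paths, sc.textFile(String.join(",", keys)) may be a simpler alternative when there are many keys.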
