I'm not sure if this is what you're looking for, but we have several custom RDD implementations for internal data formats/partitioning schemes.
The Spark API is really simple here and primarily consists of implementing three things:

1) A class that extends RDD. It must be lightweight because it gets serialized out to the machines in the cluster (so it shouldn't actually contain the data, for example).

2) That class needs to implement getPartitions() to return an array of serializable Partition instances.

3) That class needs to implement compute(Partition, TaskContext), which will (likely) be executed on a deserialized copy of your RDD class with a deserialized instance of one of the partitions returned from getPartitions(), and must return an Iterator over the actual data within that partition. A minimal sketch follows below.
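
To make those three points concrete, here is a minimal sketch of a custom data source RDD. It is not one of our internal implementations; the names (RangePartition, RangeRDD, numSlices) are made up for illustration. Each record is just a Long in [start, end), so the RDD itself only carries the range bounds (lightweight metadata), and the partitions only carry their sub-range:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Partitions are small, Serializable descriptors of where the data lives,
// not the data itself.
class RangePartition(override val index: Int, val from: Long, val until: Long)
  extends Partition

// 1) The RDD class: cheap to serialize, holds only the bounds and slice count.
class RangeRDD(sc: SparkContext, start: Long, end: Long, numSlices: Int)
  extends RDD[Long](sc, Nil) {   // Nil = no parent RDDs; this is a leaf data source

  // 2) Split the overall range into serializable Partition descriptors.
  override def getPartitions: Array[Partition] = {
    val total = end - start
    (0 until numSlices).map { i =>
      val from  = start + i * total / numSlices
      val until = start + (i + 1) * total / numSlices
      new RangePartition(i, from, until): Partition
    }.toArray
  }

  // 3) Runs on an executor with one deserialized partition descriptor and
  //    returns an Iterator over that partition's actual records.
  override def compute(split: Partition, context: TaskContext): Iterator[Long] = {
    val p = split.asInstanceOf[RangePartition]
    (p.from until p.until).iterator
  }
}

You would use it like any other RDD, e.g. new RangeRDD(sc, 0L, 100L, 4).collect(). For a real data source, the Partition would typically carry something like a file path, offset, or shard id, and compute() would open that resource and iterate over its records.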