I'm not sure if this is what you're looking for, but we have several custom
RDD implementations for internal data formats/partitioning schemes.

The Spark API here is really simple and consists primarily of implementing
three things:

1) You need a class that extends RDD. It should be lightweight because it
gets serialized out to the machines in the cluster (so, for example, it
shouldn't actually contain the data).
2) That class needs to implement getPartitions() to generate an array of
serializable Partition instances.
3) That class needs to implement compute(Partition p, TaskContext t), which
will (likely) be executed on a deserialized copy of your RDD class, be handed
a deserialized instance of one of the partitions returned from
getPartitions(), and must return an Iterator over the actual data within
that partition (see the sketch after this list).
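For illustration, here's a minimal sketch of what that can look like in
Scala. It assumes a made-up "record file" format split into numbered chunks;
RecordFilePartition, RecordFileRDD, and readChunk are names I've invented for
the example, not part of the Spark API:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Each partition is a small, serializable description of one chunk --
// just enough for an executor to locate and read the data itself.
class RecordFilePartition(override val index: Int, val path: String)
  extends Partition

class RecordFileRDD(sc: SparkContext, path: String, numChunks: Int)
  extends RDD[String](sc, Nil) {  // Nil: no parent RDDs

  // Called on the driver; returns lightweight partition descriptors only.
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numChunks)(i => new RecordFilePartition(i, path))

  // Called on an executor with a deserialized partition; returns an
  // iterator over the actual records in that chunk.
  override def compute(split: Partition, context: TaskContext): Iterator[String] = {
    val p = split.asInstanceOf[RecordFilePartition]
    readChunk(p.path, p.index)  // hypothetical reader for one chunk
  }

  // Placeholder for the format-specific reading logic.
  private def readChunk(path: String, chunk: Int): Iterator[String] =
    Iterator(s"record from $path, chunk $chunk")
}

After that, something like new RecordFileRDD(sc, "/data/records", 4).collect()
behaves like any other RDD: Spark serializes the RDD and its partitions to
the executors and calls compute() once per partition.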




