Thank you guys, I'll read the examples and give it a try.

On Fri, Jun 26, 2015 at 2:47 AM, jimfcarroll <jimfcarr...@gmail.com> wrote:
> I'm not sure if this is what you're looking for, but we have several
> custom RDD implementations for internal data formats/partitioning schemes.
>
> The Spark API is really simple and consists primarily of implementing
> three things:
>
> 1) You need a class that extends RDD. It must be lightweight because it
> is serialized to machines on the cluster (so it shouldn't actually contain
> the data, for example).
> 2) That class needs to implement getPartitions() to return an array of
> serializable Partition instances.
> 3) That class needs to implement compute(Partition p, TaskContext t),
> which will (likely) be executed on a deserialized copy of your RDD class,
> given a deserialized instance of one of the partitions returned from
> getPartitions(). It must return an Iterator over the actual data within
> that partition.
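For anyone else following along, here is a minimal sketch in Scala of the three pieces Jim describes, assuming Spark's RDD API. The RangeRDD/RangePartition names and the integer-range "data source" are made up purely for illustration, not an actual internal format:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// (2) A serializable Partition describing one slice of the data.
// Here each partition just covers a hypothetical range of ints.
class RangePartition(val index: Int, val start: Int, val end: Int)
  extends Partition

// (1) A lightweight RDD: it holds only a description of the data
// (total size, partition count), never the data itself, so it stays
// cheap to serialize to the cluster.
class RangeRDD(sc: SparkContext, total: Int, numPartitions: Int)
  extends RDD[Int](sc, Nil) {

  override protected def getPartitions: Array[Partition] = {
    val per = math.ceil(total.toDouble / numPartitions).toInt
    (0 until numPartitions).map { i =>
      new RangePartition(i, i * per, math.min((i + 1) * per, total))
    }.toArray
  }

  // (3) Runs on an executor against deserialized copies of the RDD and
  // the partition; returns an iterator over that partition's actual data.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }
}

// Usage, given a SparkContext sc:
//   new RangeRDD(sc, total = 100, numPartitions = 4).collect()

In a real data source, compute() would open the file/connection for the given partition instead of generating numbers.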