Thank you guys, I'll read the examples and give it a try.

On Fri, Jun 26, 2015 at 2:47 AM, jimfcarroll <jimfcarr...@gmail.com> wrote:

>
> I'm not sure if this is what you're looking for but we have several custom
> RDD implementations for internal data format/partitioning schemes.
>
> The Spark API is really simple and consists primarily of
> implementing three things (a sketch follows the list):
>
> 1) You need a class that extends RDD and is lightweight, because it
> gets serialized to machines on the cluster (therefore it shouldn't
> actually contain the data, for example).
> 2) That class needs to implement getPartitions() to generate an array of
> serializable Partition instances.
> 3) That class needs to implement compute(split: Partition, context:
> TaskContext), which will (likely) be executed on a deserialized copy
> of your RDD, be handed a deserialized instance of one of the
> partitions returned from getPartitions(), and must return an
> Iterator over the actual data within the partition.
>
