It can reuse the same object across calls. That's a good point and we should document it in the API contract.
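In caller terms it would look something like the minimal Java sketch below. Only get() is named in this thread; next(), the copy step, and the surrounding types are illustrative assumptions, not the actual API from the pull request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Sketch of what "get() may reuse" means for callers. Only get() is named
// in this thread; next() and the copy function are illustrative.
interface ReusingRecordReader<T> {
  boolean next(); // assumed: advance to the next record, false at end
  T get();        // may mutate and return the same instance on every call
}

final class ReuseAwareCaller {
  // Callers that retain a record past the next() call must copy it first.
  static <T> List<T> collectAll(ReusingRecordReader<T> reader, UnaryOperator<T> copy) {
    List<T> retained = new ArrayList<>();
    while (reader.next()) {
      retained.add(copy.apply(reader.get())); // clone before retaining
    }
    return retained;
  }
}
```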
On Tue, Apr 21, 2015 at 4:06 PM, Punyashloka Biswal <punya.bis...@gmail.com> wrote:

> Reynold, thanks for this! At Palantir we're heavy users of the Java APIs
> and appreciate being able to stop hacking around with fake ClassTags :)
>
> Regarding this specific proposal, is the contract of RecordReader#get
> intended to be that it returns a fresh object each time? Or is it allowed
> to mutate a fixed object and return a pointer to it each time?
>
> Put another way, is a caller supposed to clone the output of get() if they
> want to use it later?
>
> Punya
>
> On Tue, Apr 21, 2015 at 4:35 PM Reynold Xin <r...@databricks.com> wrote:
>
>> I created a pull request last night for a new InputSource API that is
>> essentially a stripped-down version of the RDD API for providing data
>> into Spark. It would be great to hear the community's feedback.
>>
>> Spark currently has two de facto input source APIs:
>> 1. RDD
>> 2. Hadoop MapReduce InputFormat
>>
>> Neither of the above is ideal:
>>
>> 1. RDD: It is hard for Java developers to implement RDD, given the
>> implicit class tags. In addition, the RDD API depends on Scala's runtime
>> library, which does not preserve binary compatibility across Scala
>> versions. If a developer chooses Java to implement an input source, it
>> would be great if that input source could remain binary compatible for
>> years to come.
>>
>> 2. Hadoop InputFormat: The Hadoop InputFormat API is overly restrictive.
>> For example, it forces key-value semantics and does not support running
>> arbitrary code on the driver side (broadcast is one example of why this
>> is useful). In addition, it is somewhat awkward to tell developers that
>> in order to implement an input source for Spark, they should first learn
>> the Hadoop MapReduce API.
>>
>> My patch creates a new InputSource interface, described by:
>>
>> - an array of InputPartition that specifies the data partitioning
>> - a RecordReader that specifies how data on each partition can be read
>>
>> This interface is similar to Hadoop's InputFormat, except that there is
>> no explicit key/value separation.
>>
>> JIRA ticket: https://issues.apache.org/jira/browse/SPARK-7025
>> Pull request: https://github.com/apache/spark/pull/5603
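For anyone skimming the thread, here is a rough Java sketch of the shapes the proposal describes. InputSource, InputPartition, and RecordReader are the names from the email above; the method names are illustrative guesses, so see the pull request for the actual API.

```java
import java.io.Closeable;

// Rough sketch of the proposed shapes; method names are assumptions.
interface InputPartition {
  // Carries whatever per-partition state (file split, offset range, ...)
  // the reader needs to locate its slice of the data.
}

interface RecordReader<T> extends Closeable {
  boolean next(); // advance; false once the partition is exhausted
  T get();        // current record; no key/value split, unlike Hadoop
}

interface InputSource<T> {
  InputPartition[] getPartitions();                       // the data partitioning
  RecordReader<T> createReader(InputPartition partition); // per-partition reads
}
```

Note how this keeps Hadoop InputFormat's two-level structure (partitions plus a per-partition reader) while dropping the forced key-value pairing, and how plain Java interfaces avoid the ClassTag and Scala binary-compatibility issues the proposal calls out.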