I'm also super interested in this. Flambo (our Clojure DSL) wraps the Java API, and it would be great to have this.
On Tue, Apr 21, 2015 at 4:10 PM, Reynold Xin <r...@databricks.com> wrote:

> It can reuse. That's a good point and we should document it in the API
> contract.
>
> On Tue, Apr 21, 2015 at 4:06 PM, Punyashloka Biswal <punya.bis...@gmail.com>
> wrote:
>
> > Reynold, thanks for this! At Palantir we're heavy users of the Java APIs
> > and appreciate being able to stop hacking around with fake ClassTags :)
> >
> > Regarding this specific proposal, is the contract of RecordReader#get
> > intended to be that it returns a fresh object each time? Or is it allowed
> > to mutate a fixed object and return a pointer to it each time?
> >
> > Put another way, is a caller supposed to clone the output of get() if they
> > want to use it later?
> >
> > Punya
> >
> > On Tue, Apr 21, 2015 at 4:35 PM Reynold Xin <r...@databricks.com> wrote:
> >
> >> I created a pull request last night for a new InputSource API that is
> >> essentially a stripped-down version of the RDD API for providing data
> >> into Spark. Would be great to hear the community's feedback.
> >>
> >> Spark currently has two de facto input source APIs:
> >> 1. RDD
> >> 2. Hadoop MapReduce InputFormat
> >>
> >> Neither of the above is ideal:
> >>
> >> 1. RDD: It is hard for Java developers to implement RDD, given the
> >> implicit class tags. In addition, the RDD API depends on Scala's runtime
> >> library, which does not preserve binary compatibility across Scala
> >> versions. If a developer chooses Java to implement an input source, it
> >> would be great if that input source could remain binary compatible for
> >> years to come.
> >>
> >> 2. Hadoop InputFormat: The Hadoop InputFormat API is overly restrictive.
> >> For example, it forces key-value semantics and does not support running
> >> arbitrary code on the driver side (broadcast is one example of why this
> >> is useful). In addition, it is somewhat awkward to tell developers that
> >> in order to implement an input source for Spark, they should learn the
> >> Hadoop MapReduce API first.
> >>
> >> My patch creates a new InputSource interface, described by:
> >>
> >> - an array of InputPartition that specifies the data partitioning
> >> - a RecordReader that specifies how data on each partition can be read
> >>
> >> This interface is similar to Hadoop's InputFormat, except that there is
> >> no explicit key/value separation.
> >>
> >> JIRA ticket: https://issues.apache.org/jira/browse/SPARK-7025
> >> Pull request: https://github.com/apache/spark/pull/5603
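
For anyone skimming the thread, here is a minimal sketch of the shape I
understand the proposal to have, based purely on the description above. The
names and signatures are my guesses from the email, not the actual code in
the pull request:

import java.io.Closeable;
import java.io.IOException;
import java.io.Serializable;

// One partition of the input data; shipped to executors, hence Serializable.
interface InputPartition extends Serializable {
  // Hypothetical locality hint, analogous to Hadoop's InputSplit#getLocations.
  String[] preferredLocations();
}

// Reads one partition's records; note there is no key/value split.
interface RecordReader<T> extends Closeable {
  boolean next() throws IOException; // advance; false when exhausted
  T get();                           // current record; may reuse one object
}

// The source itself: a partitioning plus a reader factory.
interface InputSource<T> extends Serializable {
  InputPartition[] getPartitions();
  RecordReader<T> createRecordReader(InputPartition partition) throws IOException;
}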
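
And given Reynold's answer that get() is allowed to reuse a single mutable
object, a caller that retains records has to copy them before the next call
to next(). A usage sketch against the hypothetical interfaces above, where
copy is whatever deep-copy function makes sense for the record type:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

class ReaderClient {
  // Collects every record of one partition. Each record is defensively
  // copied because get() may return the same mutated object on every call.
  static <T> List<T> readAll(InputSource<T> source, InputPartition partition,
                             UnaryOperator<T> copy) throws IOException {
    List<T> retained = new ArrayList<>();
    try (RecordReader<T> reader = source.createRecordReader(partition)) {
      while (reader.next()) {
        retained.add(copy.apply(reader.get())); // copy before storing
      }
    }
    return retained;
  }
}

This is the same object-reuse pattern Hadoop RecordReader users already know
from Writable reuse, so documenting it in the API contract, as Reynold
suggests, seems like the right call.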