I'm also super interested in this. Flambo (our Clojure DSL) wraps the Java API, and it would be great to have this.
On Tue, Apr 21, 2015 at 4:10 PM, Reynold Xin <r...@databricks.com> wrote:

> It can reuse. That's a good point and we should document it in the API
> contract.
>
> On Tue, Apr 21, 2015 at 4:06 PM, Punyashloka Biswal <punya.bis...@gmail.com>
> wrote:
>
> > Reynold, thanks for this! At Palantir we're heavy users of the Java APIs
> > and appreciate being able to stop hacking around with fake ClassTags :)
> >
> > Regarding this specific proposal, is the contract of RecordReader#get
> > intended to be that it returns a fresh object each time? Or is it allowed
> > to mutate a fixed object and return a pointer to it each time?
> >
> > Put another way, is a caller supposed to clone the output of get() if they
> > want to use it later?
> >
> > Punya
> >
> > On Tue, Apr 21, 2015 at 4:35 PM Reynold Xin <r...@databricks.com> wrote:
> >
> >> I created a pull request last night for a new InputSource API that is
> >> essentially a stripped-down version of the RDD API for providing data
> >> into Spark. Would be great to hear the community's feedback.
> >>
> >> Spark currently has two de facto input source APIs:
> >> 1. RDD
> >> 2. Hadoop MapReduce InputFormat
> >>
> >> Neither of the above is ideal:
> >>
> >> 1. RDD: It is hard for Java developers to implement RDD, given the
> >> implicit class tags. In addition, the RDD API depends on Scala's runtime
> >> library, which does not preserve binary compatibility across Scala
> >> versions. If a developer chooses Java to implement an input source, it
> >> would be great if that input source could remain binary compatible for
> >> years to come.
> >>
> >> 2. Hadoop InputFormat: The Hadoop InputFormat API is overly restrictive.
> >> For example, it forces key-value semantics and does not support running
> >> arbitrary code on the driver side (broadcast is one example of why this
> >> is useful). In addition, it is somewhat awkward to tell developers that
> >> in order to implement an input source for Spark, they should learn the
> >> Hadoop MapReduce API first.
> >>
> >> My patch creates a new InputSource interface, described by:
> >>
> >> - an array of InputPartition that specifies the data partitioning
> >> - a RecordReader that specifies how data on each partition can be read
> >>
> >> This interface is similar to Hadoop's InputFormat, except that there is
> >> no explicit key/value separation.
> >>
> >> JIRA ticket: https://issues.apache.org/jira/browse/SPARK-7025
> >> Pull request: https://github.com/apache/spark/pull/5603
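
For anyone skimming the thread, here is a minimal sketch of the shape I
understand the proposal to have, based purely on the description above. The
names and signatures are my guesses from the email, not the actual code in
the pull request:

import java.io.Closeable;
import java.io.IOException;
import java.io.Serializable;

// One partition of the input data; shipped to executors, hence Serializable.
interface InputPartition extends Serializable {
  // Hypothetical locality hint, analogous to Hadoop's InputSplit#getLocations.
  String[] preferredLocations();
}

// Reads one partition's records; note there is no key/value split.
interface RecordReader<T> extends Closeable {
  boolean next() throws IOException; // advance; false when exhausted
  T get();                           // current record; may reuse one object
}

// The source itself: a partitioning plus a reader factory.
interface InputSource<T> extends Serializable {
  InputPartition[] getPartitions();
  RecordReader<T> createRecordReader(InputPartition partition) throws IOException;
}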
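
And given Reynold's answer that get() is allowed to reuse a single mutable
object, a caller that retains records has to copy them before the next call
to next(). A usage sketch against the hypothetical interfaces above, where
copy is whatever deep-copy function makes sense for the record type:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

class ReaderClient {
  // Collects every record of one partition. Each record is defensively
  // copied because get() may return the same mutated object on every call.
  static <T> List<T> readAll(InputSource<T> source, InputPartition partition,
                             UnaryOperator<T> copy) throws IOException {
    List<T> retained = new ArrayList<>();
    try (RecordReader<T> reader = source.createRecordReader(partition)) {
      while (reader.next()) {
        retained.add(copy.apply(reader.get())); // copy before storing
      }
    }
    return retained;
  }
}

This is the same object-reuse pattern Hadoop RecordReader users already know
from Writable reuse, so documenting it in the API contract, as Reynold
suggests, seems like the right call.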