It can reuse the object. That's a good point, and we should document it in
the API contract.
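
In other words, callers who want to keep a record around should copy it
before the next read. Roughly like this (a sketch only: reader, MyRecord,
and the next()/get() iteration style are stand-ins here, since only
RecordReader#get itself is in the proposal):

    List<MyRecord> retained = new ArrayList<>();
    while (reader.next()) {
      // get() may hand back the same mutated object on every call,
      // so take a defensive copy before holding a reference to it.
      retained.add(new MyRecord(reader.get()));
    }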


On Tue, Apr 21, 2015 at 4:06 PM, Punyashloka Biswal <punya.bis...@gmail.com>
wrote:

> Reynold, thanks for this! At Palantir we're heavy users of the Java APIs
> and appreciate being able to stop hacking around with fake ClassTags :)
>
> Regarding this specific proposal, is the contract of RecordReader#get
> intended to be that it returns a fresh object each time? Or is it allowed
> to mutate a fixed object and return a pointer to it each time?
>
> Put another way, are callers supposed to clone the output of get() if they
> want to use it later?
>
> Punya
>
> On Tue, Apr 21, 2015 at 4:35 PM Reynold Xin <r...@databricks.com> wrote:
>
>> I created a pull request last night for a new InputSource API that is
>> essentially a stripped-down version of the RDD API for providing data into
>> Spark. It would be great to hear the community's feedback.
>>
>> Spark currently has two de facto input source APIs:
>> 1. RDD
>> 2. Hadoop MapReduce InputFormat
>>
>> Neither of the above is ideal:
>>
>> 1. RDD: It is hard for Java developers to implement RDD, given the
>> implicit class tags. In addition, the RDD API depends on Scala's runtime
>> library, which does not preserve binary compatibility across Scala
>> versions. If a developer chooses Java to implement an input source, it
>> would be great if that input source could remain binary compatible for
>> years to come.
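>>
>> (To illustrate the class-tag pain: a Java implementor today typically
>> ends up manufacturing a ClassTag by hand, along these lines, where MyRDD
>> and MyRecord are placeholders:
>>
>>   import scala.reflect.ClassTag;
>>   import scala.reflect.ClassTag$;
>>
>>   // Conjure a ClassTag from Java just to satisfy RDD's constructor.
>>   ClassTag<MyRecord> tag = ClassTag$.MODULE$.apply(MyRecord.class);
>>   RDD<MyRecord> rdd = new MyRDD(sparkContext, tag);
>>
>> It works, but it reaches into Scala runtime internals from Java code.)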
>>
>> 2. Hadoop InputFormat: The Hadoop InputFormat API is overly restrictive.
>> For example, it forces key-value semantics and does not support running
>> arbitrary code on the driver side (broadcast variables are one example of
>> why that is useful). In addition, it is somewhat awkward to tell
>> developers that in order to implement an input source for Spark, they
>> should first learn the Hadoop MapReduce API.
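>>
>> (To see the forced key-value semantics in practice: a source with no
>> natural key still has to invent one, so Java readers end up padding with
>> NullWritable. A minimal, illustrative reader over hard-coded stand-in
>> data:
>>
>>   import org.apache.hadoop.io.NullWritable;
>>   import org.apache.hadoop.io.Text;
>>   import org.apache.hadoop.mapreduce.InputSplit;
>>   import org.apache.hadoop.mapreduce.RecordReader;
>>   import org.apache.hadoop.mapreduce.TaskAttemptContext;
>>
>>   // The key slot is dead weight; only the value carries information.
>>   public class LinesReader extends RecordReader<NullWritable, Text> {
>>     private final String[] lines = {"a", "b", "c"}; // stand-in data
>>     private int pos = -1;
>>
>>     @Override public void initialize(InputSplit split, TaskAttemptContext ctx) {}
>>     @Override public boolean nextKeyValue() { return ++pos < lines.length; }
>>     @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
>>     @Override public Text getCurrentValue() { return new Text(lines[pos]); }
>>     @Override public float getProgress() { return (pos + 1) / (float) lines.length; }
>>     @Override public void close() {}
>>   }
>>
>> None of the key plumbing does anything for a keyless source.)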
>>
>>
>> My patch creates a new InputSource interface, described by:
>>
>> - an array of InputPartition that specifies the data partitioning
>> - a RecordReader that specifies how data on each partition can be read
>>
>> This interface is similar to Hadoop's InputFormat, except that there is no
>> explicit key/value separation.
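>>
>> To make that concrete, here is a rough sketch of the shape in Java. Only
>> the names InputSource, InputPartition, and RecordReader (and its get())
>> come from the proposal; the exact interfaces live in the pull request and
>> may differ, so I've declared stand-in versions to keep the example
>> self-contained:
>>
>>   // Stand-in interfaces modeled on the description above.
>>   interface InputPartition {}
>>
>>   interface RecordReader<T> extends java.io.Closeable {
>>     boolean next() throws java.io.IOException; // advance to the next record
>>     T get();                                   // current record; no key/value split
>>   }
>>
>>   interface InputSource<T> {
>>     InputPartition[] getPartitions();               // the data partitioning
>>     RecordReader<T> createReader(InputPartition p); // per-partition reader
>>   }
>>
>>   // Toy source that serves the numbers [0, 100) across four partitions.
>>   class RangeSource implements InputSource<Long> {
>>     static class RangePartition implements InputPartition {
>>       final long start, end;
>>       RangePartition(long start, long end) { this.start = start; this.end = end; }
>>     }
>>
>>     public InputPartition[] getPartitions() {
>>       return new InputPartition[] {
>>         new RangePartition(0, 25), new RangePartition(25, 50),
>>         new RangePartition(50, 75), new RangePartition(75, 100)
>>       };
>>     }
>>
>>     public RecordReader<Long> createReader(InputPartition p) {
>>       RangePartition rp = (RangePartition) p;
>>       return new RecordReader<Long>() {
>>         private long current = rp.start - 1;
>>         public boolean next() { return ++current < rp.end; }
>>         public Long get() { return current; }
>>         public void close() {}
>>       };
>>     }
>>   }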
>>
>>
>> JIRA ticket: https://issues.apache.org/jira/browse/SPARK-7025
>> Pull request: https://github.com/apache/spark/pull/5603
>>
>
