Sunil Kumar created SPARK-18154:
-----------------------------------
Summary: CLONE - Change Source API so that sources do not need to keep unbounded state
Key: SPARK-18154
URL: https://issues.apache.org/jira/browse/SPARK-18154
Project: Spark
Issue Type: Sub-task
Components: Streaming
Affects Versions: 2.0.0, 2.0.1
Reporter: Sunil Kumar
Assignee: Frederick Reiss
Fix For: 2.0.3, 2.1.0
The version of the Source API in Spark 2.0.0 defines a single getBatch() method
for fetching records from the source, with the following Scaladoc comments
defining the semantics:
{noformat}
/**
 * Returns the data that is between the offsets (`start`, `end`]. When `start` is `None` then
 * the batch should begin with the first available record. This method must always return the
 * same data for a particular `start` and `end` pair.
 */
def getBatch(start: Option[Offset], end: Offset): DataFrame
{noformat}
These semantics mean that a Source must retain the entire past history of the stream
that it backs. Moreover, a Source must retain this data across restarts of the process
in which the Source is instantiated, even when the Source is restarted on a different
machine.
These restrictions make it difficult to implement the Source API, as any
implementation requires potentially unbounded amounts of distributed storage.
See the mailing list thread at
[http://apache-spark-developers-list.1001551.n3.nabble.com/Source-API-requires-unbounded-distributed-storage-td18551.html]
for more information.
This JIRA will cover augmenting the Source API with an additional callback that
will allow the Structured Streaming scheduler to notify the source when it is safe
to discard buffered data.
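As a rough illustration of the idea (the method name {{commit}}, the shape of {{Offset}}, and the in-memory buffer are all assumptions for this sketch, not the final API), a source given such a callback could trim its buffer once the scheduler reports that a batch has been durably processed:

{noformat}
// Self-contained sketch; Offset and the record buffer are simplified
// stand-ins for Spark's real types.
case class Offset(value: Long)

class InMemorySource {
  // Buffered records keyed by offset. Under the 2.0.0 API this buffer can
  // only grow, because getBatch may be re-asked for any past range.
  private var buffer = Map.empty[Long, String]

  def addRecord(offset: Long, record: String): Unit =
    buffer += offset -> record

  /** Returns the records between the offsets (`start`, `end`]. */
  def getBatch(start: Option[Offset], end: Offset): Seq[String] = {
    val from = start.map(_.value).getOrElse(Long.MinValue)
    buffer.toSeq
      .filter { case (o, _) => o > from && o <= end.value }
      .sortBy { case (o, _) => o }
      .map { case (_, r) => r }
  }

  /**
   * Hypothetical callback: the scheduler signals that all offsets up to and
   * including `end` have been processed, so the source may discard them.
   */
  def commit(end: Offset): Unit =
    buffer = buffer.filter { case (o, _) => o > end.value }

  def bufferedCount: Int = buffer.size
}
{noformat}

With this extra signal the storage held by the source is bounded by the amount of data in not-yet-committed batches, rather than by the full history of the stream.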
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)