[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

Ofir Manor (JIRA) Thu, 06 Oct 2016 17:02:42 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15553613#comment-15553613
 ]


Ofir Manor commented on SPARK-15406:
------------------------------------

I see three somewhat-related issues:
1. More flexibility is generally needed in stating offsets when using a Kafka 
source - that moved to SPARK-17812.
2. Regarding exactly-once, what I'm missing (given a transactional / idempotent 
sink) is the ability to restart a failed structured streaming job (after it 
crashed or was killed) by submitting a new one and asking to "copy" the context 
(offset and internal state) of another job / query.
The programming guide hints that this is possible, but doesn't give an example 
(and I couldn't find a relevant method):
"In case of a failure or intentional shutdown, you can recover the previous 
progress and state of a previous query, and continue where it left off".
If that were possible, it would remove the need to manually monitor / manage 
source offsets for exactly-once.
However, this is not specific to the Kafka source - it is relevant for all 
fault-tolerant sources.
3. Another related issue (though out-of-scope of this JIRA) is adding an 
"exactly-once" Kafka sink.
Since Kafka can't commit a batch of messages together (as in a Foreach 
sink), the open() of a [version,partition] can't just return a boolean. It 
should probably return the messages within that [version,partition] that were 
already written, so that only those are filtered out (or the sink should 
otherwise filter within the partition before process() is called).
Again, this is not Kafka-specific; it would be useful for other 
non-transactional sinks as well.
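
A minimal sketch of the idea in item 3, written in Python rather than against 
the actual Scala ForeachWriter API. Everything in it is hypothetical: the 
DedupSink class, the write_partition helper, the in-memory dict standing in 
for the external system, and the assumption that each message carries a stable 
id (for example its source offset) that can be compared across attempts.

```python
class DedupSink:
    """Hypothetical non-transactional sink; a dict stands in for the real system."""

    def __init__(self):
        self.store = {}        # (version, partition) -> {msg_id: payload}
        self.write_count = 0   # number of physical writes performed

    def open(self, version, partition):
        # Instead of a boolean, return the ids already written for this
        # [version, partition], so the caller can skip exactly those.
        return set(self.store.get((version, partition), {}))

    def process(self, version, partition, msg_id, payload):
        self.store.setdefault((version, partition), {})[msg_id] = payload
        self.write_count += 1


def write_partition(sink, version, partition, messages, crash_before=None):
    """Write one partition of one batch, skipping already-written messages."""
    already_written = sink.open(version, partition)
    for msg_id, payload in messages:
        if msg_id == crash_before:
            raise RuntimeError("simulated crash mid-partition")
        if msg_id in already_written:
            continue  # written by an earlier, failed attempt
        sink.process(version, partition, msg_id, payload)


sink = DedupSink()
batch = [(0, "a"), (1, "b"), (2, "c")]

try:
    # First attempt dies after ids 0 and 1 have been written.
    write_partition(sink, version=7, partition=0, messages=batch, crash_before=2)
except RuntimeError:
    pass

# The retry of the same [version, partition] writes only the missing id 2.
write_partition(sink, version=7, partition=0, messages=batch)
```

With a boolean open(), the retry would have to either skip the whole partition 
(losing id 2) or rewrite it all (duplicating ids 0 and 1); returning the 
written ids avoids both.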



> Structured streaming support for consuming from Kafka
> -----------------------------------------------------
>
>                 Key: SPARK-15406
>                 URL: https://issues.apache.org/jira/browse/SPARK-15406
>             Project: Spark
>          Issue Type: New Feature
>            Reporter: Cody Koeninger
>
> This is the parent JIRA to track all the work for building a Kafka source 
> for Structured Streaming. Here is the design doc for an initial version of 
> the Kafka Source.
> https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
> ================== Old description =========================
> Structured streaming doesn't have support for Kafka yet.  I personally feel 
> like time-based indexing would make for a much better interface, but it's 
> been pushed back to Kafka 0.10.1
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index



