[ 
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15554426#comment-15554426
 ] 

Ofir Manor commented on SPARK-15406:
------------------------------------

Thanks Micahel for the checkpoint location explanation!  I didn't understand it 
after reading the docs. That should solve the "exactly-once" source restart 
challenge.

Regarding exactly-once to Kafka sink - we have implemented exactly that in our 
product (not within a Spark job). You are right that when retrying (and only 
when retrying), some scanning is needed (as I wrote, you need to know which 
messages of this [version,partition] have landed). A lot of it can be optimized 
though. Quick examples:
1. The Foreach sink open() call needs to recover that list (which messages have 
landed) only when retrying, so it is needed only in the first [version,*] after 
failure - that could be passed as a flag to open(). 
2. Hopefully the Structured Streaming checkpoint could hold the offset within 
the sink as part of the commited [version, partition] metadata, so that could 
significantly cut down the actual scanning (provide a starting point).
3. The actual scanning of the sink could be stopped (at least in a case of 
Kafka partition) once we read messages of [version+1], if I understand the 
Structured Streaming internals well enough (as Kafka partition provides order 
guarantee.

Anyway, the Kafka community have been discussing "exactly-once" as something 
that will arrive, and will likely deserve an "1.0" release, so maybe that would 
solve itself. I see Cody have more up-to-date insights on that...

> Structured streaming support for consuming from Kafka
> -----------------------------------------------------
>
>                 Key: SPARK-15406
>                 URL: https://issues.apache.org/jira/browse/SPARK-15406
>             Project: Spark
>          Issue Type: New Feature
>            Reporter: Cody Koeninger
>
> This is the parent JIRA to track all the work for the building a Kafka source 
> for Structured Streaming. Here is the design doc for an initial version of 
> the Kafka Source.
> https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
> ================== Old description =========================
> Structured streaming doesn't have support for kafka yet.  I personally feel 
> like time based indexing would make for a much better interface, but it's 
> been pushed back to kafka 0.10.1
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to