[
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15554426#comment-15554426
]
Ofir Manor commented on SPARK-15406:
------------------------------------
Thanks Micahel for the checkpoint location explanation! I didn't understand it
after reading the docs. That should solve the "exactly-once" source restart
challenge.
Regarding exactly-once to Kafka sink - we have implemented exactly that in our
product (not within a Spark job). You are right that when retrying (and only
when retrying), some scanning is needed (as I wrote, you need to know which
messages of this [version,partition] have landed). A lot of it can be optimized
though. Quick examples:
1. The Foreach sink open() call needs to recover that list (which messages have
landed) only when retrying, so it is needed only in the first [version,*] after
failure - that could be passed as a flag to open().
2. Hopefully the Structured Streaming checkpoint could hold the offset within
the sink as part of the commited [version, partition] metadata, so that could
significantly cut down the actual scanning (provide a starting point).
3. The actual scanning of the sink could be stopped (at least in a case of
Kafka partition) once we read messages of [version+1], if I understand the
Structured Streaming internals well enough (as Kafka partition provides order
guarantee.
Anyway, the Kafka community have been discussing "exactly-once" as something
that will arrive, and will likely deserve an "1.0" release, so maybe that would
solve itself. I see Cody have more up-to-date insights on that...
> Structured streaming support for consuming from Kafka
> -----------------------------------------------------
>
> Key: SPARK-15406
> URL: https://issues.apache.org/jira/browse/SPARK-15406
> Project: Spark
> Issue Type: New Feature
> Reporter: Cody Koeninger
>
> This is the parent JIRA to track all the work for the building a Kafka source
> for Structured Streaming. Here is the design doc for an initial version of
> the Kafka Source.
> https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
> ================== Old description =========================
> Structured streaming doesn't have support for kafka yet. I personally feel
> like time based indexing would make for a much better interface, but it's
> been pushed back to kafka 0.10.1
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]