[
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15553955#comment-15553955
]
Cody Koeninger commented on SPARK-15406:
----------------------------------------
As soon as you say "checkpoint", a whole class of people who were burned by the
problems with DStream checkpoints are going to tune out. I don't trust
anything unless I can get my hands on the offsets.
Regarding exactly once, this has two components, producing idempotently to
kafka, and storing idempotently/transactionally downstream. Specifically
regarding producing to Kafka, I was talking with Jay Kreps about this
yesterday. There have been a lot of ideas over the years (e.g.
put-with-expected-offset). The top contender at this point is apparently
something along the lines of a tcp/ip sequence number on the producer side
allowing for repeated writes to be ignored on the broker side. The KIP doc for
it should be coming "soon". Specifically regarding downstream write, without
an actual implementation for real downstream systems and no way to get offsets
except for individual messages, what exists currently couldn't even be tested
in production at my current gig. I could probably pretty quickly write a jdbc
sink and a Citus sink, but those seem like they would require significant
knowledge and / or assumptions about the job. Having a schema helps a lot, but
do you still want to assume things like a particular table format for outputs?
That those sinks only work with Kafka?
> Structured streaming support for consuming from Kafka
> -----------------------------------------------------
>
> Key: SPARK-15406
> URL: https://issues.apache.org/jira/browse/SPARK-15406
> Project: Spark
> Issue Type: New Feature
> Reporter: Cody Koeninger
>
> This is the parent JIRA to track all the work for the building a Kafka source
> for Structured Streaming. Here is the design doc for an initial version of
> the Kafka Source.
> https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
> ================== Old description =========================
> Structured streaming doesn't have support for kafka yet. I personally feel
> like time based indexing would make for a much better interface, but it's
> been pushed back to kafka 0.10.1
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]