[
https://issues.apache.org/jira/browse/SPARK-17815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15559541#comment-15559541
]
Ofir Manor commented on SPARK-17815:
------------------------------------
As far as I understand, there is a clear single-point-of-truth: the structured
streaming "commit log" - the checkpoint. It holds both the source state
(offsets) and the Spark state (aggregations) of successfully finished batches
atomically, and is the one that is used during recovery to identify the correct
beginning offset in the source during recovery.
The structured WAL is a technical, internal implementation detail, that stores
an intention to process a range of offsets, before they are actually read.
Spark used it during recovery to repeat the same source end boundary to a
failed batch.
The data in the downstream store is about Spark output - which [version,spark
partition] have landed - not about source state. Of course, it is being used
during Spark recovery / retry, but not as a basis to choose a offsets in the
source (it is used to skip specific output version-partitions there were
already written).
As this ticket states, updating the Kafka consumer group offsets in Kafka is
only for easier progress monitoring using Kafka-specific tools. So, it should
be considered informational, after-the-fact updating just for being nice, as it
won't be used for Spark recovery. If a user want to manually recover, it should
rely on the Spark checkpoint offset.
In other words, updating Kafka offsets after a batch successfully commited
means that the offsets in Kafka represent which messages have been successfully
processed and landed in the sink, not which messages have been read.
[~marmbrus] Is my understanding correct?
> Report committed offsets
> ------------------------
>
> Key: SPARK-17815
> URL: https://issues.apache.org/jira/browse/SPARK-17815
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Michael Armbrust
>
> Since we manage our own offsets, we have turned off auto-commit. However,
> this means that external tools are not able to report on how far behind a
> given streaming job is. When the user manually gives us a group.id, we
> should report back to it.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]