Thanks Arun.
In our case, we only commit sink task output to the datastore by
coordinating with the driver.
Sink tasks write output to a "staging" area, and the driver only commits
the staging data to the datastore once all tasks for a micro-batch have
reported success back to the driver.
In the cas
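For illustration, the shape of such a sink can be sketched with foreachBatch
(Spark 2.4+); the paths and the promoteStaging helper below are hypothetical
placeholders, not our actual implementation:

import org.apache.spark.sql.{DataFrame, SparkSession}

object StagedSinkSketch {

  // Placeholder for the datastore-specific atomic commit, e.g. a directory
  // rename or a transactional INSERT keyed by the batch id.
  def promoteStaging(stagingPath: String, batchId: Long): Unit = ()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("staged-sink").getOrCreate()

    val input = spark.readStream.format("rate").load()

    // Explicitly typed function value (avoids the foreachBatch overload
    // ambiguity on Scala 2.12).
    val writeBatch: (DataFrame, Long) => Unit = (batchDF, batchId) => {
      val stagingPath = s"/tmp/staging/batch=$batchId" // hypothetical layout
      // Sink tasks write their partitions into the staging area.
      batchDF.write.mode("overwrite").parquet(stagingPath)
      // The write above only returns after every task in the micro-batch has
      // succeeded, so the driver can now promote the staged data.
      promoteStaging(stagingPath, batchId)
    }

    val query = input.writeStream
      .foreachBatch(writeBatch)
      .option("checkpointLocation", "/tmp/checkpoints/staged-sink")
      .start()

    query.awaitTermination()
  }
}

The key point is that foreachBatch hands the whole micro-batch and its batch
id to driver-side code, which is what allows the commit to be coordinated in
one place.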
Hi Eric,
I think it will depend on how you implement the sink and when the data in
the sink partitions is committed.
I think the batch can be repeated during task retries, as well as if the
driver fails before the batch id is committed in Spark's checkpoint. In the
first case maybe the sink had n
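For example, one way to guard against a replayed batch on the sink side is to
key the commit on the batch id (the helper names below are hypothetical):

import org.apache.spark.sql.DataFrame

object IdempotentCommitSketch {

  // Hypothetical: checks whether this batch id was already recorded in the
  // datastore.
  def alreadyCommitted(batchId: Long): Boolean = false

  // Hypothetical: writes the rows and records the batch id in one transaction.
  def writeTransactionally(batchDF: DataFrame, batchId: Long): Unit = ()

  def commitIfNew(batchDF: DataFrame, batchId: Long): Unit = {
    if (alreadyCommitted(batchId)) {
      // Replay (task retry, or driver restart before the offsets were
      // checkpointed): the data is already in the datastore, so skip it.
    } else {
      writeTransactionally(batchDF, batchId)
    }
  }
}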
Hi Arun, Gabor,
Thanks for the feedback. We are using the "Exactly-once using
transactional writes" approach, so we don't rely on message keys for
idempotent writes.
So I should clarify that my question is specific to the "Exactly-once using
transactional writes" approach.
We are following the ap
Yes! This is very helpful!
On Wed, Dec 5, 2018 at 9:21 PM Wenchen Fan wrote:
> great job! thanks a lot!
>
> On Thu, Dec 6, 2018 at 9:39 AM Hyukjin Kwon wrote:
>
>> It's merged now and in developer tools page -
>> http://spark.apache.org/developer-tools.html#individual-tests
>> Have some func wi
Hi,
I'd like to discuss the future of timestamp support in Spark, in particular
with respect to handling timezones in different SQL types. In a nutshell:
* There are at least 3 different ways of handling the timestamp type across
timezone changes
* We'd like Spark to clearly distinguish the 3 t
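As a small spark-shell illustration of the current behaviour as I understand
it (the stored value is an instant, and only the rendering follows
spark.sql.session.timeZone):

import spark.implicits._

// Timestamp.valueOf interprets the string in the JVM default time zone and
// produces a fixed instant.
val df = Seq(Tuple1(java.sql.Timestamp.valueOf("2018-12-06 00:00:00"))).toDF("ts")

// The same stored instant, rendered under two different session time zones:
spark.conf.set("spark.sql.session.timeZone", "UTC")
df.show(false)

spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
df.show(false)
// The two outputs differ by the zone offset, because only the rendering
// changes with the session time zone, not the stored value.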
Hi Eric,
In order to have exactly-once semantics one needs a replayable source and an
idempotent sink.
The cases you've mentioned cover the 2 main groups of issues.
Practically any kind of programming problem can end up producing duplicated
data (even in the code which feeds Kafka).
Don't know why you have a
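As a side note, if the producer side can put duplicates into Kafka, one common
guard is a watermarked dropDuplicates on some record key before the sink. A
rough spark-shell style sketch (the bootstrap servers and topic are
placeholders; it needs the spark-sql-kafka-0-10 package):

import org.apache.spark.sql.functions._

val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092") // placeholder
  .option("subscribe", "events")                  // placeholder topic
  .load()

// Deduplicate on a key within the watermark window before writing to the sink.
val deduped = input
  .select(col("key").cast("string").as("eventKey"),
          col("timestamp").as("eventTime"),
          col("value"))
  .withWatermark("eventTime", "10 minutes")
  .dropDuplicates("eventKey", "eventTime")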