Hi! You can use either the DataSet API or the DataStream API for that. In case of a failure, they behave slightly differently.
DataSet: Fault tolerance for the DataSet API works by restarting the job and redoing all of the work. In that sense it is similar to what happens in MapReduce, except that Flink currently restarts more tasks than strictly necessary (work is in progress to reduce that). Periodic in-flight checkpoints are not used here.

DataStream: A streaming job would start inserting data immediately and draw periodic checkpoints, so that replay on failure only has to redo a small amount of work, not everything.

Whether this fits your use case depends on the type of processing you want to do. You could even use such a job to monitor the directory for new files, pick them up, and start inserting them into the database immediately when they appear.

Regarding the last question (JDBC output format): using UPSERT needs a few modifications (an issue that another user ran into), so you would probably have to write a custom output format based on the JDBC output format.

If you go with the streaming API, it should be possible to change the database-writing output format to give you exactly-once semantics. The way to do that is to commit the upserts only on completed checkpoints (and buffer them in the sink between checkpoints). This may be interesting if your database cannot deduplicate insertions (no deterministic primary key).

Greetings,
Stephan

On Mon, Nov 9, 2015 at 5:25 PM, Maximilian Bode <maximilian.b...@tngtech.com> wrote:

> Hi everyone,
>
> I am considering using Flink in a project. The setting would be a YARN
> cluster where data is first read in from HDFS, then processed, and finally
> written into an Oracle database using an upsert command. If I understand
> the documentation correctly, the DataSet API would be the natural candidate
> for this problem.
>
> My first question is about the checkpointing system. Apparently (e.g. [1]
> and [2]) it does not apply to batch processing. So how does Flink handle
> failures during batch processing?
> For the use case described above, 'at least once' semantics would suffice
> – still, are 'exactly once' guarantees possible?
> For example, how does Flink handle the failure of one taskmanager during a
> batch process? What happens in this case if the data has already partly
> been written to the database?
>
> Secondly, the most obvious, straightforward approach of connecting to the
> Oracle DB would be the JDBC output format. In [3], it was mentioned that it
> does not have many users and might not be trusted. What is the status on
> this?
>
> Best regards,
> Max
>
> [1] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-and-Spark-tp583p587.html
> [2] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Batch-Processing-as-Streaming-td1909.html
> [3] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/PotsgreSQL-JDBC-Sink-quot-writeRecord-failed-quot-and-quot-Batch-element-cancelled-quot-on-upsert-td623.html
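To make the "buffer in the sink between checkpoints, commit on completed checkpoints" idea above concrete, here is a minimal, self-contained Java sketch of the pattern. It deliberately uses no Flink classes: the method names (invoke, snapshotState, notifyCheckpointComplete) mirror the hooks a real Flink sink would implement, but the class itself, its names, and the in-memory "database" are hypothetical and for illustration only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the commit-on-completed-checkpoint pattern. In a real Flink
// job this logic would live in a custom sink; here plain Java stands in.
class BufferingUpsertSink {
    // Records received since the last checkpoint barrier.
    private final List<String> buffer = new ArrayList<>();
    // Records held back per pending checkpoint, waiting for completion.
    private final Map<Long, List<String>> pending = new HashMap<>();
    // Stand-in for the database: rows that were actually committed.
    private final List<String> committed = new ArrayList<>();

    // Called for every incoming record; nothing reaches the DB yet.
    public void invoke(String record) {
        buffer.add(record);
    }

    // Called when a checkpoint barrier passes: snapshot the buffer and
    // start a fresh one for the next checkpoint interval.
    public void snapshotState(long checkpointId) {
        pending.put(checkpointId, new ArrayList<>(buffer));
        buffer.clear();
    }

    // Called once the checkpoint is globally complete: only now is it
    // safe to commit the upserts, because a recovery will never replay
    // records from before this checkpoint.
    public void notifyCheckpointComplete(long checkpointId) {
        List<String> rows = pending.remove(checkpointId);
        if (rows != null) {
            // Real code would execute batched upserts and COMMIT here.
            committed.addAll(rows);
        }
    }

    public List<String> getCommitted() {
        return committed;
    }
}
```

The key property: records never reach the database while they could still be replayed after a failure, which is what gives the exactly-once effect even when the database cannot deduplicate.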
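For the custom output format mentioned above, the main change to a JDBC-based output format is the SQL it prepares: an Oracle MERGE statement instead of a plain INSERT. A small hypothetical helper that builds such a statement (table and column names here are made up for illustration):

```java
// Hypothetical helper for a custom output format modeled on the JDBC
// output format: builds an Oracle MERGE ("upsert") statement with bind
// parameters for one key column and one value column.
class OracleUpsertSql {
    public static String buildMerge(String table, String keyCol, String valCol) {
        return "MERGE INTO " + table + " t "
             + "USING (SELECT ? AS " + keyCol + ", ? AS " + valCol + " FROM dual) s "
             + "ON (t." + keyCol + " = s." + keyCol + ") "
             + "WHEN MATCHED THEN UPDATE SET t." + valCol + " = s." + valCol + " "
             + "WHEN NOT MATCHED THEN INSERT (" + keyCol + ", " + valCol + ") "
             + "VALUES (s." + keyCol + ", s." + valCol + ")";
    }
}
```

In the output format you would prepare this statement once in open(), then for each record set the bind parameters and add it to a batch, executing the batch either periodically or (with the streaming approach above) on completed checkpoints.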