Re: Using Kafka and Flink for batch processing of a batch data source

Leith Mudge Wed, 20 Jul 2016 16:58:34 -0700

Thanks Milind & Till,

This is what I thought from my reading of the documentation but it is nice to 
have it confirmed by people more knowledgeable.


Supplementary to this question is whether Flink is the best choice for batch 
processing at this point in time, or whether I would be better off looking at a 
more mature and dedicated batch processing engine such as Spark. I do like the 
choice offered by adopting the unified programming model outlined in the Apache 
Beam/Google Cloud Dataflow SDK, which purports to have runners for both Flink 
and Spark.

Regards,

Leith

From: Till Rohrmann <trohrm...@apache.org>
Date: Wednesday, 20 July 2016 at 5:05 PM
To: <user@flink.apache.org>
Subject: Re: Using Kafka and Flink for batch processing of a batch data source

At the moment there is also no batch source for Kafka. I'm also not so sure how 
you would define a batch given a Kafka stream. Only reading till a certain 
offset? Or maybe until one has read n messages?
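
A minimal sketch of the two bounds mentioned above (stop at a certain offset, or 
stop after n messages), applied to an unbounded stream. It does not use Kafka or 
Flink: the (offset, payload) tuples, `bounded_batch` and `endless_stream` are 
hypothetical stand-ins for illustration only.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional, Tuple

# (offset, payload): a hypothetical stand-in for a Kafka record.
Record = Tuple[int, str]


def bounded_batch(
    records: Iterable[Record],
    end_offset: Optional[int] = None,
    max_messages: Optional[int] = None,
) -> Iterator[Record]:
    """Yield records until an offset bound or a message-count bound is hit."""
    # Cap the number of records read when a count bound is given.
    capped = records if max_messages is None else islice(records, max_messages)
    for offset, payload in capped:
        # end_offset is treated as exclusive: stop before yielding it.
        if end_offset is not None and offset >= end_offset:
            break
        yield offset, payload


def endless_stream() -> Iterator[Record]:
    """Simulate an unbounded stream with monotonically increasing offsets."""
    offset = 0
    while True:
        yield offset, f"message-{offset}"
        offset += 1


# "Only reading till a certain offset?"
by_offset = list(bounded_batch(endless_stream(), end_offset=3))

# "Or maybe until one has read n messages?"
by_count = list(bounded_batch(endless_stream(), max_messages=2))
```

Either bound turns the stream into a finite batch, but both are arbitrary cut 
points chosen by the reader rather than a property of the data itself.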

I think it's best to write the batch data to HDFS or another batch data store.

Cheers,
Till

On Wed, Jul 20, 2016 at 8:08 AM, milind parikh 
<milindspar...@gmail.com> wrote:

It likely does not make sense to publish a file ("batch data") into Kafka 
unless the file is very small.

An improvised pub-sub mechanism for Kafka could be to (a) write the file to a 
persistent store outside of Kafka and (b) publish a message to Kafka about 
that write, so as to enable processing of that file.
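
Steps (a) and (b) can be sketched as follows. This is only an illustration 
under stated assumptions: a temporary directory stands in for the persistent 
store, an in-memory `queue.Queue` stands in for a Kafka topic, and the names 
`publish_file` and `process_next` and the message fields are hypothetical.

```python
import json
import queue
import tempfile
from pathlib import Path

# Stand-in for a Kafka topic; a real setup would use a Kafka producer/consumer.
notifications: "queue.Queue[str]" = queue.Queue()


def publish_file(store_dir: Path, name: str, data: bytes) -> Path:
    """(a) Write the file to the store, then (b) announce it with a small message."""
    path = store_dir / name
    path.write_bytes(data)
    # Only a reference to the file travels through the queue, not its contents.
    notifications.put(json.dumps({"path": str(path), "size": len(data)}))
    return path


def process_next() -> int:
    """Consume one notification and process the referenced file (here: count lines)."""
    event = json.loads(notifications.get_nowait())
    return len(Path(event["path"]).read_bytes().splitlines())


# The temporary directory plays the role of the persistent store.
with tempfile.TemporaryDirectory() as tmp:
    publish_file(Path(tmp), "batch-0001.csv", b"a,1\nb,2\nc,3\n")
    line_count = process_next()
```

The message stays small regardless of file size, so the queue carries 
notifications while the store carries the data.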

If you really need provenance around processing, you could route the data 
through NiFi before Flink.

Regards
Milind

On Jul 19, 2016 9:37 PM, "Leith Mudge" 
<lei...@palamir.com> wrote:

I am currently working on an architecture for a big data streaming and batch 
processing platform. I am planning on using Apache Kafka as a distributed 
messaging system to handle data from streaming data sources and then pass it on 
to Apache Flink for stream processing. I would also like to use Flink's batch 
processing capabilities to process batch data.

Does it make sense to pass the batch data through Kafka on a periodic basis 
as a source for Flink batch processing (is this even possible?), or should I 
just write the batch data to a data store and then process it by reading it 
into Flink?

________________________________

| All rights in this email and any attached documents or files are expressly 
reserved. This e-mail, and any files transmitted with it, contains confidential 
information which may be subject to legal privilege. If you are not the 
intended recipient, please delete it and notify Palamir Pty Ltd by e-mail. 
Palamir Pty Ltd does not warrant this transmission or attachments are free from 
viruses or similar malicious code and does not accept liability for any 
consequences to the recipient caused by opening or using this e-mail. For the 
legal protection of our business, any email sent or received by us may be 
monitored or intercepted. | Please consider the environment before printing 
this email. |


