Using Flink Streaming to write to multiple output files in HDFS

Andra Lungu Wed, 21 Oct 2015 02:41:30 -0700

Hey guys,

Long time, no see :). I recently started a new job and it involves
performing a set of real-time data analytics using Apache Kafka, Storm
and Flume.


What happens, at a very high level, is that a set of signals is
collected and stored in a Kafka topic, and then Storm is used to filter
certain fields out or to enrich the fields with other
meta-information. Finally, Flume writes the output into multiple HDFS
files depending on the date, hour etc.
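
To make the date/hour split concrete, here is a minimal sketch of how a
record's timestamp could be mapped to an output directory. This is plain
Java and not Flume or Flink code: the class name BucketPaths, the method
bucketFor, the base path hdfs:///signals and the yyyy-MM-dd/HH layout are
all assumptions made up for illustration.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: maps an event timestamp to a date/hour directory.
public class BucketPaths {
    // Assumed layout: one directory per day, one subdirectory per hour (UTC).
    private static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd/HH").withZone(ZoneOffset.UTC);

    // Returns basePath + "/" + the date/hour bucket for the given epoch millis.
    static String bucketFor(String basePath, long epochMillis) {
        return basePath + "/" + FORMAT.format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        // The base path is an example value, not a real cluster location.
        System.out.println(bucketFor("hdfs:///signals", 0L));
    }
}
```

Every record whose timestamp falls into the same hour ends up under the
same directory, which is the behaviour I would like to reproduce.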

Now, I saw that Flink can handle a similar pipeline, but without
needing Flume for the writing to HDFS part (see
http://data-artisans.com/kafka-flink-a-practical-how-to/). Which
brings me to my question: how does Flink handle writing to multiple
files in a streaming fashion? (Until now, I was playing with batch, and
writeAsCsv just took one file as a parameter.)

Next question: What are the prerequisites to deploy a Flink Streaming
job on a cluster? Yarn, HDFS, anything else?

Final question, more of a request: I'd like to play around with Flink
Streaming to assess whether it can substitute Storm in this use case
and whether it can outrun it :P. To this end, I'll need some starting
points: docs, blog posts, examples to read. Any input would be useful.

I wanted to dig for a newbie task in the streaming area, but I could
not find one... can we think of something easy to get me started?

Thanks! Hope you guys had fun at Flink Forward!
Andra
