Hello,

Here's a use case I'd like to implement, and I wonder if Flink is the answer:
My input is a file containing a list of paths. (It's actually a message queue with incoming messages, each containing a path, but let's stick to the simpler use case.) Each path points to a file stored on HDFS. The files are rather small, so although they are replicated, they are not split into blocks. For the sake of data locality, I want each file to be processed on a node on which it is stored.

However, if I run such a job on Spark, the input path gets sent to some arbitrary node, which then accesses the file by pulling it over the network from HDFS: no data locality, just network congestion. Can Flink solve this problem for me?

Note: I have seen similar examples of file lists being processed on Spark, but in those, each file in the list is downloaded from the internet to the node processing it. That's not my use case; I already have the files on HDFS, and all I want is to enjoy data locality in a cluster environment!

Thanks,
Guy