Re: Maintaining data locality with list of paths (strings) as input

2015-03-15 Thread Robert Metzger
Hi, @Emmanuel: "Is the Flink behavior mentioned native or is this something happening when running Flink on YARN?" The input split assignment behavior Stephan described is implemented in Flink itself, so it works in a standalone Flink cluster and in a YARN setup. In a setup where each machine running a

Re: Maintaining data locality with list of paths (strings) as input

2015-03-14 Thread Guy Rapaport
Hi Stephan, The case is this: I have lots of images stored on a cluster, and I want to create a system in which I send a message (to a message queue: let's say Apache Kafka) and the message is accepted within the cluster and processed. The message contains the ID of one of the images (or even its

RE: Maintaining data locality with list of paths (strings) as input

2015-03-14 Thread Emmanuel
aining data locality with list of paths (strings) as input From: se...@apache.org To: user@flink.apache.org Hi Guy, This sounds like a use case that should work with Flink. When it comes to input handling, Flink differs a bit from Spark. Flink creates a set of input tasks and a set of input splits.

Re: Maintaining data locality with list of paths (strings) as input

2015-03-14 Thread Stephan Ewen
Hi Guy, This sounds like a use case that should work with Flink. When it comes to input handling, Flink differs a bit from Spark. Flink creates a set of input tasks and a set of input splits. The splits are then assigned to the tasks on the fly. Each task may work on multiple input splits, which ar
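
The on-the-fly assignment described above can be illustrated with a small simulation. This is only a sketch and not Flink's code: the function name `assign_splits`, the host names, and the round-robin request order are all made up for illustration, and the rule "prefer a split stored on the requesting task's host, otherwise take any remaining split" is an assumption about how locality could be honored.

```python
from collections import deque


def assign_splits(splits, task_hosts):
    """Simulate lazy, locality-aware assignment of input splits to tasks.

    splits:     list of (split_id, set of hosts that store the split's data)
    task_hosts: list of host names, one entry per input task
    Returns a list of (task_index, split_id, is_local) in assignment order.
    """
    if splits and not task_hosts:
        raise ValueError("at least one input task is required")

    remaining = list(splits)
    assignments = []
    # In a real system a task asks for its next split when it finishes the
    # previous one. Here that is modeled as a simple round-robin queue.
    ready = deque(range(len(task_hosts)))
    while remaining:
        task = ready.popleft()
        host = task_hosts[task]
        # Assumed policy: prefer a split whose data lives on this host.
        local = next((s for s in remaining if host in s[1]), None)
        chosen = local if local is not None else remaining[0]
        remaining.remove(chosen)
        assignments.append((task, chosen[0], local is not None))
        ready.append(task)  # the task will request another split later
    return assignments


if __name__ == "__main__":
    demo_splits = [("a", {"h1"}), ("b", {"h2"}), ("c", {"h1"}), ("d", {"h3"})]
    for task, split_id, is_local in assign_splits(demo_splits, ["h1", "h2"]):
        print(task, split_id, "local" if is_local else "remote")
```

Because splits are handed out one request at a time rather than fixed up front, a task on a host with no remaining local data still picks up leftover work, at the cost of a remote read.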

RE: Maintaining data locality with list of paths (strings) as input

2015-03-14 Thread Emmanuel
Hi Guy, I don't have an answer about Flink, but here are a couple of comments on your use case that I hope might help: - you should view HDFS as a giant RAID across nodes: the namenode maintains the file table but the data is distributed and replicated across nodes by block. There is no 'data locality' guarantee