Re: HDFS data locality and distribution

2018-03-19 Thread Reinier Kip
From: Chesnay Schepler Sent: 13 March 2018 12:40:02 To: user@flink.apache.org Subject: Re: HDFS data locality and distribution Hello, You said that "data is distributed very badly across slots"; do you mean that only a small number of subtasks is reading from HDFS, or

Re: HDFS data locality and distribution

2018-03-13 Thread Chesnay Schepler
Hello, You said that "data is distributed very badly across slots"; do you mean that only a small number of subtasks is reading from HDFS, or that the keyed data is only processed by a few subtasks? Flink does prioritize data locality over data distribution when reading the files, but the fu

Re: HDFS data locality and distribution

2018-03-12 Thread Reinier Kip
Relevant versions: Beam 2.1, Flink 1.3. From: Reinier Kip Sent: 12 March 2018 13:45:47 To: user@flink.apache.org Subject: HDFS data locality and distribution Hey all, I'm trying to batch-process 30-ish files from HDFS, but I see that data is distributed very