Hello,
You said that "data is distributed very badly across slots"; do you mean
that only a small number of subtasks are reading from HDFS, or that the
keyed data is processed by only a few subtasks?
Flink does prioritize data locality over data distribution when reading
the files, but the function after the groupBy() should still make full
use of the parallelism of the cluster. Do note that data skew can affect
how much data is distributed to each node; i.e., if 80% of your data has
the same key (or rather, the same hash), those records will all end up
on the same node.
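
As a rough way to check this, here is a minimal, self-contained Java
sketch (the class name, the sample keys, and the use of a plain
hashCode modulo are my assumptions for illustration; Flink's batch
runtime actually applies a murmur hash on top of the key's hashCode
before picking the channel) that estimates how a sample of your keys
would spread across 32 subtasks:

import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: approximate how records spread over subtasks
// under hash partitioning. Replace sampleKeys with a sample of your
// real keys.
public class KeySkewCheck {
    public static void main(String[] args) {
        int parallelism = 32; // 8 task managers x 4 slots
        String[] sampleKeys = {"a", "b", "c", "a", "a"};
        Map<Integer, Integer> counts = new HashMap<>();
        for (String key : sampleKeys) {
            // floorMod keeps the bucket non-negative for negative hashCodes
            int subtask = Math.floorMod(key.hashCode(), parallelism);
            counts.merge(subtask, 1, Integer::sum);
        }
        counts.forEach((subtask, n) ->
                System.out.println("subtask " + subtask + " -> " + n + " records"));
    }
}

If a handful of buckets dominate the counts, the skew comes from your
key distribution rather than from the scheduler.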
On 12.03.2018 13:49, Reinier Kip wrote:
Relevant versions: Beam 2.1, Flink 1.3.
------------------------------------------------------------------------
*From:* Reinier Kip <r...@bol.com>
*Sent:* 12 March 2018 13:45:47
*To:* user@flink.apache.org
*Subject:* HDFS data locality and distribution
Hey all,
I'm trying to batch-process 30-ish files from HDFS, but I see that
data is distributed very badly across slots: 4 out of 32 slots get
4/5ths of the data, another 3 slots get about 1/5th, and a last slot
gets just a few records. This probably triggers disk spillover on
these slots and slows down the job immensely. The data has many, many
unique keys and processing could be done in a highly parallel manner.
From what I understand, HDFS data locality governs which splits are
assigned to which subtask.
* I'm running a Beam-on-Flink-on-YARN pipeline.
* I'm reading 30-ish files, whose records are later grouped by
their millions of unique keys.
* For now, I have 8 task managers with 4 slots each. Beam sets all
operators to a parallelism of 32.
* Data seems to be localised to 9 of the 32 slots, on 3 of the
8 task managers.
Does my description of input split assignment ring true? Is the fact
that data isn't redistributed a deliberate effort by Flink to achieve
high data locality, even if this means disk spillover for a few
slots/TMs and idleness for others? Is there any use for parallelism if
work isn't distributed anyway?
Thanks for your time,
Reinier