Hi Guy, I don't have an answer about Flink, but here are a couple of comments on your use case that I hope might help:
- You should view HDFS as a giant RAID across nodes: the namenode maintains the file table, but the data is distributed and replicated across nodes by block. There is no 'data locality' guarantee: the data is distributed and replicated, so it could be spread across many nodes.
- Small files on HDFS are not a good idea. The typical block size is 64MB; a 1kB file doesn't actually take up 64MB on disk, but every file and block adds metadata the namenode has to keep in memory, and processing thousands of tiny files is very inefficient. It is best to aggregate those small files into one big file (a rough sketch of one way to do that is pasted at the bottom of this mail, below your quoted message) or use a DB store like HBase or Cassandra.
- In Spark, you can load files from the local file system, but usually that requires the files to be present on each node, which defeats the purpose.

Emmanuel

-------- Original message --------
From: Guy Rapaport <guy4...@gmail.com>
Date: 03/14/2015 8:38 AM (GMT-08:00)
To: Flink Users <user@flink.apache.org>
Subject: Maintaining data locality with list of paths (strings) as input

Hello,

Here's a use case I'd like to implement, and I wonder if Flink is the answer:

My input is a file containing a list of paths. (It's actually a message queue with incoming messages, each containing a path, but let's use a simpler use case.) Each path points at a file stored on HDFS. The files are rather small, so although they are replicated, they are not broken into chunks.

I want each file to get processed on the node on which it is stored, for the sake of data locality. However, if I run such a job on Spark, what I get is that the input path gets to some node, which then has to access the file by pulling it from HDFS - no data locality, but instead network congestion.

Can Flink solve this problem for me?

Note: I saw similar examples in which file lists are processed on Spark... by having each file in the list downloaded from the internet to the node processing it. That's not my use case - I already have the files on HDFS; all I want is to enjoy data locality in a cluster-like environment!

Thanks,
Guy.
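P.S. Regarding the "aggregate small files into a big one" point above: one common way to do it is to pack them into a Hadoop SequenceFile keyed by the original path, so HDFS ends up storing a few large blocks instead of thousands of tiny files. This is just a rough, untested sketch using the plain Hadoop FileSystem/SequenceFile APIs; the paths /data/small-files and /data/packed.seq and the class name are placeholders I made up:

import java.io.ByteArrayOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path inputDir = new Path("/data/small-files"); // placeholder input dir
        Path packed   = new Path("/data/packed.seq");  // placeholder output file

        // One SequenceFile keyed by the original path, valued by the file bytes.
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(packed),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class));
        try {
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (status.isDirectory()) {
                    continue; // keep the sketch simple: no recursion
                }
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                IOUtils.copyBytes(fs.open(status.getPath()), buffer, 4096, true);
                writer.append(new Text(status.getPath().toString()),
                              new BytesWritable(buffer.toByteArray()));
            }
        } finally {
            writer.close();
        }
    }
}

Once the data is in one big file, the splits handed to a framework correspond to HDFS blocks, which is the situation where locality-aware scheduling can actually kick in.

And just to make the locality point concrete: the naive way to process a list of paths (sketched here with Flink's DataSet API, but Spark looks much the same) only distributes the path strings, so the worker that ends up with a given path is arbitrary and the file bytes get pulled over the network, exactly as you observed. Again a rough, untested sketch; the hdfs:///jobs/... paths and the class name are placeholders:

import java.io.ByteArrayOutputStream;
import java.net.URI;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class NaivePathListJob {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // One HDFS path per line; only these strings are distributed to workers.
        DataSet<String> paths = env.readTextFile("hdfs:///jobs/path-list.txt");

        DataSet<Integer> sizes = paths.map(new MapFunction<String, Integer>() {
            @Override
            public Integer map(String p) throws Exception {
                // The worker running this map was picked without looking at where
                // the file's blocks live, so this open() usually crosses the network.
                FileSystem fs = FileSystem.get(new URI(p), new Configuration());
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                IOUtils.copyBytes(fs.open(new Path(p)), buffer, 4096, true);
                return buffer.size(); // stand-in for the real per-file processing
            }
        });

        sizes.writeAsText("hdfs:///jobs/file-sizes.txt");
        env.execute("naive path list job");
    }
}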