GitHub user detonator413 commented on the pull request:

    https://github.com/apache/flink/pull/1090#issuecomment-137480178

Hi Max,

Look at the distcp utility (http://hadoop.apache.org/docs/r1.2.1/distcp.html). Its purpose is to copy large numbers of files within one cluster or between clusters. In local mode the tool also works with the local FS, whereas in distributed mode only HDFS paths are supposed to be used.

I ran a simple benchmark copying 800GB of data within one cluster, running Hadoop distcp (using the default distcp input format) and Flink distcp in parallel. The Flink job was 1.5 minutes faster (it took approximately 35 minutes in our setup).

Slava

> On 03 Sep 2015, at 17:00, Max <notificati...@github.com> wrote:
>
> Thanks for your pull request! I'm assuming you would use this utility to copy files from your local to a remote file system, right? Your utility starts a Flink job to copy the files to the remote file systems. This only works if you execute it locally because otherwise the task managers need to have the files available and that might defeat the utility's purpose. Also, imagine someone embedding the tool in a Flink program. The person might wonder why his/her program actually executes two jobs (one for the utility, one for the actual job).
>
> I think this would be more useful as a utility function, e.g. in a FileUtils class in flink-core. The method there would receive a list of files and then upload the files like you did using Flink's FileSystem abstraction. We could still parallelize the method by starting multiple threads to upload the files.
>
> Correct me if I'm wrong or misunderstood your pull request :)
>
> --
> Reply to this email directly or view it on GitHub <https://github.com/apache/flink/pull/1090#issuecomment-137477152>.
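
For reference, the thread-based alternative suggested in the quoted comment (a method that receives a list of files and copies them on several threads) could look roughly like the sketch below. This is only an illustration under stated assumptions: `ParallelCopy` and `copyAll` are hypothetical names that do not exist in flink-core, and the sketch uses the JDK's `java.nio.file` API as a stand-in for Flink's FileSystem abstraction, so it only handles paths the local JVM can reach.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, not part of Flink. It uses java.nio.file as a
// stand-in for Flink's FileSystem abstraction (an assumption of this sketch).
final class ParallelCopy {

    // Copies every source file into targetDir using a fixed-size thread pool
    // and returns the number of files copied.
    static int copyAll(List<Path> sources, Path targetDir, int threads)
            throws IOException, InterruptedException {
        Files.createDirectories(targetDir);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one copy task per file.
            List<Future<Path>> pending = new ArrayList<>();
            for (Path source : sources) {
                Path target = targetDir.resolve(source.getFileName());
                Callable<Path> task = () ->
                        Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
                pending.add(pool.submit(task));
            }
            // Wait for all tasks; surface the first failure as an IOException.
            int copied = 0;
            for (Future<Path> result : pending) {
                try {
                    result.get();
                    copied++;
                } catch (ExecutionException e) {
                    throw new IOException("Copy failed", e.getCause());
                }
            }
            return copied;
        } finally {
            pool.shutdown();
        }
    }
}
```

A sketch like this runs inside a single JVM, so all data flows through one machine. That is the trade-off against the distcp-style job described above, which spreads the copy across the cluster's task managers.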