Probably off-topic for the Kafka list, but why do you think you need multiple copies of the file to parallelize access? You'll get parallel access based on the number of containers you have on the machine (if you are running Spark on YARN).
On Mon, Mar 16, 2015 at 1:20 PM, Daniel Haviv <[email protected]> wrote:
> Hi,
> The reason we want to use this method is that this way a file can be consumed
> by different streaming apps simultaneously (they just consume its path from
> Kafka and open it locally).
>
> With fileStream, to parallelize the processing of a specific file I would have
> to make several copies of it, which is wasteful in terms of both space and time.
>
> Thanks,
> Daniel
>
>> On Mar 16, 2015, at 22:12, Gwen Shapira <[email protected]> wrote:
>>
>> Any reason not to use Spark Streaming directly with HDFS files, so
>> you'll get locality guarantees from the Hadoop framework?
>> StreamingContext has a textFileStream() method you could use for this.
>>
>> On Mon, Mar 16, 2015 at 12:46 PM, Daniel Haviv
>> <[email protected]> wrote:
>>> Hi,
>>> Is it possible to assign specific partitions to specific nodes?
>>> I want to upload files to HDFS, find out on which nodes the file resides,
>>> and then push their paths into a topic partitioned by node.
>>> This way I can ensure that the consumer (Spark Streaming) will consume both
>>> the message and the file locally.
>>>
>>> Can this be achieved?
>>>
>>> Thanks,
>>> Daniel
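For what it's worth, the partition-by-node routing Daniel describes could be sketched roughly like this: assign each data node a dedicated Kafka partition, then route a file's path to the partition of a node that holds one of its block replicas. This is only an illustrative sketch, not the thread's actual setup — the host list, helper names, and the block-location lookup are hypothetical (in the Java API you would get the real replica hosts from FileSystem.getFileBlockLocations on the NameNode).

```python
# Illustrative sketch of "partition the topic by HDFS node" (hypothetical
# helper names; block-replica hosts would really come from the NameNode,
# e.g. FileSystem.getFileBlockLocations in the Hadoop Java API).

def build_host_partition_map(hosts):
    """Assign each data-node host a dedicated Kafka partition (0..N-1)."""
    return {host: i for i, host in enumerate(sorted(hosts))}

def partition_for_file(block_hosts, host_to_partition):
    """Pick the partition of the first known host holding a replica of the
    file's blocks, so a consumer pinned to that partition reads locally."""
    for host in block_hosts:
        if host in host_to_partition:
            return host_to_partition[host]
    raise ValueError("no known host holds this file")

hosts = ["node-a", "node-b", "node-c"]   # assumed cluster data nodes
mapping = build_host_partition_map(hosts)
# Suppose HDFS reports the file's block replicas live on node-b and node-c:
print(partition_for_file(["node-b", "node-c"], mapping))  # -> 1
```

The producer would then send the file path with this partition set explicitly (a Kafka ProducerRecord accepts a partition number), and each Spark Streaming consumer running on a node would be assigned only its own node's partition. Note that this still doesn't guarantee locality: HDFS places replicas where it chooses, and the consumer for a partition may not run on the matching node.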
