Anwar,

Will try this as it might do exactly what I need. I will follow your
pattern but use sc.textFile() for each file.
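Something along these lines (untested sketch; listedPaths is a stand-in
for the list of path strings produced by a listing step like the one in
your openWithPath):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// One RDD per file via sc.textFile, each line tagged with its path,
// then unioned into a single RDD[(path, line)].
def textFilesWithPath(listedPaths: List[String], sc: SparkContext): RDD[(String, String)] = {
  val perFile = listedPaths.map { p =>
    sc.textFile(p).map(line => (p, line))
  }
  perFile.reduce(_ ++ _)
}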
I am now thinking that I could start with an RDD of file paths and map
it into (path, content) pairs, provided I could read a file on the
server. A sketch of what I mean is in the P.S. below.

Thank you,
Oleg

On 1 June 2014 18:41, Anwar Rizal <anriza...@gmail.com> wrote:

> I presume that you need to have access to the path of each file you
> are reading.
>
> I don't know whether there is a good way to do that for HDFS, so I
> read the files myself, something like:
>
> import java.net.URI
> import scala.collection.mutable.ListBuffer
> import org.apache.hadoop.fs.Path
> import org.apache.hadoop.io.{LongWritable, Text}
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
> import org.apache.spark.SparkContext
>
> def openWithPath(inputPath: String, sc: SparkContext) = {
>   val path = new Path(inputPath)
>   val fs = path.getFileSystem(sc.hadoopConfiguration)
>   // collect the URIs of all files directly under inputPath
>   val filesIt = fs.listFiles(path, false)
>   val paths = new ListBuffer[URI]
>   while (filesIt.hasNext) {
>     paths += filesIt.next.getPath.toUri
>   }
>   // one RDD per file, every line tagged with the file's URI,
>   // then unioned into a single RDD[(URI, String)]
>   val withPaths = paths.toList.map { p =>
>     sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](p.toString)
>       .map { case (_, s) => (p, s.toString) }
>   }
>   withPaths.reduce(_ ++ _)
> }
> ...
>
> I would be interested if there is a better way to do the same thing ...
>
> Cheers,
> a.
>
>
> On Sun, Jun 1, 2014 at 6:00 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Could you provide an example of what you mean?
>>
>> I know it's possible to create an RDD from a path with wildcards,
>> like in the subject.
>>
>> For example, sc.textFile("s3n://bucket/2014-??-??/*.gz"). You can
>> also provide a comma-delimited list of paths.
>>
>> Nick
>>
>> On Sunday, 1 June 2014, Oleg Proudnikov <oleg.proudni...@gmail.com>
>> wrote:
>>
>>> Hi All,
>>>
>>> Is it possible to create an RDD of the following form from a
>>> directory tree?
>>>
>>> RDD[(PATH, Seq[TEXT])]
>>>
>>> Thank you,
>>> Oleg
>>>

--
Kind regards,
Oleg
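P.S. Roughly what I mean by mapping an RDD of file paths into
(path, content) pairs, reading each file on the server side (untested
sketch; assumes the executors can reach the filesystem and that a
default Hadoop Configuration is enough to resolve it):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.rdd.RDD
import scala.io.Source

// Read each file inside the map, on the executors, yielding
// (path, whole-file contents) pairs.
def contentsByPath(paths: RDD[String]): RDD[(String, String)] =
  paths.map { p =>
    val path = new Path(p)
    val fs = path.getFileSystem(new Configuration())
    val in = fs.open(path)
    try (p, Source.fromInputStream(in, "UTF-8").mkString)
    finally in.close()
  }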