Have you seen this thread? http://search-hadoop.com/m/q3RTt2uhMX1UhnCc1&subj=Re+Does+sc+newAPIHadoopFile+support+multiple+directories+or+nested+directories+
FYI

On Wed, Dec 9, 2015 at 11:18 AM, James Ding <jd...@palantir.com> wrote:

> Hi!
>
> My name is James, and I'm working on a question there don't seem to be a
> lot of answers about online. I was hoping spark/hadoop gurus could shed
> some light on this.
>
> I have a data feed on NFS that looks like /foo/*/*/*/bar/*.gz
>
> Currently I have a spark scala job that calls
>
> sparkContext.textFile("/foo/*/*/*/bar/*.gz")
>
> Upstream owners for the data feed have told me they may add additional
> nested directories or remove them from paths of files relevant to me. In
> other words, files relevant to my spark job might sit on paths that look
> like:
>
> - /foo/a/b/c/d/bar/*.gz
> - /foo/a/b/bar/*.gz
>
> They will do this with only some files and without warning. Does anyone
> have ideas on how I can configure spark to create an RDD from any text
> files that fit the pattern /foo/**/bar/*.gz, where ** represents a
> variable number of wildcard directories?
>
> I'm working with on the order of 10^5 to 10^6 files, which has discouraged
> me from using anything besides the Hadoop fs API to walk the filesystem
> and feed that input to my spark job, but I'm open to suggestions here
> also.
>
> Thanks!
>
> James Ding
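
For reference, here is a minimal sketch of the "walk the filesystem yourself" approach the original post alludes to: use the Hadoop FileSystem API to recursively list every *.gz file whose parent directory is named "bar" under /foo, then hand the matched paths to sc.textFile, which accepts a comma-separated list of paths. The paths, app name, and object name below are illustrative assumptions, not part of the thread.

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ArrayBuffer

object NestedGzInput {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("nested-gz-input"))

    // Recursively walk /foo on the driver and keep files that match the
    // logical pattern /foo/**/bar/*.gz, regardless of nesting depth.
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val files = fs.listFiles(new Path("/foo"), /* recursive = */ true)
    val matched = ArrayBuffer[String]()
    while (files.hasNext) {
      val p = files.next().getPath
      if (p.getName.endsWith(".gz") && p.getParent.getName == "bar") {
        matched += p.toString
      }
    }

    // sc.textFile accepts a comma-separated list of input paths, so the
    // explicit listing replaces the fixed-depth glob /foo/*/*/*/bar/*.gz.
    val lines = sc.textFile(matched.mkString(","))
    println(s"matched ${matched.size} files, ${lines.count()} lines")
  }
}

At the scale mentioned in the post (10^5 to 10^6 files), the driver-side listing and the very long comma-separated input string are the main costs of this approach; it is only one option, not necessarily the best one, and the thread linked above discusses alternatives around newAPIHadoopFile and nested directories.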