Have you seen this thread?

http://search-hadoop.com/m/q3RTt2uhMX1UhnCc1&subj=Re+Does+sc+newAPIHadoopFile+support+multiple+directories+or+nested+directories+

FYI
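
A related detail: SparkContext.textFile hands its path string to Hadoop's
FileInputFormat, which accepts a comma-separated list of paths, each of which
may itself be a glob. A minimal sketch of that workaround, assuming the
possible nesting depths are known in advance and every listed glob matches at
least one file (FileInputFormat rejects a pattern that matches nothing):

    // Hypothetical depths; adjust to whatever the upstream feed actually uses.
    val patterns = Seq(
      "/foo/*/*/bar/*.gz",     // two directories between foo and bar
      "/foo/*/*/*/bar/*.gz",   // three directories
      "/foo/*/*/*/*/bar/*.gz"  // four directories
    )
    val rdd = sc.textFile(patterns.mkString(","))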

On Wed, Dec 9, 2015 at 11:18 AM, James Ding <jd...@palantir.com> wrote:

> Hi!
>
> My name is James, and I’m working on a question that doesn’t seem to have
> many answers online. I was hoping Spark/Hadoop gurus could shed some light
> on this.
>
> I have a data feed on NFS that looks like /foo/*/*/*/bar/*.gz
>
> Currently I have a Spark Scala job that calls
>
> sparkContext.textFile("/foo/*/*/*/bar/*.gz")
>
> Upstream owners of the data feed have told me they may add nested
> directories to, or remove them from, the paths of files relevant to me. In
> other words, files relevant to my Spark job might sit on paths that look
> like:
>
>    - /foo/a/b/c/d/bar/*.gz
>    - /foo/a/b/bar/*.gz
>
> They will do this to only some files and without warning. Does anyone have
> ideas on how I can configure Spark to create an RDD from any text files that
> fit the pattern /foo/**/bar/*.gz, where ** represents a variable number of
> wildcard directories?
>
> I'm working with on the order of 10^5 to 10^6 files, which has discouraged
> me from using anything besides the Hadoop FileSystem API to walk the
> filesystem and feed that input to my Spark job, but I'm open to suggestions
> here also.
>
> Thanks!
>
> James Ding
>
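
For a variable and unknown number of intermediate directories, another option
is the approach the question already hints at: walk the tree once with the
Hadoop FileSystem API, keep the files whose parent directory is "bar" and
whose name ends in ".gz", and pass the resulting list to textFile. A rough
sketch under those assumptions (the root /foo and the SparkContext sc are
taken from the question; not benchmarked at 10^5-10^6 files):

    import org.apache.hadoop.fs.{FileSystem, Path}
    import scala.collection.mutable.ArrayBuffer

    // Resolve the filesystem for /foo; a local NFS mount may need a file:// prefix.
    val root = new Path("/foo")
    val fs = root.getFileSystem(sc.hadoopConfiguration)

    // listFiles(..., recursive = true) walks every subdirectory under the root.
    val matched = ArrayBuffer[String]()
    val it = fs.listFiles(root, true)
    while (it.hasNext) {
      val p = it.next().getPath
      if (p.getName.endsWith(".gz") && p.getParent.getName == "bar") {
        matched += p.toString
      }
    }

    // textFile accepts a comma-separated list of input paths.
    val rdd = sc.textFile(matched.mkString(","))

With that many files, the driver-side listing and the very long comma-separated
path string are the main costs to watch; splitting the list into chunks and
taking sc.union of several smaller textFile RDDs is one possible variation.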
