This API reads a directory of files, not one file. A "file" here
really means a directory full of part-* files. You do not need to read
those separately.
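For example (a minimal sketch, assuming a SparkContext sc; the
directory path is hypothetical):

    // Reads every part-* file under the directory into a single RDD
    val rdd = sc.textFile("hdfs:///user/senqiang/output")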

Any syntax that works with Hadoop's FileInputFormat should work. I
believe you can also specify a comma-separated list of paths, since
FileInputFormat.setInputPaths accepts one, though I have not verified
it here.
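Something like this sketch, with hypothetical paths:

    // Comma-separated paths are handed down to FileInputFormat.setInputPaths
    val rdd = sc.textFile("hdfs:///data/2015-01,hdfs:///data/2015-02")

Glob patterns such as hdfs:///data/2015-* should also work, since
FileInputFormat expands them when listing input paths.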

On Tue, Mar 3, 2015 at 10:57 PM, S. Zhou <myx...@yahoo.com.invalid> wrote:
> Thanks Ted. Actually, a follow-up question: I need to read multiple HDFS
> files into an RDD. What I am doing now is: for each file, I read it into an
> RDD, then later I union all these RDDs into one RDD. I am not sure if that
> is the best way to do it.
>
> Thanks
> Senqiang
>
>
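(For reference, a sketch of the union approach described above, assuming
a SparkContext sc and hypothetical paths; SparkContext.union combines
all the RDDs in one call rather than chaining pairwise unions:)

    val paths = Seq("hdfs:///data/a", "hdfs:///data/b")  // hypothetical
    val rdds = paths.map(p => sc.textFile(p))
    val combined = sc.union(rdds)  // single union over the whole list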
> On Tuesday, March 3, 2015 2:40 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>
> Looking at scaladoc:
>
>  /** Get an RDD for a Hadoop file with an arbitrary new API InputFormat. */
>   def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]]
>
> Your conclusion is confirmed.
>
> On Tue, Mar 3, 2015 at 1:59 PM, S. Zhou <myx...@yahoo.com.invalid> wrote:
>
> I did some experiments and it seems it does not. But I'd like to get
> confirmation (or perhaps I missed something). If it does support them,
> could you let me know how to specify multiple folders? Thanks.
>
> Senqiang
