Hi Fengdong,

Thanks for your question.

Spark already has a method called wholeTextFiles on the SparkContext
which can help you with that:

Python

If you have the following files:

hdfs://a-hdfs-path/part-00000
hdfs://a-hdfs-path/part-00001
...
hdfs://a-hdfs-path/part-nnnnn

then

rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")

gives you an RDD of (file path, file content) pairs:

(a-hdfs-path/part-00000, its content)
(a-hdfs-path/part-00001, its content)
...
(a-hdfs-path/part-nnnnn, its content)

More info: http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=wholetext#pyspark.SparkContext.wholeTextFiles
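
To do what you described in your mail, you can map over those (file path, file
content) pairs, add the source to every JSON line, and save everything to one
location. Here is a rough sketch, not the only way to do it; it assumes each
file holds one JSON object per line, that sc is your SparkContext, and that the
hdfs://namenode/... paths are just placeholders for your own paths:

import json

# (file path, whole-file content) pairs for every file matched by the glob
pairs = sc.wholeTextFiles("hdfs://namenode/data/*/*")

# split each file into lines, parse each line, add {"source": <file path>},
# and re-serialize back to a JSON string
tagged = pairs.flatMap(
    lambda kv: [
        json.dumps(dict(json.loads(line), source=kv[0]))
        for line in kv[1].splitlines()
        if line.strip()
    ]
)

# write everything out to a single target location
tagged.saveAsTextFile("hdfs://namenode/output/with_source")

flatMap gives you one output record per input line, and the file path that
wholeTextFiles returns as the key ends up as the value of "source".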

------------

Scala

val rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")

More info: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext@wholeTextFiles(String,Int):RDD[(String,String)]

Let us know if this helps or if you need more help.

Thanks,
Anchit Choudhry

On 24 September 2015 at 23:12, Fengdong Yu <fengdo...@everstring.com> wrote:

> Hi,
>
> I have multiple files in JSON format, such as:
>
> /data/test1_data/sub100/test.data
> /data/test2_data/sub200/test.data
>
>
> I can do sc.textFile("/data/*/*")
>
> but I want to add {"source": "HDFS_LOCATION"} to each line, then save
> it to one target HDFS location.
>
> How can I do that? Thanks.
>
