Re: How to get the HDFS path for each RDD

Fengdong Yu Thu, 24 Sep 2015 20:45:19 -0700

Hi Anchit, 

Thanks for the quick answer.


My exact question is: I want to add the HDFS location to each line of my JSON 
data.
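
To make the goal concrete, here is a minimal sketch of one way it might be done on top of wholeTextFiles. It assumes each file holds one JSON object per line and is small enough to be read whole, since wholeTextFiles loads every file as a single string. The helper names, the sc argument, and the paths are placeholders for illustration, not anything confirmed in this thread.

```python
import json


def tag_lines_with_source(path_and_content):
    """Turn one (path, content) pair into JSON lines carrying a "source" field."""
    path, content = path_and_content
    tagged = []
    for line in content.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)  # assumption: one JSON object per line
        record["source"] = path    # the file path reported by wholeTextFiles
        tagged.append(json.dumps(record))
    return tagged


def add_source_and_save(sc, input_glob, output_dir):
    """Hypothetical driver: sc is an existing SparkContext, paths are placeholders."""
    pairs = sc.wholeTextFiles(input_glob)  # RDD of (path, content) pairs
    pairs.flatMap(tag_lines_with_source).saveAsTextFile(output_dir)
```

Something like add_source_and_save(sc, "/data/*/*", "/data/merged") would then write every tagged line under one target directory. For large files this approach is probably a poor fit, because each file has to fit in memory as one string.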


> On Sep 25, 2015, at 11:25, Anchit Choudhry <anchit.choud...@gmail.com> wrote:
> 
> Hi Fengdong,
> 
> Thanks for your question.
> 
> Spark already has a function called wholeTextFiles within sparkContext which 
> can help you with that:
> 
> Python
> 
> If the directory contains these files:
> hdfs://a-hdfs-path/part-00000
> hdfs://a-hdfs-path/part-00001
> ...
> hdfs://a-hdfs-path/part-nnnnn
> then
> rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")
> gives an RDD with one (path, content) pair per file:
> (a-hdfs-path/part-00000, its content)
> (a-hdfs-path/part-00001, its content)
> ...
> (a-hdfs-path/part-nnnnn, its content)
> 
> More info: 
> http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=wholetext#pyspark.SparkContext.wholeTextFiles
> 
> ------------
> 
> Scala
> 
> val rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")
> 
> More info: 
> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext@wholeTextFiles(String,Int):RDD[(String,String)]
>  
> Let us know if this helps or you need more help.
> 
> Thanks,
> Anchit Choudhry
> 
> On 24 September 2015 at 23:12, Fengdong Yu <fengdo...@everstring.com> wrote:
> Hi,
> 
> I have multiple files in JSON format, such as:
> 
> /data/test1_data/sub100/test.data
> /data/test2_data/sub200/test.data
> 
> 
> I can read them with sc.textFile("/data/*/*"),
> 
> but I want to add {"source": "HDFS_LOCATION"} to each line, then save the 
> result to a single target HDFS location.
> 
> How can I do this? Thanks.
