Hi Fengdong,

Thanks for your question.
Spark already has a function called wholeTextFiles on SparkContext which can help you with that.

Python:

If your files look like

  hdfs://a-hdfs-path/part-00000
  hdfs://a-hdfs-path/part-00001
  ...
  hdfs://a-hdfs-path/part-nnnnn

then

  rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")

returns an RDD of (path, content) pairs:

  (a-hdfs-path/part-00000, its content)
  (a-hdfs-path/part-00001, its content)
  ...
  (a-hdfs-path/part-nnnnn, its content)

More info:
http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=wholetext#pyspark.SparkContext.wholeTextFiles

------------

Scala:

  val rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")

More info:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext@wholeTextFiles(String,Int):RDD[(String,String)]

Since every record already carries the path of the file it came from, you can use that to tag each line with its source before writing to the target location; see the sketch after the quoted message below.

Let us know if this helps or if you need more help.

Thanks,
Anchit Choudhry

On 24 September 2015 at 23:12, Fengdong Yu <fengdo...@everstring.com> wrote:
> Hi,
>
> I have multiple files in JSON format, such as:
>
> /data/test1_data/sub100/test.data
> /data/test2_data/sub200/test.data
>
> I can do sc.textFile("/data/*/*"),
>
> but I want to add {"source": "HDFS_LOCATION"} to each line and then save
> it to one target HDFS location.
>
> How can I do that? Thanks.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
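
P.S. For the original question (tagging each JSON line with the file it came from, then writing everything to a single location), a minimal PySpark sketch along these lines might work. The output path and app name are placeholders I made up, not paths from the thread, and the input glob is the one from the original mail:

  from pyspark import SparkContext
  import json

  sc = SparkContext(appName="TagWithSource")

  # Read every file under /data/*/* as (file path, whole-file content) pairs.
  pairs = sc.wholeTextFiles("/data/*/*")

  # Split each file into lines, parse each JSON line, and add its source path.
  def tag_lines(path_and_content):
      path, content = path_and_content
      for line in content.splitlines():
          if line.strip():
              record = json.loads(line)
              record["source"] = path
              yield json.dumps(record)

  tagged = pairs.flatMap(tag_lines)

  # Save all tagged lines under one target HDFS location (placeholder path).
  tagged.saveAsTextFile("hdfs://namenode/data/tagged_output")

Note that wholeTextFiles loads each file entirely into memory as one record, so this approach is best suited to files that are individually small.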