Hi Anchit,

Thanks for the quick answer.
My exact question is: I want to add the HDFS location to each line of my JSON data.

> On Sep 25, 2015, at 11:25, Anchit Choudhry <anchit.choud...@gmail.com> wrote:
>
> Hi Fengdong,
>
> Thanks for your question.
>
> Spark already has a function called wholeTextFiles within sparkContext which
> can help you with that:
>
> Python
>
> hdfs://a-hdfs-path/part-00000
> hdfs://a-hdfs-path/part-00001
> ...
> hdfs://a-hdfs-path/part-nnnnn
>
> rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")
>
> (a-hdfs-path/part-00000, its content)
> (a-hdfs-path/part-00001, its content)
> ...
> (a-hdfs-path/part-nnnnn, its content)
>
> More info:
> http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=wholetext#pyspark.SparkContext.wholeTextFiles
>
> ------------
>
> Scala
>
> val rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")
>
> More info:
> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext@wholeTextFiles(String,Int):RDD[(String,String)]
>
> Let us know if this helps or you need more help.
>
> Thanks,
> Anchit Choudhry
>
> On 24 September 2015 at 23:12, Fengdong Yu <fengdo...@everstring.com> wrote:
> Hi,
>
> I have multiple files in JSON format, such as:
>
> /data/test1_data/sub100/test.data
> /data/test2_data/sub200/test.data
>
> I can read them with sc.textFile("/data/*/*"),
>
> but I want to add {"source" : "HDFS_LOCATION"} to each line, then save the
> result to a single target HDFS location.
>
> How can I do this? Thanks.
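
For the goal stated at the top of this message (adding the HDFS location to each JSON line), here is a minimal sketch of how wholeTextFiles might be used. It rests on several assumptions: each line is a single JSON object, each file is small enough to be held in memory as one string, and the glob pattern is accepted by wholeTextFiles in the same way as by textFile. The names tag_lines and add_source_and_save are hypothetical helpers, and the output directory is whatever target location the caller chooses.

```python
import json


def tag_lines(path, content):
    """Yield each non-empty JSON line of `content` with a "source" field set to `path`.

    Assumes every non-empty line is a JSON object (hypothetical helper).
    """
    for line in content.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        record["source"] = path
        yield json.dumps(record)


def add_source_and_save(sc, input_glob, output_dir):
    """Tag every line under `input_glob` with its file path and save to `output_dir`.

    `sc` is an existing SparkContext (hypothetical helper).
    """
    # wholeTextFiles returns (file_path, file_content) pairs, one per file
    pairs = sc.wholeTextFiles(input_glob)
    # Expand each file into its individual tagged JSON lines
    tagged = pairs.flatMap(lambda kv: tag_lines(kv[0], kv[1]))
    tagged.saveAsTextFile(output_dir)
```

With a SparkContext named sc, a call could look like add_source_and_save(sc, "/data/*/*", target_dir), where target_dir is the chosen output location. Because wholeTextFiles reads each file as one string, this is probably only suitable when the individual files are reasonably small.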