I am reading each file of a directory using wholeTextFiles, and then calling a function on each element of the RDD with map. The program only needs the first 50 lines of each file. The code is as below:

import sys
import csv
import StringIO
from pyspark import SparkContext

def processFiles(fileNameContentsPair):
    fileName = fileNameContentsPair[0]
    result = "\n\n" + fileName
    resultEr = "\n\n" + fileName
    input = StringIO.StringIO(fileNameContentsPair[1])
    reader = csv.reader(input, strict=True)
    try:
        i = 0
        for row in reader:
            if i == 50:
                break
            # do some processing and get result string
            i = i + 1
    except csv.Error as e:
        resultEr = resultEr + "error occurred\n\n"
        return resultEr
    return result

if __name__ == "__main__":
    inputFile = sys.argv[1]
    outputFile = sys.argv[2]
    sc = SparkContext(appName="SomeApp")
    resultRDD = sc.wholeTextFiles(inputFile).map(processFiles)
    resultRDD.saveAsTextFile(outputFile)

The files in the directory can be very large in my case, which makes wholeTextFiles inefficient here: it loads the full content of each file into memory even though I only need the first 50 lines. Is there a way to make wholeTextFiles load only the first 50 lines of each file? Apart from wholeTextFiles, the other solution I can think of is iterating over the files of the directory one by one, but that also seems inefficient (a sketch of what I mean is below). I am new to Spark. Please let me know if there is an efficient way to do this.
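For reference, this is roughly what I mean by iterating over the files one by one: a minimal sketch, assuming the files sit on a filesystem that every executor can open directly (a local or shared mount rather than HDFS), and using an illustrative helper name first_50_lines that is not part of any Spark API.

import glob
import sys
from itertools import islice
from pyspark import SparkContext

def first_50_lines(path):
    # Read the file lazily and keep only its first 50 lines in memory.
    with open(path) as f:
        return path, list(islice(f, 50))

if __name__ == "__main__":
    inputDir = sys.argv[1]
    outputDir = sys.argv[2]
    sc = SparkContext(appName="SomeApp")
    paths = glob.glob(inputDir + "/*")   # list the files on the driver
    resultRDD = sc.parallelize(paths).map(first_50_lines)
    resultRDD.saveAsTextFile(outputDir)

This avoids shipping whole file contents through wholeTextFiles, but each executor still has to open the files itself, which is why I am not sure it is any better.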