I am reading each file of a directory using wholeTextFiles. After that I am
calling a function on each element of the RDD using map. The whole program
uses just the first 50 lines of each file. The code is as below:

import sys
import csv
import StringIO

from pyspark import SparkContext


def processFiles(fileNameContentsPair):
    fileName = fileNameContentsPair[0]
    result = "\n\n" + fileName
    resultEr = "\n\n" + fileName
    input = StringIO.StringIO(fileNameContentsPair[1])
    reader = csv.reader(input, strict=True)
    try:
        i = 0
        for row in reader:
            if i == 50:
                break
            # do some processing and get the result string
            i = i + 1
    except csv.Error as e:
        resultEr = resultEr + "error occurred\n\n"
        return resultEr
    return result


if __name__ == "__main__":
    inputFile = sys.argv[1]
    outputFile = sys.argv[2]
    sc = SparkContext(appName="SomeApp")
    resultRDD = sc.wholeTextFiles(inputFile).map(processFiles)
    resultRDD.saveAsTextFile(outputFile)
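(For reference, I run this with spark-submit, e.g.
spark-submit readfiles.py <inputDir> <outputDir>; the script name here is
just a placeholder.)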
The size of each file in the directory can be very large in my case, so using
the wholeTextFiles API is inefficient here: right now wholeTextFiles loads the
full file content into memory. Can we make wholeTextFiles load only the first
50 lines of each file? Apart from using wholeTextFiles, the other solution I
can think of is iterating over each file of the directory one by one (a rough
sketch of that idea follows below), but that also seems inefficient. I am new
to Spark. Please let me know if there is any efficient way to do this.
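For what it's worth, here is the rough sketch of the iterate-over-each-file
idea. It assumes the input directory is on a local or shared filesystem that
every executor can open directly, and it distributes the list of paths with
parallelize instead of literally looping one by one on the driver;
process_first_lines, max_rows and the path listing are placeholder names of
mine, not anything Spark provides:

import os
import sys
import csv
import itertools

from pyspark import SparkContext


def process_first_lines(path, max_rows=50):
    # Open the file lazily and read at most max_rows CSV rows;
    # the rest of the file is never pulled into memory.
    result = "\n\n" + path
    try:
        with open(path, "rb") as f:
            reader = csv.reader(f, strict=True)
            for row in itertools.islice(reader, max_rows):
                pass  # do some processing and build the result string
    except csv.Error:
        return "\n\n" + path + "error occurred\n\n"
    return result


if __name__ == "__main__":
    inputDir = sys.argv[1]
    outputFile = sys.argv[2]
    sc = SparkContext(appName="SomeApp")
    # Build an RDD of file paths instead of file contents.
    paths = [os.path.join(inputDir, name) for name in os.listdir(inputDir)]
    resultRDD = sc.parallelize(paths).map(process_first_lines)
    resultRDD.saveAsTextFile(outputFile)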


