I have started working on a very simple ETL pipeline for a POC. It reads in a data set of tweets stored as JSON strings in HDFS, randomly selects 1% of the observations, and writes them back to HDFS. It seems to run very slowly; e.g., writing 4720 observations takes 1:06:46.577795. I also noticed that RDD saveAsTextFile is creating thousands of empty files.
I assume creating all these empty files must be slowing down the system. Any idea why this is happening? Do I have to write a script to periodically remove the empty files?

Kind regards

Andy

import datetime

tweetStrings = sc.textFile(inputDataURL)

emptyLineCount = sc.accumulator(0)

def removeEmptyLines(line):
    if line:
        return True
    else:
        emptyLineCount.add(1)
        return False

sample = (tweetStrings.filter(removeEmptyLines)
          .sample(withReplacement=False, fraction=0.01, seed=345678))

startTime = datetime.datetime.now()
sample.saveAsTextFile(saveDataURL)
endTime = datetime.datetime.now()
print("elapsed time:%s" % (endTime - startTime))

elapsed time:1:06:46.577795

Total number of empty files

$ hadoop fs -du {saveDataURL} | grep '^0' | wc -l
223515

Total number of files with data

$ hadoop fs -du {saveDataURL} | grep -v '^0' | wc -l
4642

I randomly picked a part file. Its size is 9251 bytes.
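If the empty part files come from input partitions where the filter and the 1% sample removed every record, I wonder whether coalescing the sampled RDD down to a handful of partitions before saving would avoid them. A minimal sketch of what I have in mind (the partition count of 16 is just a placeholder I have not tested):

# Hypothetical sketch: collapse the sampled RDD into a few partitions so that
# saveAsTextFile writes one part file per remaining partition instead of one
# per original input split. numOutputPartitions = 16 is a guess, not a tuned value.
numOutputPartitions = 16
smallSample = sample.coalesce(numOutputPartitions)
smallSample.saveAsTextFile(saveDataURL)

As far as I understand, coalesce() avoids a full shuffle, so it should be cheap relative to the rest of the job.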