Those are empty partitions. I don't see the number of partitions specified anywhere in the code, which implies the default parallelism config is being used, and it is set to a very high number. saveAsTextFile writes one part file per partition, so the partition count here is the sum of the empty and non-empty output files.
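If that is the case, collapsing the partitions before writing should fix both the slowness and the empty files. A rough sketch, reusing the names from your code below (sample and saveDataURL are yours; the target of 16 partitions is just an example, size it to your data):

# Most partitions are empty after the filter and the 1% sample.
# Collapse them so saveAsTextFile writes a handful of part files
# instead of thousands of empty ones. 16 is an arbitrary example;
# pick a number that gives part files of a reasonable size.
sample.coalesce(16).saveAsTextFile(saveDataURL)

coalesce avoids a full shuffle; if you also want the surviving records spread evenly across the output files, use repartition(16) instead.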
Regards
Sab

On 21-Nov-2015 11:59 pm, "Andy Davidson" <a...@santacruzintegration.com> wrote:

> I started working on a very simple ETL pipeline for a POC. It reads in a
> data set of tweets stored as JSON strings in HDFS, randomly selects 1% of
> the observations, and writes them back to HDFS. It seems to run very
> slowly: writing 4720 observations takes 1:06:46.577795. I also noticed
> that RDD saveAsTextFile is creating thousands of empty files.
>
> I assume creating all these empty files must be slowing down the system.
> Any idea why this is happening? Do I have to write a script to
> periodically remove empty files?
>
> Kind regards
>
> Andy
>
> tweetStrings = sc.textFile(inputDataURL)
>
> def removeEmptyLines(line):
>     if line:
>         return True
>     else:
>         emptyLineCount.add(1)
>         return False
>
> emptyLineCount = sc.accumulator(0)
> sample = (tweetStrings.filter(removeEmptyLines)
>           .sample(withReplacement=False, fraction=0.01, seed=345678))
>
> startTime = datetime.datetime.now()
> sample.saveAsTextFile(saveDataURL)
>
> endTime = datetime.datetime.now()
> print("elapsed time:%s" % (endTime - startTime))
>
> elapsed time:1:06:46.577795
>
>
> *Total number of empty files*
>
> $ hadoop fs -du {saveDataURL} | grep '^0' | wc -l
>
> 223515
>
> *Total number of files with data*
>
> $ hadoop fs -du {saveDataURL} | grep -v '^0' | wc -l
>
> 4642
>
>
> I randomly picked a part file. Its size is 9251.
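You can also cross-check the theory against the numbers above: since saveAsTextFile writes one part file per partition, the partition count of the sampled RDD should land right around 223515 + 4642 = 228157 (the _SUCCESS marker file is extra). A quick check, again reusing the names from the quoted code; filter and sample preserve whatever partitioning tweetStrings started with:

# Fallback parallelism Spark uses when no partition count is given.
print("default parallelism: %d" % sc.defaultParallelism)
# Number of partitions, and therefore part files, that will be written.
print("sample partitions:   %d" % sample.getNumPartitions())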