I started working on a very simple ETL pipeline for a POC. It reads in a
data set of tweets stored as JSON strings in HDFS, randomly selects 1% of
the observations, and writes them back to HDFS. It seems to run very slowly:
e.g., writing 4720 observations took 1:06:46.577795. I also noticed that
RDD saveAsTextFile is creating thousands of empty files.

I assume creating all these empty files must be slowing things down. Any
idea why this is happening? Do I have to write a script to periodically
remove the empty files?
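
My working assumption is that saveAsTextFile writes one part file per RDD
partition, and since the sample keeps only 1% of the rows, most partitions
end up empty. One thing I'm considering trying (untested sketch, reusing the
sample RDD and saveDataURL from the code below; the partition count of 16 is
an arbitrary guess) is coalescing before the save:

# Untested sketch: merge the sampled RDD into a small number of partitions
# so saveAsTextFile writes only that many part files.
# 16 partitions is an arbitrary guess, not something I have tuned.
sample.coalesce(16).saveAsTextFile(saveDataURL)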


Kind regards

Andy

import datetime

tweetStrings = sc.textFile(inputDataURL)

# count the blank lines dropped by the filter
emptyLineCount = sc.accumulator(0)

def removeEmptyLines(line):
    if line:
        return True
    else:
        emptyLineCount.add(1)
        return False

# keep non-empty lines, then take a 1% sample without replacement
sample = (tweetStrings.filter(removeEmptyLines)
          .sample(withReplacement=False, fraction=0.01, seed=345678))

startTime = datetime.datetime.now()
sample.saveAsTextFile(saveDataURL)
endTime = datetime.datetime.now()
print("elapsed time:%s" % (endTime - startTime))

elapsed time:1:06:46.577795


Total number of empty files
$ hadoop fs -du {saveDataURL} | grep '^0' | wc -l
223515

Total number of files with data
$ hadoop fs -du {saveDataURL} | grep -v '^0' | wc -l
4642



I randomly picked a part file; its size is 9251 bytes.

