Those are empty partitions. I don't see the number of partitions specified
anywhere in the code, which implies the default parallelism config is being
used, and it is set to a very high number: the sum of the empty and non-empty
files.
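
A quick way to confirm and fix it, assuming the same variable names as in your
code below (the 16 partitions here is just an illustrative number; tune it to
the size of the 1% sample):

# how many partitions does the input RDD (and therefore the output) have?
print(tweetStrings.getNumPartitions())

# coalesce the small sample into a handful of partitions before saving, so
# saveAsTextFile writes a few part files instead of hundreds of thousands
sample = (tweetStrings.filter(removeEmptyLines)
          .sample(withReplacement=False, fraction=0.01, seed=345678)
          .coalesce(16))
sample.saveAsTextFile(saveDataURL)

coalesce avoids a full shuffle; repartition(16) would do the same thing with a
shuffle if you want the surviving records spread evenly across the output files.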

Regards
Sab
On 21-Nov-2015 11:59 pm, "Andy Davidson" <a...@santacruzintegration.com>
wrote:

> I started working on a very simple ETL pipeline for a POC. It reads in a
> data set of tweets stored as JSON strings in HDFS, randomly selects
> 1% of the observations, and writes them back to HDFS. It seems to run very
> slowly; e.g., writing 4720 observations takes 1:06:46.577795. I
> also noticed that RDD saveAsTextFile is creating thousands of empty files.
>
> I assume creating all these empty files must be slowing down the system. Any
> idea why this is happening? Do I have to write a script to periodically remove
> the empty files?
>
>
> Kind regards
>
> Andy
>
> import datetime
>
> tweetStrings = sc.textFile(inputDataURL)
>
> # accumulator used to count the blank lines we filter out
> emptyLineCount = sc.accumulator(0)
>
> def removeEmptyLines(line):
>     if line:
>         return True
>     else:
>         emptyLineCount.add(1)
>         return False
>
> sample = (tweetStrings.filter(removeEmptyLines)
>           .sample(withReplacement=False, fraction=0.01, seed=345678))
>
> startTime = datetime.datetime.now()
> sample.saveAsTextFile(saveDataURL)
>
> endTime = datetime.datetime.now()
> print("elapsed time:%s" % (endTime - startTime))
>
> elapsed time:1:06:46.577795
>
>
>
> *Total number of empty files*
>
> $ hadoop fs -du {saveDataURL} | grep '^0' | wc -l
>
> 223515
>
> *Total number of files with data*
>
> $ hadoop fs -du {saveDataURL} | grep -v '^0' | wc -l
>
> 4642
>
>
> I randomly picked a part file. Its size is 9251 bytes.
>
>
