In your case, maybe you can try calling the function coalesce?
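
In case it helps, a rough sketch (assuming sample is the sampled RDD from
your code below; the target of 8 partitions is only illustrative):

# coalesce() merges the existing partitions without a full shuffle, so
# saveAsTextFile() writes a handful of part files instead of thousands
fewer = sample.coalesce(8)
fewer.saveAsTextFile(saveDataURL)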

Good luck,

Xiao Li

2015-11-23 12:15 GMT-08:00 Andy Davidson <a...@santacruzintegration.com>:

> Hi Sabarish
>
> I am but a simple padawan :-) I do not understand your answer. Why would
> Spark be creating so many empty partitions? My real problem is that my
> application is very slow. I happened to notice thousands of empty files
> being created and thought this might be a hint as to why my app is slow.
>
> My program calls sample(0.01).filter(not null).saveAsTextFile(). This
> takes about 35 min to scan 500,000 JSON strings and write 5,000 of them to
> disk. The total data written is 38M.
>
> The data is read from HDFS. My understanding is that Spark cannot know in
> advance how HDFS partitioned the data. Spark knows I have a master and 3
> slave machines. It knows how many workers/executors are assigned to my
> job. I would expect Spark to be smart enough not to create more partitions
> than I have worker machines?
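>
> If it helps, here is roughly what I could run to check how many partitions
> Spark actually created (just a sketch; raw is only a local name for the RDD
> read from inputDataURL):
>
> raw = sc.textFile(inputDataURL)
> print("input partitions: %d" % raw.getNumPartitions())
> print("default parallelism: %d" % sc.defaultParallelism)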
>
> Also, given that I am not using any key/value operations like join() or
> doing multiple scans, I would assume my app would not benefit from
> partitioning.
>
>
> Kind regards
>
> Andy
>
>
> From: Sabarish Sasidharan <sabarish.sasidha...@manthan.com>
> Date: Saturday, November 21, 2015 at 7:20 PM
> To: Andrew Davidson <a...@santacruzintegration.com>
> Cc: "user @spark" <user@spark.apache.org>
> Subject: Re: newbie : why are thousands of empty files being created on
> HDFS?
>
> Those are empty partitions. I don't see the number of partitions specified
> in your code. That implies the default parallelism config is being used and
> is set to a very high number, equal to the sum of empty + non-empty files.
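>
> One quick way to confirm is to print sc.defaultParallelism, and if it is
> huge, lower it (or coalesce before the save). A rough sketch, not taken
> from your job; 16 is only an example value:
>
> from pyspark import SparkConf, SparkContext
>
> conf = SparkConf().set("spark.default.parallelism", "16")
> sc = SparkContext(conf=conf)
> print("default parallelism: %d" % sc.defaultParallelism)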
>
> Regards
> Sab
> On 21-Nov-2015 11:59 pm, "Andy Davidson" <a...@santacruzintegration.com>
> wrote:
>
>> I started working on a very simple ETL pipeline for a POC. It reads in a
>> data set of tweets stored as JSON strings in HDFS, randomly selects 1% of
>> the observations, and writes them back to HDFS. It seems to run very
>> slowly, e.g. writing 4720 observations takes 1:06:46.577795. I also
>> noticed that RDD saveAsTextFile is creating thousands of empty files.
>>
>> I assume creating all these empty files must be slowing down the system.
>> Any idea why this is happening? Do I have to write a script to
>> periodically remove empty files?
>>
>>
>> Kind regards
>>
>> Andy
>>
>> import datetime
>>
>> tweetStrings = sc.textFile(inputDataURL)
>>
>> # count the empty input lines as a side effect of filtering them out
>> emptyLineCount = sc.accumulator(0)
>>
>> def removeEmptyLines(line):
>>     if line:
>>         return True
>>     else:
>>         emptyLineCount.add(1)
>>         return False
>>
>> sample = (tweetStrings.filter(removeEmptyLines)
>>           .sample(withReplacement=False, fraction=0.01, seed=345678))
>>
>> startTime = datetime.datetime.now()
>> sample.saveAsTextFile(saveDataURL)
>> endTime = datetime.datetime.now()
>> print("elapsed time:%s" % (endTime - startTime))
>>
>> elapsed time:1:06:46.577795
>>
>>
>>
>> *Total number of empty files*
>>
>> $ hadoop fs -du {saveDataURL} | grep '^0' | wc -l
>>
>> 223515
>>
>> *Total number of files with data*
>>
>> $ hadoop fs -du {saveDataURL} | grep -v '^0' | wc -l
>>
>> 4642
>>
>>
>> I randomly picked a part file. Its size is 9251.
>>
>>
