In your case, maybe you can try to call coalesce() to reduce the number of output partitions before saving?
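A rough, untested sketch of that suggestion, reusing the names from the code quoted below (the choice of 16 output partitions is only illustrative, not a recommendation):

    # sketch only: coalesce the sampled RDD before writing so saveAsTextFile
    # emits a handful of part files instead of one per (mostly empty) partition
    numOutputPartitions = 16   # illustrative; pick something near your executor/core count
    sample = (tweetStrings.filter(removeEmptyLines)
              .sample(withReplacement=False, fraction=0.01, seed=345678)
              .coalesce(numOutputPartitions))
    sample.saveAsTextFile(saveDataURL)

saveAsTextFile writes one part file per partition of the RDD it is given, so coalescing the sampled RDD down to a handful of partitions should make the thousands of empty part files go away.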
Good luck,

Xiao Li

2015-11-23 12:15 GMT-08:00 Andy Davidson <a...@santacruzintegration.com>:

> Hi Sabarish
>
> I am but a simple padawan :-) I do not understand your answer. Why would
> Spark be creating so many empty partitions? My real problem is that my
> application is very slow. I happened to notice thousands of empty files
> being created and thought that might be a hint to why my app is slow.
>
> My program calls sample(0.01).filter(not null).saveAsTextFile(). This
> takes about 35 min to scan 500,000 JSON strings and write 5,000 to disk.
> The total data written is 38 MB.
>
> The data is read from HDFS. My understanding is that Spark cannot know in
> advance how HDFS partitioned the data. Spark knows I have a master and 3
> slave machines, and it knows how many workers/executors are assigned to my
> job. I would expect Spark to be smart enough not to create more partitions
> than I have worker machines?
>
> Also, given that I am not using any key/value operations like join() or
> doing multiple scans, I would assume my app would not benefit from
> partitioning.
>
> Kind regards
>
> Andy
>
> From: Sabarish Sasidharan <sabarish.sasidha...@manthan.com>
> Date: Saturday, November 21, 2015 at 7:20 PM
> To: Andrew Davidson <a...@santacruzintegration.com>
> Cc: "user @spark" <user@spark.apache.org>
> Subject: Re: newbie : why are thousands of empty files being created on HDFS?
>
> Those are empty partitions. I don't see the number of partitions specified
> in the code. That implies the default parallelism config is being used and
> is set to a very high number: the sum of empty + non-empty files.
>
> Regards
> Sab
>
> On 21-Nov-2015 11:59 pm, "Andy Davidson" <a...@santacruzintegration.com>
> wrote:
>
>> I started working on a very simple ETL pipeline for a POC. It reads in a
>> data set of tweets stored as JSON strings in HDFS, randomly selects 1% of
>> the observations, and writes them to HDFS. It seems to run very slowly,
>> e.g. writing 4720 observations takes 1:06:46.577795. I also noticed that
>> RDD saveAsTextFile is creating thousands of empty files.
>>
>> I assume creating all these empty files must be slowing down the system.
>> Any idea why this is happening? Do I have to write a script to
>> periodically remove the empty files?
>>
>> Kind regards
>>
>> Andy
>>
>> import datetime
>>
>> tweetStrings = sc.textFile(inputDataURL)
>>
>> emptyLineCount = sc.accumulator(0)
>>
>> def removeEmptyLines(line):
>>     if line:
>>         return True
>>     else:
>>         emptyLineCount.add(1)
>>         return False
>>
>> sample = (tweetStrings.filter(removeEmptyLines)
>>           .sample(withReplacement=False, fraction=0.01, seed=345678))
>>
>> startTime = datetime.datetime.now()
>> sample.saveAsTextFile(saveDataURL)
>> endTime = datetime.datetime.now()
>>
>> print("elapsed time:%s" % (endTime - startTime))
>>
>> elapsed time:1:06:46.577795
>>
>> Total number of empty files:
>>
>> $ hadoop fs -du {saveDataURL} | grep '^0' | wc -l
>> 223515
>>
>> Total number of files with data:
>>
>> $ hadoop fs -du {saveDataURL} | grep -v '^0' | wc -l
>> 4642
>>
>> I randomly picked a part file; its size is 9251 bytes.
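A quick, untested way to double-check Sab's diagnosis is to print the partition count of the RDD being saved and compare it with the part-file counts above (223515 empty + 4642 non-empty); getNumPartitions() and sc.defaultParallelism are the calls assumed here:

    # how many part files would saveAsTextFile produce, and what is the default parallelism?
    print("sample partitions: %d" % sample.getNumPartitions())
    print("default parallelism: %d" % sc.defaultParallelism)

If the first number is in the hundreds of thousands, it matches the empty-file count, and coalescing before saveAsTextFile (as suggested at the top of this thread) should help.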