Dear all,

I have a job that crashes before completing with "no space left on
device", and I noticed that this job generates a lot of temporary data on
my disk.

To be precise, the job is a simple map job that takes a set of images,
extracts local features and saves these local features as a sequence file.
My images are represented as key-value pairs where the keys are strings
representing the id of the image (the filename) and the values are the
base64 encoding of the images.
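
For reference, the images RDD could be built along these lines (a
simplified sketch, not my exact code; the path and the sc.binaryFiles
loading are illustrative):

import java.util.Base64
import org.apache.spark.rdd.RDD

// sc.binaryFiles yields (path, PortableDataStream) pairs;
// the file path serves as the image id
val images: RDD[(String, String)] =
  sc.binaryFiles("/path/to/images")
    .mapValues(stream => Base64.getEncoder.encodeToString(stream.toArray()))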

To extract the features, I use an external C program that I call with
RDD.pipe. I stream the base64 image to the C program and it sends back the
extracted feature vectors through stdout. Each line represents one feature
vector from the current image. I don't use any serialization library; I
just write the feature vector elements to stdout separated by spaces.
Once back in Spark, I split each line, create a Scala vector from the
values, and save my sequence file.
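
The parsing step itself is just a split (a sketch, assuming the Vector in
question is Scala's immutable Vector[Double]):

// one stdout line -> one feature vector
val parseLine: String => Vector[Double] =
  line => line.split(" ").map(_.toDouble).toVector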

The overall job looks like the following:

val images: RDD[(String, String)] = ...
val features: RDD[(String, Vector)] = images.pipe(...).map(_.split(" ")...)
features.saveAsSequenceFile(...)

The problem is that for about 3 GB of image data (about 12,000 images) this
job generates more than 180 GB of temporary data. This seems strange to me,
since for each image I have about 4,000 double feature vectors of dimension
400.
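
For scale, here is the raw size those numbers imply (back-of-envelope,
assuming 8 bytes per double and no serialization overhead):

// images * vectors/image * dimensions * bytes/double
val rawBytes = 12000L * 4000L * 400L * 8L  // ~1.5e11 bytes, i.e. ~154 GB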

I run the job on my laptop for test purposes, which is why I can't add
additional disk space. In any case, I need to understand why this simple
job generates so much data and how I can reduce it.


Best,

Jao
