Hi all,
I'm playing around with manipulating images in Python and want to
use Spark for scalability. That said, I'm just learning Spark, and my
Python is a bit rusty (I've been doing PHP for the last few years). I
think I have most of the process figured out. However, the script fails on
larger images, and Spark prints the following warning even for smaller
ones:
Stage 0 contains a task of very large size (1151 KB). The maximum
recommended task size is 100 KB.
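From what I can tell, the warning is about how much data gets serialized into
each task. One thing I did wonder is whether asking parallelize for more
partitions would shrink each task, something like this (the pixel list here is
just stand-in data, and the 100 is an arbitrary guess on my part, not a value
I know to be right):

from pyspark import SparkContext

sc = SparkContext(appName="Grayscale")
# Stand-in pixel data; in the real script this comes from PIL's im.getdata().
pixels = [(v % 256,) * 3 for v in range(100000)]
# Hypothetical tweak: pass an explicit partition count so each serialized
# task carries a smaller slice of the list.
img_data = sc.parallelize(pixels, 100)

But I wasn't sure that gets at the root cause.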
My code is as follows:
import Image
from pyspark import SparkContext

if __name__ == "__main__":
    imageFile = "sample.jpg"
    outFile = "sample.gray.jpg"

    sc = SparkContext(appName="Grayscale")

    im = Image.open(imageFile)

    # Create an RDD from the pixel data of the image file
    img_data = sc.parallelize( list(im.getdata()) )

    # Create an RDD of the grayscale value for each pixel
    gValue = img_data.map( lambda x: int(x[0]*0.21 + x[1]*0.72 + x[2]*0.07) )

    # Put our grayscale value into all three RGB channels
    grayscale = gValue.map( lambda x: (x, x, x) )

    # Save the output in a new image.
    im.putdata( grayscale.collect() )
    im.save(outFile)
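For context, in plain Python (no Spark) this is roughly the result I'm trying
to end up with, using the same luminosity weights:

import Image  # same module as above; newer Pillow installs use "from PIL import Image"

im = Image.open("sample.jpg")
# Apply the luminosity weights to every (R, G, B) pixel and repeat the
# result across all three channels to get a gray pixel.
gray = [(int(r*0.21 + g*0.72 + b*0.07),) * 3 for (r, g, b) in im.getdata()]
im.putdata(gray)
im.save("sample.gray.jpg")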
Obviously, something is amiss with the Spark version, but I can't figure out
where I'm off track. Any help is appreciated! Thanks in advance!!!