Try with a larger number of partitions in parallelize (see the sketch after the quoted message below).

On 4 Jun 2015 06:28, "Justin Spargur" <jmspar...@gmail.com> wrote:
> Hi all,
>
> I'm playing around with manipulating images via Python and want to
> utilize Spark for scalability. That said, I'm just learning Spark and my
> Python is a bit rusty (I've been doing PHP coding for the last few years).
> I think I have most of the process figured out. However, the script fails
> on larger images, and Spark sends out the following warning for smaller
> images:
>
> Stage 0 contains a task of very large size (1151 KB). The maximum
> recommended task size is 100 KB.
>
> My code is as follows:
>
> import Image
> from pyspark import SparkContext
>
> if __name__ == "__main__":
>
>     imageFile = "sample.jpg"
>     outFile = "sample.gray.jpg"
>
>     sc = SparkContext(appName="Grayscale")
>     im = Image.open(imageFile)
>
>     # Create an RDD for the data from the image file
>     img_data = sc.parallelize( list(im.getdata()) )
>
>     # Create an RDD for the grayscale value
>     gValue = img_data.map( lambda x: int(x[0]*0.21 + x[1]*0.72 + x[2]*0.07) )
>
>     # Put our grayscale value into the RGB channels
>     grayscale = gValue.map( lambda x: (x,x,x) )
>
>     # Save the output in a new image.
>     im.putdata( grayscale.collect() )
>
>     im.save(outFile)
>
> Obviously, something is amiss. However, I can't figure out where I'm off
> track with this. Any help is appreciated! Thanks in advance!!!
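To illustrate the suggestion: sc.parallelize takes an optional numSlices argument that controls how many partitions the input list is split into, so each task carries less data. This is only a minimal sketch against dummy pixel data; the value 64 is an arbitrary illustrative choice, not something prescribed by the thread.

    from pyspark import SparkContext

    sc = SparkContext(appName="Grayscale")

    # Hypothetical pixel tuples standing in for list(im.getdata()).
    pixels = [(r, g, b) for r in range(16) for g in range(16) for b in range(16)]

    # numSlices sets the number of partitions; a higher value means more,
    # smaller tasks instead of a few very large ones.
    img_data = sc.parallelize(pixels, numSlices=64)

    # Same grayscale transformation as in the original script.
    gValue = img_data.map(lambda x: int(x[0] * 0.21 + x[1] * 0.72 + x[2] * 0.07))
    grayscale = gValue.map(lambda x: (x, x, x))

    print(grayscale.take(5))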