reduceByKey and empty output files

2014-11-27 Thread Praveen Sripati
Hi, When I run the below program, I see two files in HDFS because the number of partitions is 2. But one of the files is empty. Why is it so? Is the work not distributed equally to all the tasks? textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a,
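
A minimal, runnable sketch of the job in question (input and output paths are hypothetical). With numPartitions=2, reduceByKey hash-partitions the keys; if every key happens to hash to the same partition, the other partition writes an empty part- file, which is expected behavior rather than uneven scheduling:

    from pyspark import SparkContext

    sc = SparkContext(appName="WordCount")

    counts = (sc.textFile("hdfs:///input/words.txt")        # hypothetical path
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b, numPartitions=2))

    # saveAsTextFile writes one part- file per partition; a partition that
    # received no keys from the hash partitioner produces an empty file.
    counts.saveAsTextFile("hdfs:///output/counts")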

Number of executors and tasks

2014-11-26 Thread Praveen Sripati
Hi, I am running Spark in standalone mode. 1) I have a file of 286 MB in HDFS (block size is 64 MB), so it is split into 5 blocks. When the file is in HDFS, 5 tasks are generated and so 5 files in the output. My understanding is that there will be a separate partition for each block and th
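
A quick way to confirm the block-to-partition mapping (file path hypothetical): Spark creates roughly one input partition per HDFS block, so a 286 MB file with a 64 MB block size yields 5 partitions and hence 5 tasks and 5 output files:

    rdd = sc.textFile("hdfs:///input/data-286mb.txt")  # hypothetical path
    print(rdd.getNumPartitions())  # one partition per HDFS block -> 5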

Spark on YARN - master role

2014-11-25 Thread Praveen Sripati
Hi, In Spark on YARN, the AM (driver) will ask the RM for resources. Once the resources are allocated by the RM, the AM will start the executors through the NM. This is my understanding. But, according to the Spark documentation (1), the `spark.yarn.applicationMaster.waitTries` property spe
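
For illustration only, a sketch of configuring YARN mode and the property in question from PySpark (a Spark 1.x-era setting; in practice these values are usually passed via spark-submit rather than hard-coded, and the value "10" here is just an example):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("yarn-client")   # Spark 1.x YARN client mode
            .setAppName("YarnExample")
            # how many times the AM waits for the SparkContext to initialize
            .set("spark.yarn.applicationMaster.waitTries", "10"))

    sc = SparkContext(conf=conf)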

Re: Error while calculating the max temperature

2014-09-22 Thread Praveen Sripati
If your map() sometimes does not emit an element, then you need to call flatMap() instead, and emit Some(value) (or any collection of values) if there is an element to return, or None otherwise. On Mon, Sep 22, 2014 at 4:50 PM, Praveen Sripati wrote: > During the
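
Since the thread is about PySpark, the Scala Some/None advice translates to returning a list of zero or one elements from flatMap; a sketch with a hypothetical parse_record helper:

    def parse_record(line):
        fields = line.split(",")                  # hypothetical record layout
        if len(fields) < 2 or fields[1] == "":
            return []                             # emit nothing for a bad record
        return [(fields[0], int(fields[1]))]      # emit exactly one pair

    pairs = textFile.flatMap(parse_record)        # no None values reach Spark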

Re: Error while calculating the max temperature

2014-09-22 Thread Praveen Sripati
not able to handle the None record as input. How to get around this? Thanks, Praveen On Mon, Sep 22, 2014 at 6:09 PM, Praveen Sripati wrote: > Hi, I am writing a Spark program in Python to find the maximum temperature for a year, given a weather dataset. The below program thr

Error while calculating the max temperature

2014-09-22 Thread Praveen Sripati
Hi, I am writing a Spark program in Python to find the maximum temperature for a year, given a weather dataset. The program below throws an error when I try to execute it. TypeError: 'NoneType' object is not iterable org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.sc
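
The error means a map() function returned None and Spark then tried to iterate over it. A sketch of the fix using flatMap, assuming hypothetical NCDC-style fixed-width records (the offsets, the +9999 missing-value sentinel, and the input path are assumptions, not from the original post):

    from pyspark import SparkContext

    sc = SparkContext(appName="MaxTemperature")

    def extract(line):
        year, temp = line[15:19], line[87:92]   # assumed fixed-width offsets
        if temp != "+9999":                     # assumed missing-value sentinel
            return [(year, int(temp))]
        return []                               # skip the record entirely

    max_temps = (sc.textFile("hdfs:///weather/input")   # hypothetical path
                   .flatMap(extract)                    # never yields None
                   .reduceByKey(max))

    print(max_temps.collect())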