Re: serialization issue

2015-08-13 Thread Anish Haldiya
While submitting the job, you can use the --jars, --driver-class-path, etc. options to add the jar. Apart from that, if you are running the job as a standalone application, you can use sc.addJar to add the jar (which will ship it to all the executors). Regards, Anish
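A minimal sketch of the two submit-time options mentioned above. The class name, jar paths, and master are hypothetical placeholders; requires a Spark installation with spark-submit on the PATH.

```shell
# Ship dependency jars to the executors and put them on the driver classpath.
# com.example.MyJob, my-app.jar and the dep*.jar paths are placeholders.
spark-submit \
  --class com.example.MyJob \
  --master yarn \
  --jars /path/to/dep1.jar,/path/to/dep2.jar \
  --driver-class-path /path/to/dep1.jar:/path/to/dep2.jar \
  my-app.jar
```

For a standalone application, the programmatic equivalent is sc.addJar("/path/to/dep1.jar") after creating the SparkContext.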

Re: Reduce number of partitions before saving to file. coalesce or repartition?

2015-08-13 Thread Anish Haldiya
Hi, If you are decreasing the number of partitions in this RDD, consider using coalesce, which can avoid performing a shuffle. However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1).
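A short sketch of the difference, assuming a local Spark runtime on the classpath (the app name and output path are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical local setup for illustration.
val conf = new SparkConf().setAppName("coalesce-example").setMaster("local[4]")
val sc = new SparkContext(conf)

val rdd = sc.parallelize(1 to 1000, numSlices = 8)

// Decreasing partitions: coalesce merges existing partitions and
// can avoid a full shuffle.
val fewer = rdd.coalesce(2)      // 8 -> 2 partitions, no shuffle

// Increasing (or rebalancing) partitions requires a shuffle:
// repartition(n) is equivalent to coalesce(n, shuffle = true).
val more = rdd.repartition(16)   // 8 -> 16 partitions, full shuffle

fewer.saveAsTextFile("output-dir")  // writes one part file per partition
```

Passing shuffle = true to coalesce forces an even redistribution of the data, at the cost of the shuffle it would otherwise avoid.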

Re: java.io.NotSerializableException: org.apache.avro.mapred.AvroKey using spark with avro

2014-12-18 Thread Anish Haldiya
Hi, I had the same problem. One option (starting with Spark 1.2, which is currently in preview) is to use the Avro library for Spark SQL. The other is using Kryo serialization. By default, Spark uses Java serialization; you can specify Kryo serialization while creating the Spark context: val conf = new SparkConf().set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
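A sketch of the Kryo option, assuming a Spark runtime and the Avro library on the classpath (the app name is a placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Switch from the default Java serialization to Kryo, which can also
// handle classes (like AvroKey) that are not java.io.Serializable.
val conf = new SparkConf()
  .setAppName("avro-kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Optionally register frequently serialized classes for a more
  // compact serialized form (Spark 1.2+):
  .set("spark.kryo.classesToRegister", "org.apache.avro.mapred.AvroKey")

val sc = new SparkContext(conf)
```

The same two properties can also be set at submit time with --conf spark.serializer=... rather than in code.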