Hi,

On Fri, Dec 26, 2014 at 1:32 AM, ey-chih chow <eyc...@hotmail.com> wrote:
>
> I got some issues with mapPartitions with the following piece of code:
>
>     val sessions = sc
>       .newAPIHadoopFile(
>         "... path to an avro file ...",
>         classOf[org.apache.avro.mapreduce.AvroKeyInputFormat[ByteBuffer]],
>         classOf[AvroKey[ByteBuffer]],
>         classOf[NullWritable],
>         job.getConfiguration())
>       .mapPartitions { valueIterator =>
>         val config = job.getConfiguration()
>                          ...
>       }
>       .collect()
>
> Why does job.getConfiguration() inside the mapPartitions function generate
> the following message?
>
> Cause: java.io.NotSerializableException: org.apache.hadoop.mapreduce.Job
>

The function you pass to mapPartitions() is executed on the Spark
executors, not on the Spark driver. Its closure, including every object it
captures, therefore has to be serialized and shipped to the executors over
the network. If any captured object is not serializable (here the closure
captures `job`, and org.apache.hadoop.mapreduce.Job does not implement
Serializable), you get a NotSerializableException. The call to
job.getConfiguration() inside newAPIHadoopFile works because that call
happens on the driver, before anything is shipped anywhere.
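
The usual fix is to read whatever you need from the Configuration on the
driver, into plain serializable values, and capture only those inside the
closure. A minimal sketch, assuming you only need a single config value
(the key "my.option" and the byte-copying body are hypothetical, just for
illustration):

    import java.nio.ByteBuffer
    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.AvroKeyInputFormat
    import org.apache.hadoop.io.NullWritable

    // On the driver: pull what you need out of the Job up front.
    val conf = job.getConfiguration()      // fine, runs on the driver
    val myOption = conf.get("my.option")   // a plain String serializes fine

    val sessions = sc
      .newAPIHadoopFile(
        "... path to an avro file ...",
        classOf[AvroKeyInputFormat[ByteBuffer]],
        classOf[AvroKey[ByteBuffer]],
        classOf[NullWritable],
        conf)
      .mapPartitions { valueIterator =>
        // Only `myOption` (a String) is captured here, not `job`.
        valueIterator.map { case (avroKey, _) =>
          val bb = avroKey.datum()
          // Copy into an Array[Byte]: AvroKey and ByteBuffer themselves
          // are not serializable, so don't let them reach collect().
          val bytes = new Array[Byte](bb.remaining())
          bb.get(bytes)
          bytes
        }
      }
      .collect()

If you really need the whole Configuration on the executors, you can also
wrap it (e.g. in Spark's SerializableWritable, since Configuration is a
Writable) and broadcast it, but extracting the few values you actually use
is usually simpler.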

Tobias
