Hi Sean,

The solution of instantiating the non-serializable class inside the map is working fine, but I've hit a roadblock: the solution is not working for singleton classes like UserGroupInformation.
In my logic, as part of processing an HDFS file, I need to refer to some reference files which are also available in HDFS. So inside the map method I'm trying to instantiate UserGroupInformation to get an instance of FileSystem; then, using this FileSystem instance, I read those reference files and use that data in my processing logic. This is throwing "Task not serializable" exceptions for the UserGroupInformation and FileSystem classes. I also tried using SparkHadoopUtil instead of UserGroupInformation, but it didn't resolve the issue. Request you to provide some pointers in this regard. (A minimal sketch of what I'm attempting is at the bottom of this mail.)

Also I have a query - when we instantiate a class inside the map method, does it create a new instance for every record of the RDD it is processing?

Thanks & Regards,
*Sarath*

On Sat, Sep 6, 2014 at 4:32 PM, Sean Owen <so...@cloudera.com> wrote:

> I disagree that the generally right change is to try to make the
> classes serializable. Usually, classes that are not serializable are
> not supposed to be serialized. You're using them in a way that's
> causing them to be serialized, and that's probably not desired.
>
> For example, this is wrong:
>
> val foo: SomeUnserializableManagerClass = ...
> rdd.map(d => foo.bar(d))
>
> This is right:
>
> rdd.map { d =>
>   val foo: SomeUnserializableManagerClass = ...
>   foo.bar(d)
> }
>
> In the first instance, you create the object on the driver and try to
> serialize and copy it to workers. In the second, you're creating
> SomeUnserializableManagerClass in the function and therefore on the
> worker.
>
> mapPartitions is better if this creation is expensive.
>
> On Fri, Sep 5, 2014 at 3:06 PM, Sarath Chandra
> <sarathchandra.jos...@algofusiontech.com> wrote:
> > Hi,
> >
> > I'm trying to migrate a map-reduce program to work with Spark. I
> > migrated the program from Java to Scala. The map-reduce program
> > basically loads an HDFS file and for each line in the file it applies
> > several transformation functions available in various external
> > libraries.
> >
> > When I execute this over Spark, it is throwing "Task not serializable"
> > exceptions for each and every class being used from these external
> > libraries. I included serialization for a few classes which are in my
> > scope, but there are several other classes which are out of my scope,
> > like org.apache.hadoop.io.Text.
> >
> > How to overcome these exceptions?
> >
> > ~Sarath.
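
Here is the sketch I mentioned above. The paths and the "matched/unmatched" enrichment are placeholders for my actual logic, 'sc' is the SparkContext (as in the shell), and it assumes the workers can resolve the HDFS configuration from their classpath or from a fully qualified hdfs:// URI:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

// Placeholder paths - substitute the actual input and reference locations.
val inputPath = "hdfs:///user/sarath/input.txt"
val refPath   = "hdfs:///user/sarath/reference.txt"

val result = sc.textFile(inputPath).mapPartitions { lines =>
  // Created on the worker, once per partition, so neither the
  // Configuration nor the FileSystem ever gets serialized on the driver.
  val fs = FileSystem.get(new Configuration())

  // Read the reference file once per partition and materialize it.
  val in = fs.open(new Path(refPath))
  val refData = Source.fromInputStream(in).getLines().toSet
  in.close()

  // Placeholder transformation - the real logic calls the external libraries.
  lines.map { line =>
    if (refData.contains(line)) line + ",matched" else line + ",unmatched"
  }
}

My understanding is that with mapPartitions these objects are created once per partition rather than once per record, but please correct me if that's wrong.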