Hi all,

I have a piece of code written in Spark that loads data from HDFS into Java classes generated from an Avro IDL. On the RDD created that way I execute a simple operation whose result depends on whether I cache the RDD first. That is, if I run the code below
    val loadedData = loadFromHDFS[Data](path,...)
    println(loadedData.map(x => x.getUserId + x.getDate).distinct().count()) // 200000

the program prints 200000. On the other hand, executing the next code

    val loadedData = loadFromHDFS[Data](path,...).cache()
    println(loadedData.map(x => x.getUserId + x.getDate).distinct().count()) // 1

results in 1 being printed to stdout. When I inspect the values of the fields after reading the cached data, all the records seem to contain the same values. I am pretty sure that the root cause of the described problem is an issue with serialization of the classes generated from the Avro IDL, but I do not know how to resolve it. I tried using Kryo, registering the generated class (Data), and registering different serializers from chill_avro for that class (SpecificRecordSerializer, SpecificRecordBinarySerializer, etc.), but none of those ideas helped.

I posted exactly the same question on Stack Overflow but did not receive any response:
http://stackoverflow.com/questions/33027851/spark-issue-with-the-class-generated-from-avro-schema

What is more, I created a minimal working example that makes it easy to reproduce the problem:
https://github.com/alberskib/spark-avro-serialization-issue

How can I solve this problem?

Thanks,
Bartek

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Issue-with-the-class-generated-from-avro-schema-tp24997.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
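[Follow-up note] The symptom (every cached record collapsing to one distinct value) is consistent with Avro's record-reuse optimization rather than with Kryo itself: a SpecificDatumReader may hand back the same mutable object for every record, so the uncached RDD recomputes values lazily and looks correct, while cache() with in-memory storage keeps many references to one instance. A minimal sketch of that aliasing effect in plain Scala, using a hypothetical mutable Record class as a stand-in for the generated Data class:

```scala
// Hypothetical stand-in for the Avro-generated Data class.
final class Record(var userId: String, var date: String)

object AliasingDemo {
  // Simulates a reader that reuses one mutable object per record (Avro-style).
  // With copyEachRecord = false, every stored element is the SAME reference,
  // so after the loop they all expose the last values written.
  def distinctCount(copyEachRecord: Boolean): Int = {
    val reused = new Record("", "")
    val stored = (1 to 5).map { i =>
      reused.userId = s"user$i"
      reused.date = s"2015-10-0$i"
      if (copyEachRecord) new Record(reused.userId, reused.date) // defensive copy
      else reused                                                // aliased reference
    }
    stored.map(r => r.userId + r.date).distinct.size
  }

  def main(args: Array[String]): Unit = {
    println(distinctCount(copyEachRecord = false)) // 1 — mirrors the cached RDD
    println(distinctCount(copyEachRecord = true))  // 5 — mirrors the uncached run
  }
}
```

If that is the cause here, one possible workaround (an untested assumption, not a confirmed fix) is to make a defensive copy of each record before caching, e.g. `loadFromHDFS[Data](path,...).map(d => Data.newBuilder(d).build()).cache()`, or to cache with a serialized storage level so each record is materialized independently.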