Hi all, 

I have a piece of code written in Spark that loads data from HDFS into Java
classes generated from an Avro IDL. On the RDD created that way I execute a
simple operation whose result depends on whether I cache the RDD first.
If I run the code below

val loadedData = loadFromHDFS[Data](path,...)
println(loadedData.map(x => x.getUserId + x.getDate).distinct().count()) // 200000
the program prints 200000. On the other hand, executing the following code

val loadedData = loadFromHDFS[Data](path,...).cache()
println(loadedData.map(x => x.getUserId + x.getDate).distinct().count()) // 1
results in 1 being printed to stdout.

When I inspect the values of the fields after reading the cached data, it seems

I am pretty sure the root cause of the described problem is an issue with
serialization of the classes generated from the Avro IDL, but I do not know
how to resolve it. I tried using Kryo, registering the generated class (Data),
and registering different serializers from chill_avro for that class
(SpecificRecordSerializer, SpecificRecordBinarySerializer, etc.), but none of
those ideas helped.
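For what it's worth, here is a Spark-free sketch of one mechanism I suspect
could produce this behaviour (this is only an assumption on my part, and
`Record`/`readerIterator` are invented names, not anything from my project):
if the underlying reader reuses a single mutable object, then caching the raw
references would make every cached element point at the same record.

```scala
// Sketch of the suspected cause: a reader that reuses one mutable instance.
// (Hypothetical names; not from my actual project.)
case class Record(var userId: String, var date: String)

object ReuseDemo {
  // Simulates a record reader that returns the same mutable object each time,
  // the way Hadoop record readers reuse objects for efficiency.
  def readerIterator(rows: Seq[(String, String)]): Iterator[Record] = {
    val reused = Record(null, null)
    rows.iterator.map { case (u, d) =>
      reused.userId = u
      reused.date = d
      reused // same object every time
    }
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(("u1", "2015-10-01"), ("u2", "2015-10-02"), ("u3", "2015-10-03"))

    // Keeping the raw references (analogous to caching the RDD as-is):
    // the list holds three pointers to one record, so distinct sees one value.
    val refs = readerIterator(rows).toList
    println(refs.map(r => r.userId + r.date).distinct.size) // 1

    // Copying each record before keeping it restores the expected count.
    val copies = readerIterator(rows).map(_.copy()).toList
    println(copies.map(r => r.userId + r.date).distinct.size) // 3
  }
}
```

If that is indeed the cause, copying each record before caching (e.g. a deep
copy of the Avro object) might be a workaround, but I would appreciate
confirmation.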

I posted exactly the same question on Stack Overflow but did not receive any
response:
http://stackoverflow.com/questions/33027851/spark-issue-with-the-class-generated-from-avro-schema
  

What is more, I created a minimal working example that makes it easy to
reproduce the problem:
https://github.com/alberskib/spark-avro-serialization-issue

How can I solve this problem?


Thanks,
Bartek



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Issue-with-the-class-generated-from-avro-schema-tp24997.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
