I tried to load a custom type from Avro files into an RDD using the
newAPIHadoopFile method. I started with the following naive code:

JavaPairRDD<MyCustomClass, NullWritable> events =
        sc.newAPIHadoopFile("file:/path/to/data.avro",
                AvroKeyInputFormat.class, MyCustomClass.class,
                NullWritable.class,
                sc.hadoopConfiguration());
Tuple2<MyCustomClass, NullWritable> first = events.first();

This doesn't work, and shouldn't work, because AvroKeyInputFormat returns
a GenericData$Record. The thing is, it compiles, and you can even assign the
first tuple to the variable "first". You only get a runtime error when you
try to access a field of MyCustomClass through the tuple (e.g.
first._1().getSomeField()).
This behavior sent me on a wild goose chase that took many hours over many
weeks to figure out, because I never expected the method to return the wrong
type at runtime. If there's a mismatch between what the InputFormat returns
and the class I'm trying to load - shouldn't this be a compilation error? Or
at least the runtime error should occur as soon as I assign the tuple to a
variable of the wrong type. This is very unexpected behavior.
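
As far as I can tell (this is my own sketch of the mechanics, not anything
stated in the Spark docs), the failure is deferred because of type erasure:
at runtime the pair RDD simply holds whatever AvroKeyInputFormat produced,
and the compiler only inserts a cast to MyCustomClass at the point where the
key is actually used as that type:

Tuple2<MyCustomClass, NullWritable> first = events.first();  // no cast on the key yet, so no error
Tuple2<?, ?> raw = first;
Object rawKey = raw._1();                                    // still fine, the key is just an Object here
System.out.println(rawKey.getClass().getName());             // org.apache.avro.generic.GenericData$Record
first._1().getSomeField();                                   // only here: ClassCastException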

Moreover, I actually fixed my code and implemented all the required wrapper
and custom classes:
JavaPairRDD<MyCustomAvroKey, NullWritable> records =
        sc.newAPIHadoopFile("file:/path/to/data.avro",
                MyCustomInputFormat.class, MyCustomAvroKey.class,
                NullWritable.class,
                sc.hadoopConfiguration());
Tuple2<MyCustomAvroKey, NullWritable> first = records.first();
MyCustomAvroKey customKey = first._1();
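
For reference, the wrapper classes follow roughly this pattern (a simplified
sketch rather than my exact code, each public class in its own file; it
mirrors Avro's own AvroKey / AvroKeyRecordReader, and assumes MyCustomClass
is an Avro-generated record class so getClassSchema() exists):

import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroRecordReaderBase;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class MyCustomAvroKey extends AvroKey<MyCustomClass> {
    public MyCustomAvroKey() { super(); }
    public MyCustomAvroKey(MyCustomClass datum) { super(datum); }
}

public class MyCustomAvroReader
        extends AvroRecordReaderBase<MyCustomAvroKey, NullWritable, MyCustomClass> {
    public MyCustomAvroReader() {
        // the reader schema passed here must resolve to MyCustomClass,
        // namespace included -- this is exactly where my stale namespace bit me
        super(MyCustomClass.getClassSchema());
    }
    @Override
    public MyCustomAvroKey getCurrentKey() {
        // getCurrentRecord() is the datum deserialized by AvroRecordReaderBase
        return new MyCustomAvroKey(getCurrentRecord());
    }
    @Override
    public NullWritable getCurrentValue() {
        return NullWritable.get();
    }
}

public class MyCustomInputFormat extends FileInputFormat<MyCustomAvroKey, NullWritable> {
    @Override
    public RecordReader<MyCustomAvroKey, NullWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new MyCustomAvroReader();
    }
}

The important detail is the reader schema passed to the AvroRecordReaderBase
constructor: that is the schema whose namespace has to agree with the class.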

But this time I forgot that I had moved the class to another package, so the
namespace in the schema file was wrong. And again, at runtime the datum()
method of customKey returned a GenericData$Record instead of a
MyCustomClass.
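
In hindsight, a trivial sanity check would have caught it (a hypothetical
snippet, assuming the reader schema is parsed from the .avsc at runtime; the
file name is made up and error handling is omitted): the schema's full name,
i.e. namespace plus record name, has to match the fully qualified name of
the class, otherwise Avro cannot resolve it.

Schema readerSchema = new Schema.Parser().parse(
        getClass().getResourceAsStream("/my_custom_class.avsc"));
System.out.println(readerSchema.getFullName());     // e.g. "com.old.pkg.MyCustomClass"
System.out.println(MyCustomClass.class.getName());  // "com.new.pkg.MyCustomClass" -> mismatch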

Now, I understand that this has to do with the Avro library (the
GenericDatumReader class has an "expected" and an "actual" schema, and it
defaults to a GenericData$Record if something is wrong with my schema). But
does it really make sense for this API to return a different class, one that
isn't even assignable to my class, when this happens? Why would I ever get a
class U out of a wrapper declared as Wrapper<T>? It's just confusing and
makes it so much harder to pinpoint the real problem.
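
The fallback is easy to reproduce outside Spark, too. Below is a minimal
standalone sketch based on my understanding of the Avro classes (class names
and file paths are placeholders): when the reader schema's full name does not
resolve to a class on the classpath, SpecificDatumReader silently hands back
a GenericData$Record instead of failing fast.

import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.io.DatumReader;
import org.apache.avro.specific.SpecificDatumReader;

public class AvroFallbackDemo {
    public static void main(String[] args) throws IOException {
        // schema whose namespace no longer matches the package of MyCustomClass;
        // data.avro was written with that same (old-namespace) schema
        Schema readerSchema = new Schema.Parser()
                .parse(new File("schema-with-old-namespace.avsc"));

        DatumReader<MyCustomClass> reader = new SpecificDatumReader<>(readerSchema);
        DataFileReader<MyCustomClass> fileReader =
                new DataFileReader<>(new File("data.avro"), reader);

        Object datum = fileReader.next();            // actually a GenericData.Record
        System.out.println(datum.getClass());        // class org.apache.avro.generic.GenericData$Record
        MyCustomClass typed = (MyCustomClass) datum; // the ClassCastException only shows up here
    }
}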

As I said, this weird behavior cost me a lot of time. I've been googling
this for weeks and am getting the impression that very few Java developers
have figured this API out. I posted a question about it on Stack Overflow
(http://stackoverflow.com/questions/41836851/wrong-runtime-type-in-rdd-when-reading-from-avro-with-custom-serializer)
and got several views and upvotes but no replies, while a similar question
about loading custom types in Google Dataflow
(http://stackoverflow.com/questions/41834120/override-avroio-default-coder-in-dataflow/)
got answered within a couple of days.

I think this behavior should be considered a bug.



