And by the way - I don't want the Avro details to be hidden away from me. The whole purpose of the work I'm doing is to benchmark different serialization tools and strategies. If I want to use Kryo serialization, for example, then I need to understand how the API works. And that's very difficult if the API is doing unexpected things.
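(A minimal sketch, for context, of what enabling Kryo for such a benchmark usually looks like in Spark's Java API. This is not taken from the thread: the app name is a placeholder, and it assumes MyCustomClass works with Kryo's default serialization.)

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    // Sketch only: assumes MyCustomClass (from this thread) serializes with
    // Kryo's default strategies; registering it avoids writing full class names.
    SparkConf conf = new SparkConf()
            .setAppName("serialization-benchmark")  // placeholder name
            .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
            .registerKryoClasses(new Class<?>[]{ MyCustomClass.class });
    JavaSparkContext sc = new JavaSparkContext(conf);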
On Mon, Mar 6, 2017 at 1:25 PM, Nira Amit <amitn...@gmail.com> wrote:
> Hi Sean,
> Yes, we discussed this in Jira and you suggested I take this discussion
> to the mailing list, so I did.
> I don't have the option to migrate the code I'm working on to Datasets
> at the moment (or to Scala, as another developer suggested in the Jira
> discussion), so I have to work with the Java RDD API.
> I've been working with Java for many years and understand that not all
> type errors can be caught at compile time. What I don't understand is
> how you manage to create an object of type AvroKey<MyCustomType> with
> the actual datum it encloses being a GenericData$Record. If my code
> threw a RuntimeException in the line `MyCustomAvroKey customKey =
> first._1;`, for example, saying it has an AvroKey<GenericData$Record> -
> then there would be no confusion. But what happens in practice is that
> somehow my customKey is of type AvroKey<MyCustomType>, and only when I
> try to retrieve the MyCustomType datum do I get the exception. There
> must be something hackish going on under the hood here, because this is
> just not how Java is supposed to work.
> Which is why I still think that this should be considered a bug.
>
> On Mon, Mar 6, 2017 at 1:02 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> I think this is the same thing we already discussed extensively on
>> your JIRA.
>>
>> The types of the key/value class arguments to newAPIHadoopFile are
>> not the type of your custom class, but of the Writable describing the
>> encoding of keys and values in the file. I think that's the start of
>> part of the problem. This is how all Hadoop-related APIs work, because
>> Hadoop uses Writables for encoding.
>>
>> You're asking again why it isn't caught at compile time, and that
>> stems from two basic causes. First is the way the underlying Hadoop
>> API works, needing Class parameters because of its Java roots. Second
>> is the Scala/Java difference; the Scala API will accept, for instance,
>> non-Writable arguments if you can supply an implicit conversion to
>> Writable (if I recall correctly). This isn't available in Java,
>> leaving its API expressing flexibility that isn't there. This isn't
>> the exact issue here; it's that you're using raw class literals in
>> Java, which have no generic types -- they are Class<?>. The
>> InputFormat arg expresses nothing about the key/value types; there's
>> nothing to 'contradict' your declaration, which doesn't represent the
>> actual types correctly. (You can cast class literals to (Class<..>)
>> to express this if you want. It's a little mess in Java.) That's why
>> it compiles, just as any Java code with an invalid cast compiles but
>> fails at runtime.
>>
>> It is a bit weird if you're not familiar with the Hadoop APIs,
>> Writables, or how Class arguments shake out in the context of
>> generics. It does take the research you did. It does work as you've
>> found. The reason you were steered several times to the DataFrame API
>> is that it can hide a lot of this from you, including details of Avro
>> and Writables. You're directly accessing Hadoop APIs that are foreign
>> to you.
>>
>> This and the JIRA do not describe a bug.
>>
>> On Mon, Mar 6, 2017 at 11:29 AM Nira <amitn...@gmail.com> wrote:
>>
>>> I tried to load a custom type from Avro files into an RDD using
>>> newAPIHadoopFile.
>>> I started with the following naive code:
>>>
>>>     JavaPairRDD<MyCustomClass, NullWritable> events =
>>>             sc.newAPIHadoopFile("file:/path/to/data.avro",
>>>                     AvroKeyInputFormat.class, MyCustomClass.class,
>>>                     NullWritable.class, sc.hadoopConfiguration());
>>>     Tuple2<MyCustomClass, NullWritable> first = events.first();
>>>
>>> This doesn't work and shouldn't work, because the AvroKeyInputFormat
>>> returns a GenericData$Record. The thing is, it compiles, and you can
>>> even assign the first tuple to the variable "first". You only get a
>>> runtime error when you try to access a field of MyCustomClass from
>>> the tuple (e.g. first._1.getSomeField()).
>>> This behavior sent me on a wild goose chase that took many hours over
>>> many weeks to figure out, because I never expected the method to
>>> return a wrong type at runtime. If there's a mismatch between what
>>> the InputFormat returns and the class I'm trying to load - shouldn't
>>> this be a compilation error? Or at least the runtime error should
>>> occur already when I try to assign the tuple to a variable of the
>>> wrong type. This is very unexpected behavior.
>>>
>>> Moreover, I actually fixed my code and implemented all the required
>>> wrapper and custom classes:
>>>
>>>     JavaPairRDD<MyCustomAvroKey, NullWritable> records =
>>>             sc.newAPIHadoopFile("file:/path/to/data.avro",
>>>                     MyCustomInputFormat.class, MyCustomAvroKey.class,
>>>                     NullWritable.class, sc.hadoopConfiguration());
>>>     Tuple2<MyCustomAvroKey, NullWritable> first = records.first();
>>>     MyCustomAvroKey customKey = first._1;
>>>
>>> But this time I forgot that I had moved the class to another package,
>>> so the namespace in the schema file was wrong. And again, at runtime
>>> the datum() method of customKey returned a GenericData$Record instead
>>> of a MyCustomClass.
>>>
>>> Now, I understand that this has to do with the Avro library (the
>>> GenericDatumReader class has an "expected" and an "actual" schema,
>>> and it defaults to a GenericData$Record if something is wrong with my
>>> schema). But does it really make sense to return a different class
>>> from this API, one which is not even assignable to my class, when
>>> this happens? Why would I ever get a class U from a wrapper class
>>> declared to be a Wrapper<T>? It's just confusing and makes it so much
>>> harder to pinpoint the real problem.
>>>
>>> As I said, this weird behavior cost me a lot of time. I've been
>>> googling this for weeks and am getting the impression that very few
>>> Java developers have figured this API out. I posted a question
>>> <http://stackoverflow.com/questions/41836851/wrong-runtime-type-in-rdd-when-reading-from-avro-with-custom-serializer>
>>> about it on StackOverflow and got several views and upvotes but no
>>> replies (a similar question
>>> <http://stackoverflow.com/questions/41834120/override-avroio-default-coder-in-dataflow/>
>>> about loading custom types in Google Dataflow was answered within a
>>> couple of days).
>>>
>>> I think this behavior should be considered a bug.
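Two short sketches may help make the two points above concrete.

First, on Sean's remark that you can cast class literals to (Class<..>): roughly what that could look like for the first snippet, assuming (and this is an assumption, not something established in this thread) that MyCustomClass is an Avro specific record usable with the stock AvroKeyInputFormat. Note that the unchecked casts only describe the types; they give the compiler nothing to verify.

    import org.apache.avro.mapred.AvroKey;
    import org.apache.avro.mapreduce.AvroKeyInputFormat;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.spark.api.java.JavaPairRDD;
    import scala.Tuple2;

    // Class literals are raw (Class<?>), so the generic key type can only be
    // stated via an unchecked double cast. Reader-schema setup for
    // AvroKeyInputFormat is omitted here; if the reader schema doesn't match
    // the files, Avro still hands back a GenericData$Record at runtime.
    @SuppressWarnings("unchecked")
    Class<AvroKeyInputFormat<MyCustomClass>> inputFormatClass =
            (Class<AvroKeyInputFormat<MyCustomClass>>) (Class<?>) AvroKeyInputFormat.class;
    @SuppressWarnings("unchecked")
    Class<AvroKey<MyCustomClass>> keyClass =
            (Class<AvroKey<MyCustomClass>>) (Class<?>) AvroKey.class;

    JavaPairRDD<AvroKey<MyCustomClass>, NullWritable> records =
            sc.newAPIHadoopFile("file:/path/to/data.avro",
                    inputFormatClass, keyClass, NullWritable.class,
                    sc.hadoopConfiguration());
    Tuple2<AvroKey<MyCustomClass>, NullWritable> first = records.first();

Second, on why the wrong datum only surfaces when it is read: a small self-contained demo of type erasure. The Wrapper class below is hypothetical (not Spark or Avro code); it only mimics a key wrapper whose declared type parameter doesn't match what it actually holds.

    // Hypothetical stand-in for a key wrapper; not Spark or Avro code.
    public class ErasureDemo {
        static class Wrapper<T> {
            private final Object datum;                   // erased: held as a plain Object
            Wrapper(Object datum) { this.datum = datum; }
            @SuppressWarnings("unchecked")
            T datum() { return (T) datum; }               // unchecked cast: nothing is checked here
        }

        public static void main(String[] args) {
            // Analogous to an AvroKey<MyCustomType> that actually holds a GenericData$Record:
            Wrapper<Integer> key = new Wrapper<>("not an Integer"); // compiles and runs fine
            Object raw = key.datum();   // still fine: T is erased, so no cast is exercised
            Integer n = key.datum();    // ClassCastException is thrown only here, at the use site
            System.out.println(n);
        }
    }

This is why `MyCustomAvroKey customKey = first._1;` succeeds while reading the datum back as MyCustomClass fails: the checked cast is inserted by the compiler only where the value is used as the specific type.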