And by the way - I don't want the Avro details to be hidden away from me.
The whole purpose of the work I'm doing is to benchmark different
serialization tools and strategies. If I want to use Kryo serialization,
for example, then I need to understand how the API works. And that's very
difficult when the API is doing unexpected things.
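
For example, the kind of setup I'm comparing is just the standard Spark
configuration for switching serializers (the app name and registered class
below are only placeholders):

    SparkConf conf = new SparkConf()
            .setAppName("serialization-benchmark")
            .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
            .registerKryoClasses(new Class<?>[]{MyCustomClass.class});
    JavaSparkContext sc = new JavaSparkContext(conf);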

On Mon, Mar 6, 2017 at 1:25 PM, Nira Amit <amitn...@gmail.com> wrote:

> Hi Sean,
> Yes, we discussed this in Jira and you suggested I take this discussion to
> the mailing list, so I did.
> I don't have the option to migrate the code I'm working on to Datasets at
> the moment (or to Scala, as another developer suggested in the Jira
> discussion), so I have to work with the Java RDD API.
> I've been working with Java for many years and understand that not all
> type errors can be caught at compile time. What I don't understand is how
> you manage to create an object of type AvroKey<MyCustomType> with the
> actual datum it encloses being GenericData$Record. If my code threw a
> RuntimeException in the line `MyCustomAvroKey customKey = first._1;` for
> example, saying it has an AvroKey<GenericData$Record> - then there would
> be no confusion. But what happens in practice is that somehow my customKey
> is of type AvroKey<MyCustomType>, and I only get the exception when I try
> to retrieve the MyCustomType datum. There must be some hackish things
> going on under the hood here, because this is just not how Java is supposed
> to work.
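> To be clear about the Java part: I know the usual unchecked-cast
> situation in plain Java (java.util collections here, nothing to do with
> Spark), where code like this compiles and only fails when the value is
> actually used:
>
>     List rawList = new ArrayList();
>     rawList.add(new Object());
>     List<String> strings = (List<String>) rawList; // unchecked warning only
>     String s = strings.get(0);                     // ClassCastException here
>
> But I never wrote a cast like that anywhere in my code.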
> Which is why I still think that this should be considered a bug.
>
> On Mon, Mar 6, 2017 at 1:02 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> I think this is the same thing we already discussed extensively on your
>> JIRA.
>>
>> The key/value class arguments to newAPIHadoopFile are not the type of
>> your custom class, but the Writable types describing how keys and values
>> are encoded in the file. I think that's part of the problem.
>> This is how all Hadoop-related APIs would work, because Hadoop uses
>> Writables for encoding.
>>
>> You're asking again why it isn't caught at compile time, and that stems
>> from two basic causes. First is the way the underlying Hadoop API works,
>> needing Class parameters because of its Java roots. Second is the
>> Scala/Java difference; the Scala API will accept, for instance,
>> non-Writable arguments if you can supply an implicit conversion to
>> Writable (if I recall correctly). This isn't available in Java, so the
>> Java API appears to offer flexibility that isn't really there. That isn't
>> the exact issue here, though; it's that you're using raw class literals in
>> Java, which have no generic types -- they are Class<?>. The InputFormat
>> arg expresses nothing about the key/value types; there's nothing to
>> 'contradict' your declaration, which doesn't represent the actual types
>> correctly. (You can cast class literals to (Class<..>) to express this if
>> you want. It's a little messy in Java.) That's why it compiles, just as
>> any Java code with an invalid cast compiles but fails at runtime.
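>>
>> To make that parenthetical concrete, a rough sketch of the class-literal
>> cast (unchecked, and purely to tell the compiler what the runtime types
>> will be; here I'm using the generic Avro record type as the key payload):
>>
>>     @SuppressWarnings("unchecked")
>>     Class<AvroKeyInputFormat<GenericData.Record>> inputFormatClass =
>>             (Class<AvroKeyInputFormat<GenericData.Record>>)
>>                     (Class<?>) AvroKeyInputFormat.class;
>>     @SuppressWarnings("unchecked")
>>     Class<AvroKey<GenericData.Record>> keyClass =
>>             (Class<AvroKey<GenericData.Record>>)
>>                     (Class<?>) AvroKey.class;
>>
>>     JavaPairRDD<AvroKey<GenericData.Record>, NullWritable> rdd =
>>             sc.newAPIHadoopFile("file:/path/to/data.avro",
>>                     inputFormatClass, keyClass, NullWritable.class,
>>                     sc.hadoopConfiguration());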
>>
>> It is a bit weird if you're not familiar with the Hadoop APIs, Writables,
>> or how Class arguments shake out in the context of generics. It does take
>> the research you did. It does work as you've found. The reason you were
>> steered several times to the DataFrame API is that it can hide a lot of
>> this from you, including details of Avro and Writables. You're directly
>> accessing Hadoop APIs that are foreign to you.
>>
>> This and the JIRA do not describe a bug.
>>
>>
>>
>> On Mon, Mar 6, 2017 at 11:29 AM Nira <amitn...@gmail.com> wrote:
>>
>>> I tried to load a custom type from Avro files into an RDD using
>>> newAPIHadoopFile. I started with the following naive code:
>>>
>>> JavaPairRDD<MyCustomClass, NullWritable> events =
>>>         sc.newAPIHadoopFile("file:/path/to/data.avro",
>>>                 AvroKeyInputFormat.class, MyCustomClass.class,
>>>                 NullWritable.class, sc.hadoopConfiguration());
>>> Tuple2<MyCustomClass, NullWritable> first = events.first();
>>>
>>> This doesn't work and shouldn't work, because the AvroKeyInputFormat
>>> returns a GenericData$Record. The thing is, it compiles, and you can even
>>> assign the first tuple to the variable "first". You only get a runtime
>>> error when you try to access a field of MyCustomClass from the tuple
>>> (e.g. first._1.getSomeField()). This behavior sent me on a wild goose
>>> chase that took many hours over many weeks to figure out, because I never
>>> expected the method to return the wrong type at runtime. If there's a
>>> mismatch between what the InputFormat returns and the class I'm trying to
>>> load - shouldn't this be a compilation error? Or at least the runtime
>>> error should occur already when I try to assign the tuple to a variable
>>> of the wrong type. This is very unexpected behavior.
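>>>
>>> A debugging sketch like this side-steps the declared key type and shows
>>> what is actually in the RDD:
>>>
>>>     Object rawKey = events.keys().first(); // assigning to Object avoids
>>>                                            // the hidden cast, so no error
>>>     System.out.println(rawKey.getClass()); // the actual runtime class,
>>>                                            // which is not MyCustomClass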
>>>
>>> Moreover, I actually fixed my code and implemented all the required
>>> wrapper
>>> and custom classes:
>>> JavaPairRDD<MyCustomAvroKey, NullWritable> records =
>>>         sc.newAPIHadoopFile("file:/path/to/data.avro",
>>>                 MyCustomInputFormat.class, MyCustomAvroKey.class,
>>>                 NullWritable.class, sc.hadoopConfiguration());
>>> Tuple2<MyCustomAvroKey, NullWritable> first = records.first();
>>> MyCustomAvroKey customKey = first._1;
>>>
>>> But this time I forgot that I had moved the class to another package, so
>>> the namespace in the schema file was wrong. And again, at runtime the
>>> datum() method of customKey returned a GenericData$Record instead of a
>>> MyCustomClass.
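>>>
>>> A defensive check along these lines (just a sketch, assuming
>>> MyCustomAvroKey is a generic wrapper in the AvroWrapper style) would at
>>> least have failed fast with a clear message:
>>>
>>>     Object datum = customKey.datum(); // assigning to Object: no hidden cast
>>>     if (!(datum instanceof MyCustomClass)) {
>>>         throw new IllegalStateException("Expected MyCustomClass but got "
>>>                 + datum.getClass().getName());
>>>     }
>>>     MyCustomClass value = (MyCustomClass) datum;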
>>>
>>> Now, I understand that this has to do with the Avro library (the
>>> GenericDatumReader class has an "expected" and an "actual" schema, and it
>>> defaults to a GenericData$Record if something is wrong with my schema).
>>> But does it really make sense for this API to return a different class,
>>> one that is not even assignable to my class, when this happens? Why would
>>> I ever get a class U from a wrapper class declared to be a Wrapper<T>?
>>> It's just confusing and makes it so much harder to pinpoint the real
>>> problem.
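>>>
>>> In hindsight, a small sanity check outside Spark would have caught my
>>> namespace mistake much earlier. Just a sketch (paths are placeholders, the
>>> classes are org.apache.avro.Schema, GenericDatumReader and DataFileReader,
>>> and it runs in a method that throws IOException):
>>>
>>>     // Compare the full name declared in the .avsc with the class's fully
>>>     // qualified name and with the writer schema embedded in the data file.
>>>     Schema declared = new Schema.Parser().parse(new File("/path/to/schema.avsc"));
>>>     System.out.println(declared.getFullName());        // namespace + name from the .avsc
>>>     System.out.println(MyCustomClass.class.getName()); // should match after the package move
>>>
>>>     DatumReader<GenericRecord> reader = new GenericDatumReader<>();
>>>     try (DataFileReader<GenericRecord> in =
>>>             new DataFileReader<>(new File("/path/to/data.avro"), reader)) {
>>>         System.out.println(in.getSchema().getFullName()); // writer schema in the file
>>>     }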
>>>
>>> As I said, this weird behavior cost me a lot of time, and I've been
>>> googling this for weeks and am getting the impression that very few Java
>>> developers have figured this API out. I posted a question
>>> <http://stackoverflow.com/questions/41836851/wrong-runtime-type-in-rdd-when-reading-from-avro-with-custom-serializer>
>>> about it on StackOverflow and got several views and upvotes but no replies
>>> (a similar question
>>> <http://stackoverflow.com/questions/41834120/override-avroio-default-coder-in-dataflow/>
>>> about loading custom types in Google Dataflow got answered within a couple
>>> of days).
>>>
>>> I think this behavior should be considered a bug.
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Wrong-runtime-type-when-using-newAPIHadoopFile-in-Java-tp28459.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>>
>>>
>
