I just spotted one: HADOOP-10027. A field was removed from the Java
layer, but it could still be referenced by an older version of the
native layer. A backwards-compatible version of that patch would have
preserved the old field in the Java layer.
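For illustration, a minimal sketch of how that class of break surfaces.
The class and field names below are hypothetical stand-ins, not the
actual ones from HADOOP-10027; the point is that the native side looks
the field up by name, so removing it from the Java class only fails at
run time, when the lookup executes.

    /* Older native code, compiled and shipped before the Java field
     * was removed. "org.example.NativeCodec" and its "buf" field are
     * hypothetical. */
    #include <jni.h>

    JNIEXPORT void JNICALL
    Java_org_example_NativeCodec_process(JNIEnv *env, jobject self)
    {
        jclass cls = (*env)->GetObjectClass(env, self);

        /* If the newer Java class no longer declares "buf", GetFieldID
         * returns NULL and posts a NoSuchFieldError. Nothing catches
         * this at build or load time; it fires on first use. */
        jfieldID buf_id = (*env)->GetFieldID(env, cls, "buf", "[B");
        if (buf_id == NULL) {
            return; /* pending NoSuchFieldError is thrown in Java */
        }

        jobject buf = (*env)->GetObjectField(env, self, buf_id);
        (void)buf; /* ... real code would use the buffer here ... */
    }

Keeping the old field in the Java class, even if nothing in the
rewritten Java code reads it, keeps that lookup working for an older
libhadoop.so.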
Full disclosure: I was the one who committed that patch, so this was a
miss by me during the code review.

--Chris Nauroth

On 10/6/15, 9:03 AM, "Chris Nauroth" <cnaur...@hortonworks.com> wrote:

>Alan, would you please list the specific patches/JIRA issues that broke
>compatibility? I have not been reviewing the native code lately, so it
>would help me catch up quickly if you already know which specific
>patches have introduced problems. If those patches currently reside
>only on trunk and branch-2, then they have not yet shipped in an Apache
>release. We'd still have an opportunity to fix them and avoid "dropping
>the match" before shipping 2.8.0.
>
>Yes, we are aware that binary compatibility goes beyond the function
>signatures and into data layout and semantics.
>
>--Chris Nauroth
>
>On 10/6/15, 8:25 AM, "Alan Burlison" <alan.burli...@oracle.com> wrote:
>
>>On 06/10/2015 10:52, Steve Loughran wrote:
>>
>>>> That's not achievable, as the method signatures need to change.
>>>> Even though they are private, they need to change from static to
>>>> normal methods, and the signatures need to change as well, as I
>>>> said.
>>>
>>> We've done it before, simply by retaining the older method entry
>>> points. Moving from static to instance-specific is a bigger change.
>>> If the old entry points are there and retained, even if all uses
>>> have been ripped out of the Hadoop code, then the new methods will
>>> get used. It's just that old stuff will still link.
>>
>>As I explained in my last email, converting the old static JNI
>>functions into wrappers around new instance JNI functions requires a
>>jobject reference to be passed into the new function that the old one
>>wraps. The static methods can't magic one up. An instance pointer *is*
>>available: the current code flow is Java object method -> static JNI
>>function, so if we could change the JNI functions from static to
>>instance, we'd have what we needed. But if you are considering the JNI
>>layer to be a public interface (which I think is a big mistake, no
>>matter how convenient it might be), then you are simply screwed, both
>>here and in other places. As I've said, I have a suspicion that
>>changes we've already made have broken that compatibility anyway.
>>
>>>> JNI code is intimately intertwined with the Java code it runs with.
>>>> Running mismatched Java & JNI versions is going to be a recipe for
>>>> eventual disaster, as the JVM explicitly does *not* do any error
>>>> checking between Java and JNI.
>>>
>>> You mean JNI code built for Java 7 isn't guaranteed to work on Java
>>> 8? If so, that's not something we knew of, and something to worry
>>> about.
>>
>>Actually, I think that particular scenario is going to be OK. I wasn't
>>clear - sorry - what I was musing about was the fact that the Hadoop
>>JNI IO code delves into the innards of the platform Java classes and
>>pulls out bits of private data. That's explicitly not-an-interface and
>>could break at any time; the likelihood may be low, but the JVM
>>developers could change it and you'd just be SOL. The same goes for
>>all the other private Java interfaces that Hadoop consumes - all the
>>ones you get warnings about when you build it. There are already plans
>>to make significant changes to sun.misc.Unsafe, for example. That will
>>affect Hadoop.
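For illustration, this is roughly what "pulling out bits of private
data" looks like in JNI: native code reading the private "fd" slot of
java.io.FileDescriptor, much as Hadoop's NativeIO does. The field name
and type are JDK implementation details, not a supported interface, so
a JVM change here breaks the lookup with no warning at build time.

    /* Cached once at library load. "fd" is a private field of
     * java.io.FileDescriptor, a JDK implementation detail that the
     * JDK is free to rename or retype in any release. */
    #include <jni.h>

    static jfieldID fd_field;

    static void cache_fd_field(JNIEnv *env)
    {
        jclass cls = (*env)->FindClass(env, "java/io/FileDescriptor");
        /* If a future JDK drops or renames the field, GetFieldID
         * returns NULL and posts NoSuchFieldError at run time. */
        fd_field = (*env)->GetFieldID(env, cls, "fd", "I");
    }

    /* Extract the raw OS file descriptor from a FileDescriptor
     * object. */
    static int get_raw_fd(JNIEnv *env, jobject fd_object)
    {
        return (int)(*env)->GetIntField(env, fd_object, fd_field);
    }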
>>
>>>> At some point some innocuous change will be made that will just
>>>> cause undefined behaviour.
>>>>
>>>> I don't actually know how you'd get a JAR/JNI mismatch, as they are
>>>> built and packaged together, so I'm struggling to understand what
>>>> the potential issue is here.
>>>
>>> It arises whenever you try to deploy to YARN any application
>>> containing, directly or indirectly (e.g. inside the spark-assembly
>>> JAR), the Hadoop Java classes of a previous Hadoop version.
>>> libhadoop is on the PATH of the far end, your app uploads its Hadoop
>>> JARs, and the moment something tries to use the JNI-backed method
>>> you get to see a stack trace.
>>>
>>> https://issues.apache.org/jira/browse/HADOOP-11064
>>>
>>> If you look at the patch there, that's the kind of thing I'd like to
>>> see to address your Solaris issues.
>>
>>Hmm, yes. That appears to be a short-term hack-around to keep things
>>running, not a fix. At very best, it's extremely fragile.
>>
>> From the bug:
>>
>>"We don't have any way of enforcing C API stability. Jenkins doesn't
>>check for it, most Java programmers don't know how to achieve it."
>>
>>In which case I think reading this will be helpful:
>>http://docs.oracle.com/cd/E19253-01/817-1984/chapter5-84101/index.html
>>
>>The assumption seems to be that as long as libhadoop.so keeps the same
>>list of functions with the same arguments, then it will be
>>backwards-compatible. Unfortunately that's just flat-out wrong. Binary
>>compatibility requires more than that. It also requires that there are
>>no changes to any data structures, and that the semantics of all the
>>functions remain completely unchanged. I'd put money on that not being
>>the case already. The errors you saw in HADOOP-11064 are the easy ones
>>because you got a run-time linker error. The others will cause
>>mysterious behaviour, memory corruption and general WTFness.
>>
>>>> In any case, the constraint you are requesting would flat-out
>>>> preclude this change, and would also mean that most of the other
>>>> JNI changes that have been committed recently would have to be
>>>> ripped out as well. In summary, the bridge is already burned.
>>>
>>> We've covered the bridge in petrol but not quite dropped a match on
>>> it.
>>
>>No, I'm reasonably certain you've already dropped the match, and if
>>you haven't, it's just good fortune.
>>
>>--
>>Alan Burlison
>>--
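To make the semantics point above concrete, a hypothetical example, not
taken from the Hadoop codebase: the prototype is byte-for-byte
identical across two library versions, so every signature-based
compatibility check passes, yet mixing callers built against one
version with a library of the other corrupts memory.

    /* v1: returns a pointer into an internal static buffer.
     * The caller must NOT free it. */
    const char *hx_get_path(int id);

    /* v2: same prototype, but now returns freshly malloc'd memory
     * that the caller must free. An old caller running against v2
     * leaks on every call; worse, a new caller running against v1
     * calls free() on a static buffer and corrupts the heap. No
     * linker error, no stack trace at the point of the bug, just the
     * "mysterious behaviour" described above. */
    const char *hx_get_path(int id);

This is the class of break that keeping function names and argument
lists stable cannot catch.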