Alan, would you please list the specific patches/JIRA issues that broke compatibility? I have not been reviewing the native code lately, so it would help me catch up quickly if you already know which specific patches have introduced problems. If those patches currently reside only on trunk and branch-2, then they have not yet shipped in an Apache release. We'd still have an opportunity to fix them and avoid "dropping the match" before shipping 2.8.0.
Yes, we are aware that binary compatibility goes beyond the function
signatures and into data layout and semantics.

--Chris Nauroth

On 10/6/15, 8:25 AM, "Alan Burlison" <alan.burli...@oracle.com> wrote:

>On 06/10/2015 10:52, Steve Loughran wrote:
>
>>> That's not achievable as the method signatures need to change. Even
>>> though they are private they need to change from static to normal
>>> methods and the signatures need to change as well, as I said.
>>
>> We've done it before, simply by retaining the older method entry
>> points. Moving from static to instance-specific is a bigger change.
>> If the old entry points are there and retained, even if all uses have
>> been ripped out of the Hadoop code, then the new methods will get
>> used. It's just that old stuff will still link.
>
>As I explained in my last email, converting the old static JNI functions
>to be wrappers around new instance JNI functions requires a jobject
>reference to be passed into the new function that the old one wraps
>around. The static methods can't magic one up. An instance pointer *is*
>available: the current code flow is Java object method -> static JNI
>function, so if we could change the JNI from static->instance then we'd
>have what we needed. But if you are considering the JNI layer to be a
>public interface (which I think is a big mistake, no matter how
>convenient it might be), then you are simply screwed, both here and in
>other places. As I've said, I have a suspicion that changes we've
>already made have broken that compatibility anyway.
>
>>> JNI code is intimately intertwined with the Java code it runs
>>> with. Running mismatching Java & JNI versions is going to be a
>>> recipe for eventual disaster as the JVM explicitly does *not* do
>>> any error checking between Java and JNI.
>>
>> You mean JNI code built for Java 7 isn't guaranteed to work on Java
>> 8? If so, that's not something we knew of, and something to worry
>> about.
>
>Actually I think that particular scenario is going to be OK. I wasn't
>clear - sorry - what I was musing about was the fact that the Hadoop JNI
>IO code delves into the innards of the platform Java classes and pulls
>out bits of private data. That's explicitly not-an-interface and could
>break at any time; although the likelihood may be low, the JVM developers
>could change it and you'd just be SOL. The same goes for all the other
>private Java interfaces that Hadoop consumes - all the ones you get
>warnings about when you build it. For example, there are already plans
>to make significant changes to sun.misc.Unsafe. That will affect Hadoop.
>
>>> At some point some innocuous change will be made that will just
>>> cause undefined behaviour.
>>>
>>> I don't actually know how you'd get a JAR/JNI mismatch as they are
>>> built and packaged together, so I'm struggling to understand what
>>> the potential issue is here.
>>
>> It arises whenever you try to deploy to YARN any application
>> containing, directly or indirectly (e.g. inside the spark-assembly
>> JAR), the Hadoop Java classes of a previous Hadoop version. libhadoop
>> is on the PATH of the far end, your app uploads its Hadoop JARs, and
>> the moment something tries to use a JNI-backed method you get to see
>> a stack trace.
>>
>> https://issues.apache.org/jira/browse/HADOOP-11064
>>
>> If you look at the patch there, that's the kind of thing I'd like to
>> see to address your Solaris issues.
>
>Hmm, yes. That appears to be a short-term hack-around to keep things
>running, not a fix. At the very best, it's extremely fragile.
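To make the static-versus-instance point above concrete, here is a
minimal sketch of the two kinds of JNI entry points. The class and
method names are hypothetical, not the actual libhadoop symbols:

    #include <jni.h>

    /* New-style entry point: the Java side declares a non-static native
     * method, so JNI passes the receiver object. Per-instance state is
     * reachable through it. */
    JNIEXPORT void JNICALL
    Java_example_NativeIO_doWork(JNIEnv *env, jobject self)
    {
        /* instance fields are reachable via GetObjectClass/GetFieldID */
    }

    /* Old-style entry point: the Java side declared a static native
     * method, so JNI passes only the class. There is no jobject here,
     * which is why this function cannot simply wrap the instance
     * version above: it has no instance to forward. */
    JNIEXPORT void JNICALL
    Java_example_NativeIO_doWorkStatic(JNIEnv *env, jclass clazz)
    {
        /* no receiver available; cannot delegate to the new code path */
    }

Keeping the old symbol alive would require the Java caller to pass an
instance in as an explicit extra argument, which changes the signature
anyway - exactly the bind described above.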
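The "delving into the innards" problem looks roughly like the following.
This is a sketch of the general pattern, not the exact Hadoop code:
native code looks up a private field of a JDK class by name, here
java.io.FileDescriptor's private int "fd":

    #include <jni.h>

    /* Read a private JDK field from native code. The field's name and
     * type are implementation details of the JDK, not an interface; a
     * JDK update could rename or remove them and nothing at build time
     * would notice. */
    static jint get_raw_fd(JNIEnv *env, jobject fd_obj)
    {
        jclass clazz = (*env)->GetObjectClass(env, fd_obj);
        jfieldID fid = (*env)->GetFieldID(env, clazz, "fd", "I");
        if (fid == NULL) {
            /* field missing in this JDK; a real caller would also
             * clear the pending NoSuchFieldError */
            return -1;
        }
        return (*env)->GetIntField(env, fd_obj, fid);
    }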
>
> From the bug:
>
>"We don't have any way of enforcing C API stability. Jenkins doesn't
>check for it, most Java programmers don't know how to achieve it."
>
>In which case I think reading this will be helpful:
>http://docs.oracle.com/cd/E19253-01/817-1984/chapter5-84101/index.html
>
>The assumption seems to be that as long as libhadoop.so keeps the same
>list of functions with the same arguments then it will be
>backwards-compatible. Unfortunately that's just flat-out wrong. Binary
>compatibility requires more than that. It also requires that there are
>no changes to any data structures, and that the semantics of all the
>functions remain completely unchanged. I'd put money on that not being
>the case already. The errors you saw in HADOOP-11064 are the easy ones
>because you got a run-time linker error. The others will cause
>mysterious behaviour, memory corruption and general WTFness.
>
>>> In any case the constraint you are requesting would flat-out
>>> preclude this change, and would also mean that most of the other
>>> JNI changes that have been committed recently would have to be
>>> ripped out as well. In summary, the bridge is already burned.
>>
>> We've covered the bridge in petrol but not quite dropped a match on
>> it.
>
>No, I'm reasonably certain you've already dropped the match, and if you
>haven't, it's just good fortune.
>
>--
>Alan Burlison
>--
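To illustrate why an unchanged function list is not enough, here is a
minimal C sketch (hypothetical names, not actual libhadoop code) of the
silent break described above: the symbol and its prototype are identical
across versions, and only the data layout behind them moves.

    /* v1 of the library header, which old callers were compiled
     * against:
     *
     *     struct io_state { int fd; };
     *     void fill_state(struct io_state *s);
     *
     * v2 below keeps the same symbol and the same prototype, but the
     * struct has grown. An old caller still links cleanly, passes a
     * v1-sized buffer, and fill_state writes past the end of it: no
     * run-time linker error, just memory corruption. */
    struct io_state {
        int  fd;
        long offset;   /* new in v2 */
    };

    void fill_state(struct io_state *s)
    {
        s->fd = 0;
        s->offset = 0; /* lands outside a v1 caller's buffer */
    }

A semantic change is just as invisible: a function that keeps its
prototype but changes what its return value means will also link
cleanly and misbehave quietly.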