There's a lot of stuff in 2.8; I'd like to see the s3a performance 
improvements & OpenStack fixes in there, for which I need reviewers. I don't 
have the spare time to do this myself.

I've already been building & testing both Apache Slider (incubating) and Apache 
Spark against both 2.8.0-SNAPSHOT & 3.0.0-SNAPSHOT. 

What's been troublesome for builds which use Maven to manage dependencies 
(I'm ignoring the fact that Spark *also* has an SBT build with Ivy doing 
dependency management)? The HDFS client.

-the hadoop-hdfs-client split pulled HdfsConfiguration. I'd been explicitly 
creating this to force in hdfs-default.xml & hdfs-site.xml loading, so that I 
could do sanity checks on things like security settings prior to attempting AM 
launch.

-likewise, DFSConfigKeys stayed in the hadoop-hdfs server-side artifact. I know 
it's tagged as @Private, but it's long been where all the string constants for 
HDFS options live. Forcing users to retype them in their own source is not only 
dangerous (it only encourages typos), it actually stops you using your IDE to 
find out where those constants get used. 

We do now have a set of keys in the client, HdfsClientConfigKeys, but these are 
still declared @Private. That's a mistake, for the reasons above, and because 
it encourages Hadoop developers to assume that they are free to make whatever 
changes they want to this code, and, if it breaks something, to say "it was 
tagged as private".
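For anyone following along, the dependency switch under discussion looks 
roughly like this in a downstream pom.xml (artifact IDs from the 2.8 line; the 
version string here is illustrative):

```
<!-- the new client-side artifact: API classes, no server internals -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-hdfs-client</artifactId>
  <version>2.8.0-SNAPSHOT</version>
</dependency>

<!-- what you still end up declaring if you need HdfsConfiguration
     or DFSConfigKeys, which stayed behind in the server artifact -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-hdfs</artifactId>
  <version>2.8.0-SNAPSHOT</version>
</dependency>
```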

1. We have to recognise that a lot of things marked @Private are in fact 
essential for clients to use. Not just constants, but actual classes.

2. We have to look hard at @LimitedPrivate and question the legitimacy of 
tagging things so, especially anything 
@InterfaceAudience.LimitedPrivate({"MapReduce"}) —because any YARN app you 
write ends up needing those classes. For evidence, look at DistributedShell's 
imports, and pick a few at random: NMClientAsyncImpl and ConverterUtils being 
easy targets.

3. Or for real fun, UGI: @InterfaceAudience.LimitedPrivate({"HDFS", 
"MapReduce", "HBase", "Hive", "Oozie"})

I'd advocate marking everything "MapReduce" as "YarnApp" and having people 
working on those classes accept that they will be used downstream and treat 
changes with caution. Yes, they may be messy, but that's how things get used. 
At least with a modern IDE you can add in the downstream projects and identify 
those uses with ease.

In the end SLIDER-948 addressed the problems for me. I switched to pulling in 
hadoop-hdfs *and* copied and pasted all the DFSConfigKeys constants I used into 
my own file of constants. 
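As a sketch of that workaround (class and field names here are my own, not 
Slider's; the key strings are the standard HDFS names, but verify any you copy 
against the Hadoop version you actually build with):

```java
/**
 * Private copy of the HDFS configuration key names this app uses.
 * Retyping these by hand is exactly the typo risk complained about
 * above —which is why a public DFSConfigKeys would be better.
 */
final class HdfsKeys {
  private HdfsKeys() { }

  // copied from DFSConfigKeys
  static final String DFS_REPLICATION_KEY = "dfs.replication";
  static final String DFS_BLOCK_SIZE_KEY = "dfs.blocksize";
  static final String DFS_NAMENODE_KERBEROS_PRINCIPAL_KEY =
      "dfs.namenode.kerberos.principal";
}
```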

HDFS-9301 should make these changes something I could revert —with other 
projects never noticing they existed —but I've left them in to isolate me from 
any more situations like this. To be completely ruthless: I don't trust that 
code not to break my builds any more.

Behaviour-wise, I've not seen much in the way of changes; all tests work the 
same. Oh, and Spark wouldn't compile against 3.0 because an exception tagged 
@Deprecated since Hadoop 0.18 got pulled. Trivially fixed.

Returning to the pending 2.8.0 release, there's a way to find out what's going 
to break: build and test things against the snapshots, without waiting for the 
beta releases and expecting the downstream projects to do it for you. If they 
don't build, that's a success: you've found a compatibility problem to fix. If 
the tests fail, well, that's trouble —you're into finger-pointing time.

-Steve

> On 11 Nov 2015, at 23:26, Haohui Mai <ricet...@gmail.com> wrote:
> 
> bq. If and only if they take the Hadoop class path at face value.
> Many applications don’t because of conflicting dependencies and
> instead import specific jars.
> 
> We do make the assumptions that applications need to pick up all the
> dependency (either automatically or manually). The situation is
> similar with adding a new dependency into hdfs in a minor release.
> 
> Maven / gradle obviously help, but I'd love to hear more about it how
> you get it to work. In trunk hadoop-env.sh adds 118 jars into the
> class path. Are you manually importing 118 jars for every single
> applications?
> 
> 
> 
> On Wed, Nov 11, 2015 at 3:09 PM, Haohui Mai <ricet...@gmail.com> wrote:
>> bq. currently pulling in hadoop-client gives downstream apps
>> hadoop-hdfs-client, but not hadoop-hdfs server side, right?
>> 
>> Right now hadoop-client pulls in hadoop-hdfs directly to ensure a
>> smooth transition. Maybe we can revisit the decision in the 2.9 / 3.x?
>> 


