Sean Busbey created HADOOP-11680:
------------------------------------

             Summary: Deduplicate jars in convenience binary distribution
                 Key: HADOOP-11680
                 URL: https://issues.apache.org/jira/browse/HADOOP-11680
             Project: Hadoop Common
          Issue Type: Improvement
          Components: build
            Reporter: Sean Busbey
            Assignee: Sean Busbey

Pulled from discussion on HADOOP-11656

Colin wrote:

{quote}
bq. Andrew wrote: One additional note related to this, we can spend a lot of time right now distributing 100s of MBs of jar dependencies when launching a YARN job. Maybe this is ameliorated by the new shared distributed cache, but I've heard this come up quite a bit as a complaint. If we could meaningfully slim down our client, it could lead to a nice win.

I'm frustrated that nobody responded to my earlier suggestion that we de-duplicate jars. This would drastically reduce the size of our install, and without rearchitecting anything.

In fact I was so frustrated that I decided to write a program to do it myself and measure the delta. Here it is:

Before:
{code}
du -h /h
249M	/h
{code}

After:
{code}
du -h /h
140M	/h
{code}

Seems like deduplicating jars would be a much better project than splitting into a client jar, if we really cared about this.

<snip>
{quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
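
The quoted numbers work out to 249M - 140M = 109M saved, roughly 44% of the install. The program Colin used is not included in the message, so the following is only a minimal sketch of one way such a measurement could be made, assuming that "duplicate" means byte-identical jar files and that grouping by SHA-256 digest is acceptable. The function names are hypothetical and are not part of Hadoop or its build.

```python
import hashlib
import os
from collections import defaultdict


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_jars(root):
    """Group byte-identical *.jar files under root by content hash.

    Assumption: identical content is the duplication criterion; jars that
    differ only in version or build timestamp are NOT treated as duplicates.
    """
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so copies that were already replaced by links
            # are not counted a second time.
            if name.endswith(".jar") and not os.path.islink(path):
                by_digest[sha256_of(path)].append(path)
    # Keep only digests that occur more than once.
    return {d: sorted(p) for d, p in by_digest.items() if len(p) > 1}


def reclaimable_bytes(groups):
    """Bytes saved if each group kept one copy and dropped the rest."""
    return sum(
        os.path.getsize(paths[0]) * (len(paths) - 1)
        for paths in groups.values()
    )
```

Calling `find_duplicate_jars("/h")` and passing the result to `reclaimable_bytes` would give an estimate to compare against the `du` delta above. The sketch only reports duplicates; how the redundant copies should be removed (symlinks, a shared lib directory, or classpath changes) is a separate decision that this message does not settle.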