Hi Hadoop devs,

In the past, Hadoop has tended to stay pretty far behind the latest versions of
its dependencies. Part of that is due to fear of the breaking changes
that dependency updates bring in.

However, things have changed dramatically over the past few years. With
more focus on security, more vulnerabilities are being discovered in our
dependencies, and users are putting more pressure on us to patch Hadoop (and
its ecosystem) to use the latest dependency versions.

As an example, Jackson-databind had 20 CVEs published in the last year
alone.
https://www.cvedetails.com/product/42991/Fasterxml-Jackson-databind.html?vendor_id=15866

Jetty: 4 CVEs in 2019:
https://www.cvedetails.com/product/34824/Eclipse-Jetty.html?vendor_id=10410

We can no longer let Hadoop stay behind. The further behind we stay, the
harder it is to update. A good example is the Jersey 1 -> 2 migration,
HADOOP-15984 <https://issues.apache.org/jira/browse/HADOOP-15984>, contributed
by Akira. Jersey 1 is no longer supported, but the Jersey 2 migration is hard.
If a critical vulnerability is found in Jersey 1, it will leave us in a
bad situation, since we can't simply update the Jersey version and be done.

Hadoop 3 adds new public artifacts that shade these dependencies. We should
advocate that downstream applications use these shaded artifacts to avoid
breakage; a sketch of what that looks like is below.
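
For illustration, a downstream Maven project can build against the shaded
client artifacts (hadoop-client-api and hadoop-client-runtime) instead of the
unshaded hadoop-client. The version below is just a placeholder; use whichever
Hadoop 3 release you target:

    <!-- Compile against Hadoop's public API only; third-party
         dependencies are shaded out of the compile classpath. -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client-api</artifactId>
      <version>3.2.1</version>
    </dependency>
    <!-- Shaded transitive dependencies, needed only at run time. -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client-runtime</artifactId>
      <version>3.2.1</version>
      <scope>runtime</scope>
    </dependency>

Because the application then compiles only against Hadoop's public API, we can
bump the shaded dependency versions underneath without breaking downstream
builds.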

I'd like to hear your thoughts: are you okay with Hadoop keeping up with the
latest dependency updates, or would you rather stay behind to ensure
compatibility?

Coupled with that, I'd like to call for more frequent Hadoop releases for
the same purpose. IMHO that will require better infrastructure to assist the
release work and some rethinking of our current Hadoop code structure, such as
separating each subproject into its own repository with its own release
cadence. This may be controversial, but I think it'll be good for the project
in the long run.

Thanks,
Wei-Chiu
