Re: [DISCUSS] Adding support for Hadoop 3 and removing flink-shaded-hadoop

Chesnay Schepler Thu, 23 Apr 2020 02:51:26 -0700

This would only work so long as all Hadoop APIs do not directly expose
any transitive non-hadoop dependency.
Otherwise the user code classloader might search for this transitive
dependency in lib instead of the hadoop classpath (and possibly not find
it).


On 23/04/2020 11:34, Stephan Ewen wrote:

True, connectors built on Hadoop make this a bit more complex. That is also
the reason why Hadoop is on the "parent first" patterns.


Maybe this is a bit of a wild thought, but what would happen if we had a
"first class" notion of a Hadoop Classloader in the system, and the user
code classloader would explicitly fall back to that one whenever a class
whose name starts with "org.apache.hadoop" is not found? We could also
generalize this by associating plugin loaders with class name prefixes.

Then it would try to load from the user code jar, and if the class was not
found, load it from the hadoop classpath.
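
The fallback idea above can be sketched roughly as follows. This is only an
illustration under stated assumptions: `PrefixFallbackClassLoader` and its
constructor parameters are hypothetical names, not existing Flink classes,
and the sketch keeps the JDK's standard parent-first delegation for
simplicity rather than reproducing Flink's actual user code classloader.

```java
import java.net.URL;
import java.net.URLClassLoader;

/**
 * Hypothetical sketch: a user-code classloader that falls back to a
 * dedicated classloader (e.g. one built from the Hadoop classpath) for
 * classes whose name starts with a configured prefix.
 */
class PrefixFallbackClassLoader extends URLClassLoader {
    private final String prefix;        // e.g. "org.apache.hadoop."
    private final ClassLoader fallback; // assumed to see the Hadoop jars

    PrefixFallbackClassLoader(
            URL[] userJars, ClassLoader parent, String prefix, ClassLoader fallback) {
        super(userJars, parent);
        this.prefix = prefix;
        this.fallback = fallback;
    }

    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        try {
            // First try the user code jars.
            return super.findClass(name);
        } catch (ClassNotFoundException e) {
            // Only classes under the prefix may come from the fallback loader.
            if (name.startsWith(prefix)) {
                return fallback.loadClass(name);
            }
            throw e;
        }
    }
}
```

Generalizing this to several plugin loaders would presumably mean replacing
the single prefix/fallback pair with a map from class name prefix to loader.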

On Thu, Apr 23, 2020 at 10:56 AM Chesnay Schepler <ches...@apache.org>
wrote:

although, if you can load the HADOOP_CLASSPATH as a plugin, then you can
also load it in the user-code classloader.

On 23/04/2020 10:50, Chesnay Schepler wrote:

@Stephan I'm not aware of anyone having tried that; possibly since we
have various connectors that require hadoop (hadoop-compat, hive,
orc/parquet/hbase, hadoop inputformats). This would require connectors
to be loaded as plugins (or having access to the plugin classloader)
to be feasible.

On 23/04/2020 09:59, Stephan Ewen wrote:

Hi all!

+1 for the simplification of dropping hadoop-shaded


Have we ever investigated how much work it would be to load the
HADOOP_CLASSPATH through the plugin loader? Then Hadoop's crazy
dependency footprint would not spoil the main classpath.

    - HDFS might be very simple, because file systems are already
      Plugin aware
    - Yarn would need some extra work. In essence, we would need to
      discover executors also through plugins
    - Kerberos is the other remaining bit. We would need to switch
      security modules to ServiceLoaders (which we should do anyways)
      and also pull them from plugins.
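
For the ServiceLoader point, a minimal sketch of what discovery through a
plugin classloader could look like. `SecurityModuleProvider` and
`SecurityModuleDiscovery` are hypothetical names used only for illustration;
they are not Flink's actual security module interfaces. Only
`java.util.ServiceLoader` is a real JDK API here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

/** Hypothetical service interface that a plugin jar would implement. */
interface SecurityModuleProvider {
    String name();
}

final class SecurityModuleDiscovery {
    private SecurityModuleDiscovery() {}

    /**
     * Collects all providers registered under META-INF/services that are
     * visible to the given (plugin) classloader.
     */
    static List<SecurityModuleProvider> discover(ClassLoader pluginClassLoader) {
        List<SecurityModuleProvider> providers = new ArrayList<>();
        for (SecurityModuleProvider provider :
                ServiceLoader.load(SecurityModuleProvider.class, pluginClassLoader)) {
            providers.add(provider);
        }
        return providers;
    }
}
```

A plugin would register its implementation in a
`META-INF/services/<fully qualified interface name>` file inside its jar;
a classloader that sees no such file simply yields an empty list.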

Best,
Stephan



On Thu, Apr 23, 2020 at 4:05 AM Xintong Song <tonysong...@gmail.com>
wrote:

+1 for supporting Hadoop 3.

I'm not familiar with the shading efforts, thus no comment on dropping
the flink-shaded-hadoop.


Correct me if I'm wrong. Although the default Hadoop version for
compiling Flink is currently 2.4.1, I think this does not mean Flink
should support only Hadoop 2.4+. So no matter which Hadoop version we use
for compiling by default, we need to use reflection for the Hadoop
features/APIs that are not supported in all versions anyway.


There are already many such reflections in `YarnClusterDescriptor` and
`YarnResourceManager`, and there might be more in the future. I'm
wondering whether we should have a unified mechanism (an interface /
abstract class or so) that handles all these kinds of Hadoop API
reflections in one place. Not necessarily in the scope of this discussion
though.
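
As a rough illustration of such a unified mechanism, here is a minimal
sketch of a helper that looks up an optional method once and reports
whether the running version supports it. `OptionalMethodInvoker` is a
hypothetical name, not an existing Flink class, and the usage below runs
against a JDK class rather than a real Hadoop API.

```java
import java.lang.reflect.Method;
import java.util.Optional;

/** Hypothetical helper wrapping a method that may not exist in all versions. */
final class OptionalMethodInvoker {
    private final Method method; // null if the method is not available

    OptionalMethodInvoker(Class<?> target, String methodName, Class<?>... parameterTypes) {
        Method found;
        try {
            found = target.getMethod(methodName, parameterTypes);
        } catch (NoSuchMethodException e) {
            // The running version does not offer this API.
            found = null;
        }
        this.method = found;
    }

    boolean isSupported() {
        return method != null;
    }

    /**
     * Invokes the method if it exists. Returns an empty Optional when the
     * method is unavailable or when it returns null (including void methods).
     */
    Optional<Object> invoke(Object receiver, Object... args)
            throws ReflectiveOperationException {
        if (method == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(method.invoke(receiver, args));
    }
}
```

For example, `new OptionalMethodInvoker(String.class, "length")` reports
`isSupported()` as true, while a lookup for a method that does not exist
reports false and its `invoke` call returns an empty result instead of
failing.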


Thank you~

Xintong Song



On Wed, Apr 22, 2020 at 8:32 PM Chesnay Schepler <ches...@apache.org>
wrote:

1) Likely not, as this again introduces a hard-dependency on
flink-shaded-hadoop.
2) Indeed; this will be something the user/cloud providers have to deal
with now.
3) Yes.

As a small note, we can still keep the hadoop-2 version of flink-shaded
around for existing users.
What I suggested was to just not release hadoop-3 versions.

On 22/04/2020 14:19, Yang Wang wrote:

Thanks Robert for starting this significant discussion.

Hadoop 3 has been released for a long time, and many companies have
already put it in production. Regardless of whether you use
flink-shaded-hadoop2, Flink can currently already run on YARN 3 (not sure
about HDFS), since the YARN API is always backward compatible. The
difference is that we cannot benefit from the new features, because we
use hadoop-2.4 as the compile dependency. So we need to use reflection
for new features (node labels, tags, etc.).

All in all, I am in favour of dropping flink-shaded-hadoop. I just have
some questions.
1. Do we still support the "-include-hadoop" profile? If yes, what will
we get in the lib dir?
2. I am not sure whether dropping flink-shaded-hadoop will cause class
conflict problems. If we use "export HADOOP_CLASSPATH=`hadoop classpath`"
for the hadoop env setup, then many jars will be appended to the Flink
client classpath.
3. The compile hadoop version is still 2.4.1, right?


Best,
Yang


On Wed, Apr 22, 2020 at 4:18 PM Sivaprasanna <sivaprasanna...@gmail.com>
wrote:

I agree with Aljoscha. Otherwise I can see a lot of tickets getting
created saying the application is not running on YARN.

Cheers,
Sivaprasanna

On Wed, Apr 22, 2020 at 1:00 PM Aljoscha Krettek <aljos...@apache.org>
wrote:

+1 to getting rid of flink-shaded-hadoop. But we need to document how
people can now get a Flink dist that works with Hadoop. Currently, when
you download the single shaded jar you immediately get support for
submitting to YARN via bin/flink run.

Aljoscha


On 22.04.20 09:08, Till Rohrmann wrote:

Hi Robert,

I think it would be a helpful simplification of Flink's build setup if we
can get rid of flink-shaded-hadoop. Moreover relying only on the vanilla
Hadoop dependencies for the modules which interact with Hadoop/Yarn
sounds like a good idea to me.

Adding support for Hadoop 3 would also be nice. I'm not sure, though, how
Hadoop's APIs have changed between 2 and 3. It might be necessary to
introduce some bridges in order to make it work.

Cheers,
Till

On Tue, Apr 21, 2020 at 4:37 PM Robert Metzger <rmetz...@apache.org>
wrote:

Hi all,

for the upcoming 1.11 release, I started looking into adding support for
Hadoop 3[1] for Flink. I have explored a little bit already into adding
shaded hadoop 3 into “flink-shaded”, and some mechanisms for switching
between Hadoop 2 and 3 dependencies in the Flink build.

However, Chesnay made me aware that we could also go a different route:
We let Flink depend on vanilla Hadoop dependencies and stop providing
shaded fat jars for Hadoop through “flink-shaded”.

Why?
- Maintaining properly shaded Hadoop fat jars is a lot of work (we have
insufficient test coverage for all kinds of Hadoop features)
- For Hadoop 2, there are already some known and unresolved issues with
our shaded jars that we didn’t manage to fix

Users will have to use Flink with Hadoop by relying on vanilla or
vendor-provided Hadoop dependencies.

What do you think?

Best,
Robert

[1] https://issues.apache.org/jira/browse/FLINK-11086
