Hey Robert and others, overall +1 to support Hadoop 3. It would be great to unblock Flink support in EMR 6.0 as noted in the linked FLINK ticket.
The arguments raised against flink-shaded-hadoop make sense to me. I have a few general questions still:

1) Will the flink-shaded-hadoop module (in apache/flink-shaded) be fully dropped after this change? Or do you plan to keep it (allowing users to build their own shaded Hadoop if needed)?
2) I find Stephan's ideas pretty interesting. Will there be an official follow-up to investigate those?
3) What will we tell users that run into class loading conflicts after this change? And what are the "expected" conflicts we might actually see?

– Ufuk

PS: Robert opened a draft PR here: https://github.com/apache/flink/pull/11983

On Sun, May 3, 2020 at 12:02 PM Konstantin Knauf <kna...@apache.org> wrote:

> Hi Chesnay, Hi Robert,
>
> I have a bit of a naive question. I assume the reason for introducing
> flink-shaded-hadoop was dependency conflicts between Hadoop, Flink and/or
> user code. When we drop it now, is it because
>
> a) it was not worth it (the value provided did not justify the maintenance
> overhead and the issues introduced)
> b) we don't think it is a problem anymore
> c) priorities have shifted and it is *now* not worth it anymore
> d) something else
>
> Cheers,
>
> Konstantin
>
> On Sun, Apr 26, 2020 at 10:25 PM Stephan Ewen <se...@apache.org> wrote:
>
> > Indeed, that would be the assumption: that Hadoop does not expose its
> > transitive libraries on its public API surface.
> >
> > From vague memory, I think that's pretty much true so far. I only
> > remember Kinesis and Calcite as counter-examples, which exposed Guava
> > classes as part of the public API.
> > But that is definitely the "weak spot" of this approach. Plus, as with
> > all custom class loaders, there is the fact that the Thread Context
> > Class Loader does not really work well anymore.
> >
> > On Thu, Apr 23, 2020 at 11:50 AM Chesnay Schepler <ches...@apache.org>
> > wrote:
> >
> > > This would only work so long as all Hadoop APIs do not directly expose
> > > any transitive non-Hadoop dependency.
> > > Otherwise the user code classloader might search for this transitive
> > > dependency in lib instead of the Hadoop classpath (and possibly not
> > > find it).
> > >
> > > On 23/04/2020 11:34, Stephan Ewen wrote:
> > > > True, connectors built on Hadoop make this a bit more complex. That
> > > > is also the reason why Hadoop is on the "parent first" patterns.
> > > >
> > > > Maybe this is a bit of a wild thought, but what would happen if we
> > > > had a "first class" notion of a Hadoop Classloader in the system,
> > > > and the user code classloader would explicitly fall back to that one
> > > > whenever a class whose name starts with "org.apache.hadoop" is not
> > > > found? We could also generalize this by associating plugin loaders
> > > > with class name prefixes.
> > > >
> > > > Then it would try to load from the user code jar, and if the class
> > > > was not found, load it from the Hadoop classpath.
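For illustration, a minimal sketch of the fallback Stephan describes, assuming a child-first user-code classloader; the class name and constructor below are hypothetical, not Flink's actual classloading code:

import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical sketch: a child-first user-code classloader that falls back
// to a dedicated Hadoop classloader for "org.apache.hadoop" classes it
// cannot resolve from the user jars.
public class HadoopFallbackClassLoader extends URLClassLoader {

    private final ClassLoader hadoopClassLoader;

    public HadoopFallbackClassLoader(URL[] userJarUrls, ClassLoader hadoopClassLoader) {
        // A null parent keeps lookups child-first in this sketch; java.*
        // classes are still resolved by the bootstrap loader.
        super(userJarUrls, null);
        this.hadoopClassLoader = hadoopClassLoader;
    }

    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        try {
            // First, try the user-code jars.
            return super.findClass(name);
        } catch (ClassNotFoundException e) {
            // Fall back to the Hadoop classpath, but only for Hadoop classes.
            if (name.startsWith("org.apache.hadoop")) {
                return hadoopClassLoader.loadClass(name);
            }
            throw e;
        }
    }
}

Generalizing this, as Stephan suggests, would mean mapping class-name prefixes to plugin loaders instead of hard-coding "org.apache.hadoop".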
> > > > On Thu, Apr 23, 2020 at 10:56 AM Chesnay Schepler <ches...@apache.org>
> > > > wrote:
> > > >
> > >> Although, if you can load the HADOOP_CLASSPATH as a plugin, then you
> > >> can also load it in the user-code classloader.
> > >>
> > >> On 23/04/2020 10:50, Chesnay Schepler wrote:
> > >>> @Stephan I'm not aware of anyone having tried that; possibly since
> > >>> we have various connectors that require Hadoop (hadoop-compat, hive,
> > >>> orc/parquet/hbase, hadoop inputformats). This would require
> > >>> connectors to be loaded as plugins (or having access to the plugin
> > >>> classloader) to be feasible.
> > >>>
> > >>> On 23/04/2020 09:59, Stephan Ewen wrote:
> > >>>> Hi all!
> > >>>>
> > >>>> +1 for the simplification of dropping hadoop-shaded
> > >>>>
> > >>>> Have we ever investigated how much work it would be to load the
> > >>>> HADOOP_CLASSPATH through the plugin loader? Then Hadoop's crazy
> > >>>> dependency footprint would not spoil the main classpath.
> > >>>>
> > >>>> - HDFS might be very simple, because file systems are already
> > >>>> plugin-aware
> > >>>> - Yarn would need some extra work. In essence, we would need to
> > >>>> discover executors also through plugins
> > >>>> - Kerberos is the other remaining bit. We would need to switch
> > >>>> security modules to ServiceLoaders (which we should do anyway) and
> > >>>> also pull them from plugins.
> > >>>>
> > >>>> Best,
> > >>>> Stephan
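As a rough sketch of the ServiceLoader direction for security modules; the SecurityModule interface below is a hypothetical stand-in, not Flink's actual security SPI:

import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

// Hypothetical stand-in for a security module SPI.
interface SecurityModule {
    void install() throws Exception;
}

public final class SecurityModules {

    // Discovers all SecurityModule implementations visible to the given
    // (e.g. plugin) classloader. Implementations register themselves via a
    // META-INF/services/SecurityModule file in their jar.
    public static List<SecurityModule> discover(ClassLoader pluginClassLoader) {
        List<SecurityModule> modules = new ArrayList<>();
        for (SecurityModule module :
                ServiceLoader.load(SecurityModule.class, pluginClassLoader)) {
            modules.add(module);
        }
        return modules;
    }
}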
> > >>>>
> > >>>> On Thu, Apr 23, 2020 at 4:05 AM Xintong Song <tonysong...@gmail.com>
> > >>>> wrote:
> > >>>>
> > >>>>> +1 for supporting Hadoop 3.
> > >>>>>
> > >>>>> I'm not familiar with the shading efforts, thus no comment on
> > >>>>> dropping the flink-shaded-hadoop.
> > >>>>>
> > >>>>> Correct me if I'm wrong: although the default Hadoop version for
> > >>>>> compiling in Flink is currently 2.4.1, I think this does not mean
> > >>>>> Flink should support only Hadoop 2.4+. So no matter which Hadoop
> > >>>>> version we use for compiling by default, we need to use reflection
> > >>>>> for the Hadoop features/APIs that are not supported in all
> > >>>>> versions anyway.
> > >>>>>
> > >>>>> There are already many such reflections in `YarnClusterDescriptor`
> > >>>>> and `YarnResourceManager`, and there might be more in the future.
> > >>>>> I'm wondering whether we should have a unified mechanism (an
> > >>>>> interface / abstract class or so) that handles all of these Hadoop
> > >>>>> API reflections in one place. Not necessarily in the scope of this
> > >>>>> discussion, though.
> > >>>>>
> > >>>>> Thank you~
> > >>>>>
> > >>>>> Xintong Song
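One concrete illustration of the reflection pattern Xintong describes: guarding a YARN API that only exists from Hadoop 2.6 on. setNodeLabelExpression is a real YARN method; the wrapper class and method names here are hypothetical:

import java.lang.reflect.Method;

import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;

// Sketch: call a newer YARN API via reflection so the code still compiles
// and runs against Hadoop versions that lack it.
public final class YarnApiCompat {

    public static void trySetNodeLabel(ApplicationSubmissionContext context, String nodeLabel) {
        try {
            Method setter = ApplicationSubmissionContext.class
                    .getMethod("setNodeLabelExpression", String.class);
            setter.invoke(context, nodeLabel);
        } catch (NoSuchMethodException e) {
            // Hadoop < 2.6: node labels are not supported, so skip silently.
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException("Failed to set node label reflectively", e);
        }
    }
}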
> > >>>>>
> > >>>>> On Wed, Apr 22, 2020 at 8:32 PM Chesnay Schepler <ches...@apache.org>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> 1) Likely not, as this again introduces a hard dependency on
> > >>>>>> flink-shaded-hadoop.
> > >>>>>> 2) Indeed; this will be something the users/cloud providers have
> > >>>>>> to deal with now.
> > >>>>>> 3) Yes.
> > >>>>>>
> > >>>>>> As a small note, we can still keep the hadoop-2 version of
> > >>>>>> flink-shaded around for existing users.
> > >>>>>> What I suggested was to just not release hadoop-3 versions.
> > >>>>>>
> > >>>>>> On 22/04/2020 14:19, Yang Wang wrote:
> > >>>>>>> Thanks Robert for starting this significant discussion.
> > >>>>>>>
> > >>>>>>> Hadoop 3 has been released for a long time now and many companies
> > >>>>>>> have already put it in production. Whether or not you are using
> > >>>>>>> flink-shaded-hadoop2, Flink can already run on YARN 3 (not sure
> > >>>>>>> about HDFS), since the YARN API is always backward compatible.
> > >>>>>>> The difference is that we cannot benefit from the new features,
> > >>>>>>> because we are using hadoop-2.4 as the compile dependency. So we
> > >>>>>>> then need to use reflection for new features (node labels, tags,
> > >>>>>>> etc.).
> > >>>>>>>
> > >>>>>>> All in all, I am in favour of dropping flink-shaded-hadoop. I
> > >>>>>>> just have some questions.
> > >>>>>>> 1. Do we still support the "-include-hadoop" profile? If yes,
> > >>>>>>> what will we get in the lib dir?
> > >>>>>>> 2. I am not sure whether dropping flink-shaded-hadoop will cause
> > >>>>>>> some class conflict problems. If we use "export
> > >>>>>>> HADOOP_CLASSPATH=`hadoop classpath`" for the Hadoop env setup,
> > >>>>>>> then many jars will be appended to the Flink client classpath.
> > >>>>>>> 3. The compile Hadoop version is still 2.4.1, right?
> > >>>>>>>
> > >>>>>>> Best,
> > >>>>>>> Yang
> > >>>>>>>
> > >>>>>>> On Wed, Apr 22, 2020 at 4:18 PM Sivaprasanna
> > >>>>>>> <sivaprasanna...@gmail.com> wrote:
> > >>>>>>>
> > >>>>>>>> I agree with Aljoscha. Otherwise I can see a lot of tickets
> > >>>>>>>> getting created saying the application is not running on YARN.
> > >>>>>>>>
> > >>>>>>>> Cheers,
> > >>>>>>>> Sivaprasanna
> > >>>>>>>>
> > >>>>>>>> On Wed, Apr 22, 2020 at 1:00 PM Aljoscha Krettek
> > >>>>>>>> <aljos...@apache.org> wrote:
> > >>>>>>>>
> > >>>>>>>>> +1 to getting rid of flink-shaded-hadoop. But we need to
> > >>>>>>>>> document how people can now get a Flink dist that works with
> > >>>>>>>>> Hadoop. Currently, when you download the single shaded jar you
> > >>>>>>>>> immediately get support for submitting to YARN via bin/flink
> > >>>>>>>>> run.
> > >>>>>>>>>
> > >>>>>>>>> Aljoscha
> > >>>>>>>>>
> > >>>>>>>>> On 22.04.20 09:08, Till Rohrmann wrote:
> > >>>>>>>>>> Hi Robert,
> > >>>>>>>>>>
> > >>>>>>>>>> I think it would be a helpful simplification of Flink's build
> > >>>>>>>>>> setup if we can get rid of flink-shaded-hadoop. Moreover,
> > >>>>>>>>>> relying only on the vanilla Hadoop dependencies for the
> > >>>>>>>>>> modules which interact with Hadoop/Yarn sounds like a good
> > >>>>>>>>>> idea to me.
> > >>>>>>>>>>
> > >>>>>>>>>> Adding support for Hadoop 3 would also be nice. I'm not sure,
> > >>>>>>>>>> though, how Hadoop's APIs have changed between 2 and 3. It
> > >>>>>>>>>> might be necessary to introduce some bridges in order to make
> > >>>>>>>>>> it work.
> > >>>>>>>>>>
> > >>>>>>>>>> Cheers,
> > >>>>>>>>>> Till
> > >>>>>>>>>>
> > >>>>>>>>>> On Tue, Apr 21, 2020 at 4:37 PM Robert Metzger
> > >>>>>>>>>> <rmetz...@apache.org> wrote:
> > >>>>>>>>>>> Hi all,
> > >>>>>>>>>>>
> > >>>>>>>>>>> For the upcoming 1.11 release, I started looking into adding
> > >>>>>>>>>>> support for Hadoop 3 [1] for Flink. I have already explored
> > >>>>>>>>>>> adding a shaded Hadoop 3 into “flink-shaded” a little bit,
> > >>>>>>>>>>> as well as some mechanisms for switching between Hadoop 2
> > >>>>>>>>>>> and 3 dependencies in the Flink build.
> > >>>>>>>>>>>
> > >>>>>>>>>>> However, Chesnay made me aware that we could also go a
> > >>>>>>>>>>> different route: we let Flink depend on vanilla Hadoop
> > >>>>>>>>>>> dependencies and stop providing shaded fat jars for Hadoop
> > >>>>>>>>>>> through “flink-shaded”.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Why?
> > >>>>>>>>>>> - Maintaining properly shaded Hadoop fat jars is a lot of
> > >>>>>>>>>>> work (we have insufficient test coverage for all kinds of
> > >>>>>>>>>>> Hadoop features)
> > >>>>>>>>>>> - For Hadoop 2, there are already some known and unresolved
> > >>>>>>>>>>> issues with our shaded jars that we didn’t manage to fix
> > >>>>>>>>>>>
> > >>>>>>>>>>> Users will have to use Flink with Hadoop by relying on
> > >>>>>>>>>>> vanilla or vendor-provided Hadoop dependencies.
> > >>>>>>>>>>>
> > >>>>>>>>>>> What do you think?
> > >>>>>>>>>>>
> > >>>>>>>>>>> Best,
> > >>>>>>>>>>> Robert
> > >>>>>>>>>>>
> > >>>>>>>>>>> [1] https://issues.apache.org/jira/browse/FLINK-11086
>
> --
>
> Konstantin Knauf
>
> https://twitter.com/snntrable
>
> https://github.com/knaufk