Re: Fine grained batch recovery vs. native libraries

Fabian Hueske Thu, 05 Sep 2019 04:19:37 -0700

Thanks for reporting the problem David!

Cheers,
Fabian


On Wed, Sep 4, 2019 at 2:09 PM David Morávek <d...@apache.org> wrote:

> Hi Chesnay, I've created FLINK-13958
> <https://issues.apache.org/jira/browse/FLINK-13958> to track the issue.
>
> Thanks,
> D.
>
> On Wed, Sep 4, 2019 at 1:56 PM Chesnay Schepler <ches...@apache.org>
> wrote:
>
> > This sounds like a serious bug, please open a JIRA ticket.
> >
> > On 04/09/2019 13:41, David Morávek wrote:
> > > Hi,
> > >
> > > > we're testing the newly released batch recovery and are running into
> > > > class loading related issues.
> > >
> > > > 1) We have a per-job Flink cluster
> > > 2) We use BATCH execution mode + region failover strategy
> > >
> > > > Point 1) should imply a single user code class loader per task manager
> > > > (because there is only a single pipeline, which reuses the class loader
> > > > cached in BlobLibraryCacheManager). We need this property because we
> > > > have UDFs that access C libraries using JNI (I think this may be a
> > > > fairly common use case when dealing with legacy code). JDK internals
> > > > <https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/ClassLoader.java#L2466>
> > > > make sure that a given native library can be loaded by only a single
> > > > class loader per JVM.
> > >
> > > > When region recovery is triggered, vertices that need to recover are
> > > > first reset back to the CREATED state and then rescheduled. If all tasks
> > > > in a task manager are reset, the cached class loader is released
> > > > <https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/execution/librarycache/BlobLibraryCacheManager.java#L338>.
> > > > This unfortunately causes the job to fail, because we then try to
> > > > reload a native library in a newly created class loader.
> > >
> > > > I know that it is always possible to distribute native libraries with
> > > > Flink's libs and load them using the system class loader, but this
> > > > introduces build and operations overhead and makes it really unfriendly
> > > > for the cluster user, so I'd rather not work around the issue this way
> > > > (a per-job cluster should be more user friendly).
> > >
> > > > I believe the correct approach would be not to release the cached class
> > > > loader while the job is recovering, even though no tasks are currently
> > > > registered with the TM.
> > >
> > > What do you think? Thanks for help.
> > >
> > > D.
> > >
> >
> >
>
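
For illustration only, the approach proposed in the quoted message (do not
release the cached class loader while the job is still recovering) could be
sketched roughly as below. This is not Flink's BlobLibraryCacheManager: the
class RecoveryAwareClassLoaderCache, its methods, and the jobActive flag are
all hypothetical names, and the sketch assumes something outside the cache
tells it when the job has terminated for good.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a class loader cache; NOT Flink's actual implementation. */
final class RecoveryAwareClassLoaderCache {

    /** One cached class loader plus the bookkeeping that decides when to close it. */
    private static final class Entry {
        final URLClassLoader classLoader;
        int registeredTasks;
        boolean jobActive = true; // assumed flag: job is running or recovering

        Entry(URLClassLoader classLoader) {
            this.classLoader = classLoader;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();

    /** Registers a task and returns the job's class loader, creating it on first use. */
    synchronized ClassLoader registerTask(String jobId, URL[] userJars) {
        Entry entry = entries.computeIfAbsent(
                jobId, id -> new Entry(new URLClassLoader(userJars)));
        entry.registeredTasks++;
        return entry.classLoader;
    }

    /** Unregisters a task; the class loader survives while the job is still active. */
    synchronized void unregisterTask(String jobId) {
        Entry entry = entries.get(jobId);
        if (entry == null) {
            return;
        }
        if (entry.registeredTasks > 0) {
            entry.registeredTasks--;
        }
        releaseIfUnused(jobId, entry);
    }

    /** Marks the job as terminated so that the class loader may finally be released. */
    synchronized void jobFinished(String jobId) {
        Entry entry = entries.get(jobId);
        if (entry == null) {
            return;
        }
        entry.jobActive = false;
        releaseIfUnused(jobId, entry);
    }

    /** Closes the class loader only if no task uses it AND the job is over. */
    private void releaseIfUnused(String jobId, Entry entry) {
        if (entry.registeredTasks == 0 && !entry.jobActive) {
            entries.remove(jobId);
            try {
                entry.classLoader.close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    public static void main(String[] args) {
        RecoveryAwareClassLoaderCache cache = new RecoveryAwareClassLoaderCache();
        ClassLoader first = cache.registerTask("job-1", new URL[0]);
        cache.unregisterTask("job-1"); // region failover resets every task on this worker
        ClassLoader second = cache.registerTask("job-1", new URL[0]);
        System.out.println(first == second); // prints true
    }
}
```

Because the same class loader instance is handed back after all tasks were
reset, a native library loaded through it would not have to be loaded a
second time. Whether the real fix for FLINK-13958 looks like this is not
something this sketch claims.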
