Hi Chesnay, I've created FLINK-13958
<https://issues.apache.org/jira/browse/FLINK-13958> to track the issue.

Thanks,
D.

On Wed, Sep 4, 2019 at 1:56 PM Chesnay Schepler <ches...@apache.org> wrote:

> This sounds like a serious bug, please open a JIRA ticket.
>
> On 04/09/2019 13:41, David Morávek wrote:
> > Hi,
> >
> > we're testing the newly released batch recovery and are running into
> class
> > loading related issues.
> >
> > 1) We have a per-job flink cluster
> > 2) We use BATCH execution mode + region failover strategy
> >
> > Point 1) should imply single user code class loader per task manager
> > (because there is only single pipeline, that reuses class loader cached
> in
> > BlobLibraryCacheManager). We need this property, because we have UDFs
> that
> > access C libraries using JNI (I think this may be fairly common use-case
> > when dealing with legacy code). JDK internals
> > <
> https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/ClassLoader.java#L2466
> >
> > make sure that single library can be only loaded by a single class loader
> > per JVM.
> >
> > When region recovery is triggered, vertices that need recover are first
> > reset back to CREATED stated and then rescheduled. In case all tasks in a
> > task manager are reset, this results in cached class loader being
> released
> > <
> https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/execution/librarycache/BlobLibraryCacheManager.java#L338
> >.
> > This unfortunately causes job failure, because we try to reload a native
> > library in a newly created class loader.
> >
> > I know that there is always possibility to distribute native libraries
> with
> > flink's libs and load it using system class loader, but this introduces a
> > build & operations overhead and just make it really unfriendly for
> cluster
> > user, so I'd rather not work around the issue this way (per-job cluster
> > should be more user friendly).
> >
> > I believe the correct approach would be not to release cached class
> loader
> > if the job is recovering, even though there are no tasks currently
> > registered with TM.
> >
> > What do you think? Thanks for help.
> >
> > D.
> >
>
>

Reply via email to