Thanks for reporting the problem David! Cheers, Fabian
Am Mi., 4. Sept. 2019 um 14:09 Uhr schrieb David Morávek <d...@apache.org>: > Hi Chesnay, I've created FLINK-13958 > <https://issues.apache.org/jira/browse/FLINK-13958> to track the issue. > > Thanks, > D. > > On Wed, Sep 4, 2019 at 1:56 PM Chesnay Schepler <ches...@apache.org> > wrote: > > > This sounds like a serious bug, please open a JIRA ticket. > > > > On 04/09/2019 13:41, David Morávek wrote: > > > Hi, > > > > > > we're testing the newly released batch recovery and are running into > > class > > > loading related issues. > > > > > > 1) We have a per-job flink cluster > > > 2) We use BATCH execution mode + region failover strategy > > > > > > Point 1) should imply single user code class loader per task manager > > > (because there is only single pipeline, that reuses class loader cached > > in > > > BlobLibraryCacheManager). We need this property, because we have UDFs > > that > > > access C libraries using JNI (I think this may be fairly common > use-case > > > when dealing with legacy code). JDK internals > > > < > > > https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/ClassLoader.java#L2466 > > > > > > make sure that single library can be only loaded by a single class > loader > > > per JVM. > > > > > > When region recovery is triggered, vertices that need recover are first > > > reset back to CREATED stated and then rescheduled. In case all tasks > in a > > > task manager are reset, this results in cached class loader being > > released > > > < > > > https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/execution/librarycache/BlobLibraryCacheManager.java#L338 > > >. > > > This unfortunately causes job failure, because we try to reload a > native > > > library in a newly created class loader. > > > > > > I know that there is always possibility to distribute native libraries > > with > > > flink's libs and load it using system class loader, but this > introduces a > > > build & operations overhead and just make it really unfriendly for > > cluster > > > user, so I'd rather not work around the issue this way (per-job cluster > > > should be more user friendly). > > > > > > I believe the correct approach would be not to release cached class > > loader > > > if the job is recovering, even though there are no tasks currently > > > registered with TM. > > > > > > What do you think? Thanks for help. > > > > > > D. > > > > > > > >