Hi Ufuk,

Update: I've pinned down the issue. It's multiple classloaders loading `libhadoop.so`:
```
failed to load native hadoop with error:
java.lang.UnsatisfiedLinkError: Native Library /usr/lib/libhadoop.so already loaded in another classloader
```

I'm not quite sure what the solution is. Ideally Flink would destroy a classloader when a job is canceled, but perhaps there's a JVM limitation there? Putting the libraries into `/usr/lib` or `/lib` does not work (as suggested by Chesnay in the ticket); I get the same error. I might see if I can put a jar with `org.apache.hadoop.io.compress` in `/flink/install/lib` and then remove it from my jar. It's not an ideal solution, but I can't think of anything else.
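For what it's worth, here's a minimal sketch of what I think is happening. The jar path and the `NativeProbe` class are hypothetical stand-ins for my job jar and whatever ends up calling `System.loadLibrary("hadoop")`, but I'd expect it to reproduce the same error:

```scala
import java.net.URLClassLoader
import java.nio.file.Paths

object ClassLoaderRepro {
  def main(args: Array[String]): Unit = {
    // Hypothetical jar containing a small helper class com.example.NativeProbe,
    // whose static `load` method just calls System.loadLibrary("hadoop").
    val jar = Paths.get("native-probe.jar").toUri.toURL

    def loadInFreshLoader(): Unit = {
      // A null parent isolates the loader, roughly like Flink's
      // per-job user-code classloader.
      val loader = new URLClassLoader(Array(jar), null)
      val cls = loader.loadClass("com.example.NativeProbe")
      cls.getMethod("load").invoke(null)
    }

    loadInFreshLoader() // first call: libhadoop.so loads fine
    loadInFreshLoader() // second call: UnsatisfiedLinkError, "Native Library
                        // ... already loaded in another classloader"
  }
}
```

As I understand it, the JVM only unloads a native library once the classloader that loaded it has been garbage-collected, which would explain why restarting the task managers clears the error.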
Best,

Aaron Levin

On Thu, Jan 24, 2019 at 12:18 PM Aaron Levin <aaronle...@stripe.com> wrote:

> Hi Ufuk,
>
> I'm starting to believe the bug is much deeper than the originally reported error because putting the libraries in `/usr/lib` or `/lib` does not work. This morning I dug into why putting `libhadoop.so` into `/usr/lib` didn't work, despite that directory being in the `java.library.path` at the call site of the error. I wrote a small program to test the loading of native libraries, and it was able to successfully load `libhadoop.so`. I'm very perplexed. Could this be related to the way Flink shades its Hadoop dependencies?
>
> Here is my program and its output:
>
> ```
> $ cat LibTest.scala
> package com.redacted.flink
>
> object LibTest {
>   def main(args: Array[String]): Unit = {
>     val library = args(0)
>
>     System.out.println(s"java.library.path=${System.getProperty("java.library.path")}")
>     System.out.println(s"Attempting to load $library")
>     System.out.flush()
>     System.loadLibrary(library)
>     System.out.println("Successfully loaded")
>     System.out.flush()
>   }
> }
> ```
>
> I then tried running that on one of the task managers with `hadoop` as an argument:
>
> ```
> $ java -jar lib_test_deploy.jar hadoop
> java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
> Attempting to load hadoop
> Exception in thread "main" java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
>     at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
>     at java.lang.Runtime.loadLibrary0(Runtime.java:870)
>     at java.lang.System.loadLibrary(System.java:1122)
>     at com.stripe.flink.LibTest$.main(LibTest.scala:11)
>     at com.stripe.flink.LibTest.main(LibTest.scala)
> ```
>
> I then copied the native libraries into `/usr/lib/` and ran it again:
>
> ```
> $ sudo cp /usr/local/hadoop/lib/native/libhadoop.so /usr/lib/
> $ java -jar lib_test_deploy.jar hadoop
> java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
> Attempting to load hadoop
> Successfully loaded
> ```
>
> Any ideas?
>
> On Wed, Jan 23, 2019 at 7:13 PM Aaron Levin <aaronle...@stripe.com> wrote:
>
>> Hi Ufuk,
>>
>> One more update: I tried copying all the Hadoop native `.so` files (mainly `libhadoop.so`) into `/lib` and I am still experiencing the issue I reported. I also tried naively adding the `.so` files to the jar with the Flink application and am still experiencing the issue I reported (however, I'm going to investigate this further as I might not have done it correctly).
>>
>> Best,
>>
>> Aaron Levin
>>
>> On Wed, Jan 23, 2019 at 3:18 PM Aaron Levin <aaronle...@stripe.com> wrote:
>>
>>> Hi Ufuk,
>>>
>>> Two updates:
>>>
>>> 1. As suggested in the ticket, I naively copied every `.so` in `hadoop-3.0.0/lib/native/` into `/lib/` and this did not seem to help. My knowledge of how shared libs get picked up is hazy, so I'm not sure if blindly copying them like that should work. I did check what `System.getProperty("java.library.path")` returns at the call site, and it's: java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
>>> 2. The exception I see comes from `org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy` (stack trace below). This uses `System.loadLibrary("hadoop")`.
>>>
>>> [2019-01-23 19:52:33.081216] java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z
>>> [2019-01-23 19:52:33.081376] at org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy(Native Method)
>>> [2019-01-23 19:52:33.081406] at org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:63)
>>> [2019-01-23 19:52:33.081429] at org.apache.hadoop.io.compress.SnappyCodec.getDecompressorType(SnappyCodec.java:195)
>>> [2019-01-23 19:52:33.081457] at org.apache.hadoop.io.compress.CodecPool.getDecompressor(CodecPool.java:181)
>>> [2019-01-23 19:52:33.081494] at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:2037)
>>> [2019-01-23 19:52:33.081517] at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1923)
>>> [2019-01-23 19:52:33.081549] at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1872)
>>> ... (redacted) ...
>>> [2019-01-23 19:52:33.081728] at scala.collection.immutable.List.foreach(List.scala:392)
>>> ... (redacted) ...
>>> [2019-01-23 19:52:33.081832] at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:94)
>>> [2019-01-23 19:52:33.081854] at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:58)
>>> [2019-01-23 19:52:33.081882] at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:99)
>>> [2019-01-23 19:52:33.081904] at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
>>> [2019-01-23 19:52:33.081946] at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
>>> [2019-01-23 19:52:33.081967] at java.lang.Thread.run(Thread.java:748)
>>>
>>> On Tue, Jan 22, 2019 at 2:31 PM Aaron Levin <aaronle...@stripe.com> wrote:
>>>
>>>> Hey Ufuk,
>>>>
>>>> So, I looked into this a little bit:
>>>>
>>>> 1. clarification: my issues are with the Hadoop-related snappy libraries and not libsnappy itself (this is my bad for not being clearer, sorry!). I already have `libsnappy` on my classpath, but I am looking into including the Hadoop snappy libraries.
>>>> 2. exception: I don't see the class loading error. I'm going to try to put some more instrumentation in and see if I can get a clearer stack trace (right now I get an NPE on closing a sequence file in a finalizer; when I last logged the exception it was something deep in Hadoop's snappy libs, so I'll get clarification soon). There's a sketch of the probe I have in mind after this list.
>>>> 3. I'm looking into including Hadoop's snappy libs in my jar and we'll see if that resolves the problem.
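>>>>
>>>> The probe for (2) would be something like this sketch (untested; it just asks Hadoop's native loader directly what it managed to load):
>>>>
>>>> ```scala
>>>> import org.apache.hadoop.util.NativeCodeLoader
>>>>
>>>> object NativeLoaderProbe {
>>>>   def main(args: Array[String]): Unit = {
>>>>     // Where the JVM will look for native libraries at this call site.
>>>>     println(s"java.library.path=${System.getProperty("java.library.path")}")
>>>>     // Did libhadoop.so load at all?
>>>>     println(s"isNativeCodeLoaded=${NativeCodeLoader.isNativeCodeLoaded()}")
>>>>     // buildSupportsSnappy is itself a native method, so only call it
>>>>     // once libhadoop has actually been loaded.
>>>>     if (NativeCodeLoader.isNativeCodeLoaded())
>>>>       println(s"buildSupportsSnappy=${NativeCodeLoader.buildSupportsSnappy()}")
>>>>   }
>>>> }
>>>> ```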
>>>>
>>>> Thanks again for your help!
>>>>
>>>> Best,
>>>>
>>>> Aaron Levin
>>>>
>>>> On Tue, Jan 22, 2019 at 10:47 AM Aaron Levin <aaronle...@stripe.com> wrote:
>>>>
>>>>> Hey,
>>>>>
>>>>> Thanks so much for the help! This is awesome. I'll start looking into all of this right away and report back.
>>>>>
>>>>> Best,
>>>>>
>>>>> Aaron Levin
>>>>>
>>>>> On Mon, Jan 21, 2019 at 5:16 PM Ufuk Celebi <u...@apache.org> wrote:
>>>>>
>>>>>> Hey Aaron,
>>>>>>
>>>>>> sorry for the late reply.
>>>>>>
>>>>>> (1) I think I was able to reproduce this issue using snappy-java. I've filed a ticket here: https://issues.apache.org/jira/browse/FLINK-11402. Can you check whether the ticket description is in line with what you are experiencing? Most importantly, do you see the same exception being reported after cancelling and restarting the job?
>>>>>>
>>>>>> (2) I don't think it's caused by the environment options not being picked up. You can check the head of the log files of the JobManager or TaskManager to verify that your provided option is picked up as expected. You should see something similar to this:
>>>>>>
>>>>>> 2019-01-21 22:53:49,863 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --------------------------------------------------------------------------------
>>>>>> 2019-01-21 22:53:49,864 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Starting StandaloneSessionClusterEntrypoint (Version: 1.7.0, Rev:49da9f9, Date:28.11.2018 @ 17:59:06 UTC)
>>>>>> ...
>>>>>> 2019-01-21 22:53:49,865 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JVM Options:
>>>>>> 2019-01-21 22:53:49,865 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint -    -Xms1024m
>>>>>> 2019-01-21 22:53:49,865 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint -    -Xmx1024m
>>>>>> You are looking for this line ----> 2019-01-21 22:53:49,865 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint -    -Djava.library.path=/.../org/xerial/snappy/native/Mac/x86_64/ <----
>>>>>> 2019-01-21 22:53:49,865 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint -    -Dlog.file=/.../flink-1.7.0/log/flink-standalonesession-0.local.log
>>>>>> ...
>>>>>> 2019-01-21 22:53:49,866 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Program Arguments:
>>>>>> 2019-01-21 22:53:49,866 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint -    --configDir
>>>>>> 2019-01-21 22:53:49,866 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint -    /.../flink-1.7.0/conf
>>>>>> 2019-01-21 22:53:49,866 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint -    --executionMode
>>>>>> 2019-01-21 22:53:49,866 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint -    cluster
>>>>>> ...
>>>>>> 2019-01-21 22:53:49,866 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --------------------------------------------------------------------------------
>>>>>>
>>>>>> Can you verify that you see the log messages as expected?
>>>>>>
>>>>>> (3) As noted in FLINK-11402, is it possible to package the snappy library as part of your user code instead of loading the library via java.library.path? In my example, that seems to work fine.
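>>>>>>
>>>>>> If you're on sbt, this sketch is roughly what I mean (the version is a placeholder; any recent snappy-java should do):
>>>>>>
>>>>>> ```scala
>>>>>> // build.sbt: bundle snappy-java with the user code. snappy-java
>>>>>> // extracts and loads its own native library at runtime, so it does
>>>>>> // not depend on -Djava.library.path surviving a job restart.
>>>>>> libraryDependencies += "org.xerial.snappy" % "snappy-java" % "1.1.7.2"
>>>>>> ```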
>>>>>>
>>>>>> – Ufuk
>>>>>>
>>>>>> On Thu, Jan 17, 2019 at 5:53 PM Aaron Levin <aaronle...@stripe.com> wrote:
>>>>>> >
>>>>>> > Hello!
>>>>>> >
>>>>>> > *tl;dr*: settings in `env.java.opts` seem to stop having an effect when a job is canceled or fails and is then restarted (with or without savepoints/checkpoints). If I restart the task managers, the `env.java.opts` take effect again and our job runs without failure. More below.
>>>>>> >
>>>>>> > We consume Snappy-compressed sequence files in our Flink job. This requires access to the Hadoop native libraries. In our `flink-conf.yaml` for both the task manager and the job manager, we put:
>>>>>> >
>>>>>> > ```
>>>>>> > env.java.opts: -Djava.library.path=/usr/local/hadoop/lib/native
>>>>>> > ```
>>>>>> >
>>>>>> > If I launch our job on freshly-restarted task managers, the job operates fine. If at some point I cancel the job, or if the job restarts for some other reason, the job will begin to crashloop because it tries to open a Snappy-compressed file but doesn't have access to the codec from the native Hadoop libraries in `/usr/local/hadoop/lib/native`. If I then restart the task manager while the job is crashlooping, the job will start running without any codec failures.
>>>>>> >
>>>>>> > The only reason I can conjure that would cause the Snappy compression to fail is if the `env.java.opts` were not being passed through to the job on restart for some reason.
>>>>>> >
>>>>>> > Does anyone know what's going on? Am I missing some additional configuration? I really appreciate any help!
>>>>>> >
>>>>>> > About our setup:
>>>>>> >
>>>>>> > - Flink Version: 1.7.0
>>>>>> > - Deployment: Standalone in HA
>>>>>> > - Hadoop/S3 setup: we do *not* set `HADOOP_CLASSPATH`. We use Flink’s shaded jars to access our files in S3. We do not use the `bundled-with-hadoop` distribution of Flink.
>>>>>> >
>>>>>> > Best,
>>>>>> >
>>>>>> > Aaron Levin