Hi Ufuk,

Two updates:

1. As suggested in the ticket, I naively copied every `.so` in
`hadoop-3.0.0/lib/native/` into `/lib/`, and this did not seem to help. My
knowledge of how shared libs get picked up is hazy, so I'm not sure whether
blindly copying them like that should work. I did check what
`System.getProperty("java.library.path")` returns at the call-site, and
it's the following (note that `/lib` is on the path, so the copied libs
should in principle be findable):
java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
2. The exception I see comes from
`hadoop.util.NativeCodeLoader.buildSupportsSnappy` (stack-trace below, with
a small diagnostic sketch after it). That class calls
`System.loadLibrary("hadoop")` in its static initializer.

[2019-01-23 19:52:33.081216] java.lang.UnsatisfiedLinkError:
org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z
[2019-01-23 19:52:33.081376]  at
org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy(Native Method)
[2019-01-23 19:52:33.081406]  at
org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:63)
[2019-01-23 19:52:33.081429]  at
org.apache.hadoop.io.compress.SnappyCodec.getDecompressorType(SnappyCodec.java:195)
[2019-01-23 19:52:33.081457]  at
org.apache.hadoop.io.compress.CodecPool.getDecompressor(CodecPool.java:181)
[2019-01-23 19:52:33.081494]  at
org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:2037)
[2019-01-23 19:52:33.081517]  at
org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1923)
[2019-01-23 19:52:33.081549]  at
org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1872)
... (redacted) ...
[2019-01-23 19:52:33.081728]  at
scala.collection.immutable.List.foreach(List.scala:392)
... (redacted) ...
[2019-01-23 19:52:33.081832]  at
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:94)
[2019-01-23 19:52:33.081854]  at
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:58)
[2019-01-23 19:52:33.081882]  at
org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:99)
[2019-01-23 19:52:33.081904]  at
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
[2019-01-23 19:52:33.081946]  at
org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
[2019-01-23 19:52:33.081967]  at java.lang.Thread.run(Thread.java:748)
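
For completeness, here's a minimal sketch of the check I ran at the
call-site (the class name `NativeLoadCheck` is just for illustration; it
assumes `hadoop-common` is on the classpath):

```
import org.apache.hadoop.util.NativeCodeLoader;

public class NativeLoadCheck {
    public static void main(String[] args) {
        // The same property I logged at the call-site.
        System.out.println("java.library.path="
                + System.getProperty("java.library.path"));
        // Referencing NativeCodeLoader runs its static initializer, which
        // attempts System.loadLibrary("hadoop").
        System.out.println("libhadoop loaded: "
                + NativeCodeLoader.isNativeCodeLoaded());
        if (NativeCodeLoader.isNativeCodeLoaded()) {
            // The native method from the stack trace; it throws
            // UnsatisfiedLinkError if libhadoop was never loaded.
            System.out.println("snappy supported: "
                    + NativeCodeLoader.buildSupportsSnappy());
        }
    }
}
```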

On Tue, Jan 22, 2019 at 2:31 PM Aaron Levin <aaronle...@stripe.com> wrote:

> Hey Ufuk,
>
> So, I looked into this a little bit:
>
> 1. clarification: my issues are with the hadoop-related snappy libraries
> and not libsnappy itself (this is my bad for not being clearer, sorry!). I
> already have `libsnappy` on my classpath, but I am looking into including
> the hadoop snappy libraries.
> 2. exception: I don't see the class loading error. I'm going to add some
> more instrumentation and see if I can get a clearer stack trace (right
> now I get an NPE on closing a sequence file in a finalizer - when I last
> logged the exception it was something deep in hadoop's snappy libs - I'll
> have clarification soon).
> 3. I'm looking into including hadoop's snappy libs in my jar and we'll see
> if that resolves the problem.
>
> Thanks again for your help!
>
> Best,
>
> Aaron Levin
>
> On Tue, Jan 22, 2019 at 10:47 AM Aaron Levin <aaronle...@stripe.com>
> wrote:
>
>> Hey,
>>
>> Thanks so much for the help! This is awesome. I'll start looking into all
>> of this right away and report back.
>>
>> Best,
>>
>> Aaron Levin
>>
>> On Mon, Jan 21, 2019 at 5:16 PM Ufuk Celebi <u...@apache.org> wrote:
>>
>>> Hey Aaron,
>>>
>>> sorry for the late reply.
>>>
>>> (1) I think I was able to reproduce this issue using snappy-java. I've
>>> filed a ticket here:
>>> https://issues.apache.org/jira/browse/FLINK-11402. Can you check
>>> whether the ticket description is in line with what you are
>>> experiencing? Most importantly, do you see the same Exception being
>>> reported after cancelling and re-starting the job?
>>>
>>> (2) I don't think it's caused by the environment options not being
>>> picked up. You can check the head of the log files of the JobManager
>>> or TaskManager to verify that your provided option is picked up as
>>> expected. You should see something similar to this:
>>>
>>> 2019-01-21 22:53:49,863 INFO
>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>>>
>>> --------------------------------------------------------------------------------
>>> 2019-01-21 22:53:49,864 INFO
>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>>> Starting StandaloneSessionClusterEntrypoint (Version: 1.7.0,
>>> Rev:49da9f9, Date:28.11.2018 @ 17:59:06 UTC)
>>> ...
>>> 2019-01-21 22:53:49,865 INFO
>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM
>>> Options:
>>> 2019-01-21 22:53:49,865 INFO
>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>>> -Xms1024m
>>> 2019-01-21 22:53:49,865 INFO
>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>>> -Xmx1024m
>>> You are looking for this line ----> 2019-01-21 22:53:49,865 INFO
>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>>> -Djava.library.path=/.../org/xerial/snappy/native/Mac/x86_64/ <----
>>> 2019-01-21 22:53:49,865 INFO
>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>>> -Dlog.file=/.../flink-1.7.0/log/flink-standalonesession-0.local.log
>>> ...
>>> 2019-01-21 22:53:49,866 INFO
>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>>> Program Arguments:
>>> 2019-01-21 22:53:49,866 INFO
>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>>> --configDir
>>> 2019-01-21 22:53:49,866 INFO
>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>>> /.../flink-1.7.0/conf
>>> 2019-01-21 22:53:49,866 INFO
>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>>> --executionMode
>>> 2019-01-21 22:53:49,866 INFO
>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>>> cluster
>>> ...
>>> 2019-01-21 22:53:49,866 INFO
>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>>>
>>> --------------------------------------------------------------------------------
>>>
>>> Can you verify that you see the log messages as expected?
>>>
>>> (3) As noted in FLINK-11402, is it possible to package the snappy
>>> library as part of your user code instead of loading the library via
>>> java.library.path? In my example, that seems to work fine (see the
>>> sketch below).
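>>>
>>> For illustration, a minimal sketch of what I mean (assuming you add the
>>> snappy-java dependency, org.xerial.snappy, to your user jar; it ships
>>> its native libraries inside the jar and extracts them at runtime, so no
>>> -Djava.library.path is needed):
>>>
>>> ```
>>> import java.nio.charset.StandardCharsets;
>>>
>>> import org.xerial.snappy.Snappy;
>>>
>>> public class SnappyRoundTrip {
>>>     public static void main(String[] args) throws Exception {
>>>         // snappy-java self-extracts its bundled native library on
>>>         // first use, independent of java.library.path.
>>>         byte[] compressed =
>>>                 Snappy.compress("hello".getBytes(StandardCharsets.UTF_8));
>>>         byte[] restored = Snappy.uncompress(compressed);
>>>         System.out.println(new String(restored, StandardCharsets.UTF_8));
>>>     }
>>> }
>>> ```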
>>>
>>> – Ufuk
>>>
>>> On Thu, Jan 17, 2019 at 5:53 PM Aaron Levin <aaronle...@stripe.com>
>>> wrote:
>>> >
>>> > Hello!
>>> >
>>> > *tl;dr*: settings in `env.java.opts` seem to stop having an effect when
>>> a job is canceled or fails and is then restarted (with or without
>>> savepoints/checkpoints). If I restart the task managers, the
>>> `env.java.opts` settings take effect again and our job runs without
>>> failure. More below.
>>> >
>>> > We consume Snappy-compressed sequence files in our Flink job. This
>>> requires access to the Hadoop native libraries. In our `flink-conf.yaml`
>>> for both the task manager and the job manager, we put:
>>> >
>>> > ```
>>> > env.java.opts: -Djava.library.path=/usr/local/hadoop/lib/native
>>> > ```
>>> >
>>> > If I launch our job on freshly-restarted task managers, the job
>>> operates fine. If at some point I cancel the job, or if the job restarts
>>> for some other reason, the job will begin to crashloop because it tries
>>> to open a Snappy-compressed file but doesn't have access to the codec
>>> from the native Hadoop libraries in `/usr/local/hadoop/lib/native`. If I
>>> then restart the task manager while the job is crashlooping, the job
>>> starts running without any codec failures.
>>> >
>>> > The only cause I can think of for the Snappy failures is that the
>>> `env.java.opts` settings are not being passed through to the job on
>>> restart for some reason.
>>> >
>>> > Does anyone know what's going on? Am I missing some additional
>>> configuration? I really appreciate any help!
>>> >
>>> > About our setup:
>>> >
>>> > - Flink Version: 1.7.0
>>> > - Deployment: Standalone in HA
>>> > - Hadoop/S3 setup: we do *not* set `HADOOP_CLASSPATH`. We use Flink’s
>>> shaded jars to access our files in S3. We do not use the
>>> `bundled-with-hadoop` distribution of Flink.
>>> >
>>> > Best,
>>> >
>>> > Aaron Levin
>>>
>>
