Hey Ufuk,

So, I looked into this a little bit:

1. Clarification: my issues are with the Hadoop-related snappy libraries,
not libsnappy itself (my bad for not being clearer, sorry!). I already
have `libsnappy` on my classpath, but I'm looking into including the
Hadoop snappy libraries as well.
2. Exception: I don't see the class loading error. I'm going to add some
more instrumentation and see if I can get a clearer stack trace (right
now I get an NPE on closing a sequence file in a finalizer; when I last
logged the exception it was something deep in Hadoop's snappy libs; I'll
get clarification soon).
3. I'm looking into including Hadoop's snappy libs in my jar; we'll see
if that resolves the problem.
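For (2), the instrumentation I have in mind is roughly the following (just a
sketch; the class name is made up, and the reflective NativeCodeLoader check
is only there so it degrades gracefully when hadoop-common isn't on the
classpath):

```java
// Diagnostic sketch: log where this JVM looks for native libraries and
// whether Hadoop's native code loader reports success.
public class NativeLibCheck {
    public static void main(String[] args) {
        // The path the JVM will search for .so files (set via env.java.opts).
        System.out.println("java.library.path = "
                + System.getProperty("java.library.path"));
        try {
            // Use reflection so this compiles and runs without hadoop-common.
            Class<?> ncl = Class.forName("org.apache.hadoop.util.NativeCodeLoader");
            boolean loaded = (Boolean) ncl.getMethod("isNativeCodeLoaded").invoke(null);
            System.out.println("hadoop native code loaded: " + loaded);
        } catch (ReflectiveOperationException e) {
            // hadoop-common is not visible to this classloader.
            System.out.println("NativeCodeLoader not available: " + e);
        }
    }
}
```

I'd call this from the rich function's `open()` so it logs once per task,
before any sequence file is opened, and once more after a restart.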

Thanks again for your help!

Best,

Aaron Levin

On Tue, Jan 22, 2019 at 10:47 AM Aaron Levin <aaronle...@stripe.com> wrote:

> Hey,
>
> Thanks so much for the help! This is awesome. I'll start looking into all
> of this right away and report back.
>
> Best,
>
> Aaron Levin
>
> On Mon, Jan 21, 2019 at 5:16 PM Ufuk Celebi <u...@apache.org> wrote:
>
>> Hey Aaron,
>>
>> sorry for the late reply.
>>
>> (1) I think I was able to reproduce this issue using snappy-java. I've
>> filed a ticket here:
>> https://issues.apache.org/jira/browse/FLINK-11402. Can you check
>> whether the ticket description is in line with what you are
>> experiencing? Most importantly, do you see the same exception being
>> reported after cancelling and restarting the job?
>>
>> (2) I don't think it's caused by the environment options not being
>> picked up. You can check the head of the log files of the JobManager
>> or TaskManager to verify that your provided option is picked up as
>> expected. You should see something similar to this:
>>
>> 2019-01-21 22:53:49,863 INFO
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>>
>> --------------------------------------------------------------------------------
>> 2019-01-21 22:53:49,864 INFO
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>> Starting StandaloneSessionClusterEntrypoint (Version: 1.7.0,
>> Rev:49da9f9, Date:28.11.2018 @ 17:59:06 UTC)
>> ...
>> 2019-01-21 22:53:49,865 INFO
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM
>> Options:
>> 2019-01-21 22:53:49,865 INFO
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>> -Xms1024m
>> 2019-01-21 22:53:49,865 INFO
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>> -Xmx1024m
>> You are looking for this line ----> 2019-01-21 22:53:49,865 INFO
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>> -Djava.library.path=/.../org/xerial/snappy/native/Mac/x86_64/ <----
>> 2019-01-21 22:53:49,865 INFO
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>> -Dlog.file=/.../flink-1.7.0/log/flink-standalonesession-0.local.log
>> ...
>> 2019-01-21 22:53:49,866 INFO
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>> Program Arguments:
>> 2019-01-21 22:53:49,866 INFO
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>> --configDir
>> 2019-01-21 22:53:49,866 INFO
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>> /.../flink-1.7.0/conf
>> 2019-01-21 22:53:49,866 INFO
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>> --executionMode
>> 2019-01-21 22:53:49,866 INFO
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>> cluster
>> ...
>> 2019-01-21 22:53:49,866 INFO
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>>
>> --------------------------------------------------------------------------------
>>
>> Can you verify that you see the log messages as expected?
>>
>> (3) As noted in FLINK-11402, is it possible to package the snappy
>> library as part of your user code instead of loading it via
>> java.library.path? In my example, that seems to work fine.
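>>
>> For example, with snappy-java the native libraries ship inside the jar
>> itself, so a dependency along these lines in the user jar should be
>> enough (the version shown is only an example):
>>
>> ```xml
>> <dependency>
>>   <groupId>org.xerial.snappy</groupId>
>>   <artifactId>snappy-java</artifactId>
>>   <version>1.1.7.2</version>
>> </dependency>
>> ```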
>>
>> – Ufuk
>>
>> On Thu, Jan 17, 2019 at 5:53 PM Aaron Levin <aaronle...@stripe.com>
>> wrote:
>> >
>> > Hello!
>> >
>> > *tl;dr*: settings in `env.java.opts` seem to stop taking effect when a
>> job is canceled or fails and is then restarted (with or without
>> savepoints/checkpoints). If I restart the task managers, the `env.java.opts`
>> settings take effect again and our job runs without failure. More below.
>> >
>> > We consume Snappy-compressed sequence files in our Flink job. This
>> requires access to the Hadoop native libraries. In `flink-conf.yaml` for
>> both the task manager and the job manager, we put:
>> >
>> > ```
>> > env.java.opts: -Djava.library.path=/usr/local/hadoop/lib/native
>> > ```
>> >
>> > If I launch our job on freshly restarted task managers, the job
>> operates fine. If at some point I cancel the job, or if it restarts for
>> some other reason, it begins to crashloop because it tries to open a
>> Snappy-compressed file but no longer has access to the codec from the
>> native Hadoop libraries in `/usr/local/hadoop/lib/native`. If I then
>> restart the task managers while the job is crashlooping, the job starts
>> running again without any codec failures.
>> >
>> > The only cause I can think of that would make the Snappy codec fail
>> like this is that `env.java.opts` is, for some reason, not being passed
>> through to the job on restart.
>> >
>> > Does anyone know what's going on? Am I missing some additional
>> configuration? I really appreciate any help!
>> >
>> > About our setup:
>> >
>> > - Flink Version: 1.7.0
>> > - Deployment: Standalone in HA
>> > - Hadoop/S3 setup: we do *not* set `HADOOP_CLASSPATH`. We use Flink’s
>> shaded jars to access our files in S3. We do not use the
>> `bundled-with-hadoop` distribution of Flink.
>> >
>> > Best,
>> >
>> > Aaron Levin
>>
>
