Hey,

Thanks so much for the help! This is awesome. I'll start looking into all
of this right away and report back.

Best,

Aaron Levin

On Mon, Jan 21, 2019 at 5:16 PM Ufuk Celebi <u...@apache.org> wrote:

> Hey Aaron,
>
> sorry for the late reply.
>
> (1) I think I was able to reproduce this issue using snappy-java. I've
> filed a ticket here:
> https://issues.apache.org/jira/browse/FLINK-11402. Can you check whether
> the ticket description is in line with what you are
> experiencing? Most importantly, do you see the same Exception being
> reported after cancelling and re-starting the job?
>
> (2) I don't think it's caused by the environment options not being
> picked up. You can check the head of the log files of the JobManager
> or TaskManager to verify that your provided option is picked up as
> expected. You should see something similar to this:
>
> 2019-01-21 22:53:49,863 INFO
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>
> --------------------------------------------------------------------------------
> 2019-01-21 22:53:49,864 INFO
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
> Starting StandaloneSessionClusterEntrypoint (Version: 1.7.0,
> Rev:49da9f9, Date:28.11.2018 @ 17:59:06 UTC)
> ...
> 2019-01-21 22:53:49,865 INFO
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM
> Options:
> 2019-01-21 22:53:49,865 INFO
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
> -Xms1024m
> 2019-01-21 22:53:49,865 INFO
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
> -Xmx1024m
> You are looking for this line ----> 2019-01-21 22:53:49,865 INFO
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
> -Djava.library.path=/.../org/xerial/snappy/native/Mac/x86_64/ <----
> 2019-01-21 22:53:49,865 INFO
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
> -Dlog.file=/.../flink-1.7.0/log/flink-standalonesession-0.local.log
> ...
> 2019-01-21 22:53:49,866 INFO
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
> Program Arguments:
> 2019-01-21 22:53:49,866 INFO
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
> --configDir
> 2019-01-21 22:53:49,866 INFO
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
> /.../flink-1.7.0/conf
> 2019-01-21 22:53:49,866 INFO
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
> --executionMode
> 2019-01-21 22:53:49,866 INFO
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
> cluster
> ...
> 2019-01-21 22:53:49,866 INFO
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>
> --------------------------------------------------------------------------------
>
> Can you verify that you see the log messages as expected?
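[A quick way to do that check from the shell. This is a sketch: the log path below is a placeholder, so point `LOG` at your actual JobManager or TaskManager log file.]

```shell
# Sketch: confirm the JVM option from env.java.opts reached the process.
# LOG is a hypothetical path -- replace it with your real Flink log file.
LOG="${LOG:-/tmp/flink-standalonesession-0.local.log}"
if grep -q -- "-Djava.library.path" "$LOG" 2>/dev/null; then
  echo "java.library.path option was picked up"
else
  echo "option not found: check env.java.opts in flink-conf.yaml"
fi
```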
>
> (3) As noted in FLINK-11402, is it possible to package the snappy library
> as part of your user code instead of loading the library via
> java.library.path? In my example, that seems to work fine.
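[For reference: snappy-java ships its native libraries inside the jar and extracts them at runtime, which is why bundling it with the user code can sidestep `java.library.path`. A sketch of what that might look like in a Maven build; the version shown is only an example, not a recommendation.]

```xml
<!-- Sketch: bundle snappy-java with the user jar instead of relying on
     java.library.path; the version here is only an example. -->
<dependency>
  <groupId>org.xerial.snappy</groupId>
  <artifactId>snappy-java</artifactId>
  <version>1.1.7.2</version>
</dependency>
```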
>
> – Ufuk
>
> On Thu, Jan 17, 2019 at 5:53 PM Aaron Levin <aaronle...@stripe.com> wrote:
> >
> > Hello!
> >
> > *tl;dr*: settings in `env.java.opts` seem to stop taking effect when a
> job is canceled or fails and is then restarted (with or without a
> savepoint/checkpoint). If I restart the task managers, the `env.java.opts`
> settings take effect again and our job runs without failure.
> More below.
> >
> > We consume Snappy-compressed sequence files in our Flink job. This
> requires access to the hadoop native libraries. In our `flink-conf.yaml`
> for both the task manager and the job manager, we put:
> >
> > ```
> > env.java.opts: -Djava.library.path=/usr/local/hadoop/lib/native
> > ```
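[One way to sanity-check that a live TaskManager or JobManager JVM actually received this flag. A sketch using standard tools; there is no Flink-specific filter here, so adjust it for your deployment.]

```shell
# Sketch: look for the java.library.path flag on running JVM processes.
# Prints matching command lines, or a fallback message if none are found.
ps axww -o command | grep -- "-Djava.library.path" | grep -v grep \
  || echo "no running JVM with that flag found"
```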
> >
> > If I launch our job on freshly-restarted task managers, the job operates
> fine. If at some point I cancel the job or if the job restarts for some
> other reason, the job will begin to crashloop because it tries to open a
> Snappy-compressed file but doesn't have access to the codec from the native
> hadoop libraries in `/usr/local/hadoop/lib/native`. If I then restart the
> task manager while the job is crashlooping, the job starts running
> without any codec failures.
> >
> > The only explanation I can think of for the Snappy compression failing
> is that the `env.java.opts` setting is not being passed through to the job on
> restart for some reason.
> >
> > Does anyone know what's going on? Am I missing some additional
> configuration? I really appreciate any help!
> >
> > About our setup:
> >
> > - Flink Version: 1.7.0
> > - Deployment: Standalone in HA
> > - Hadoop/S3 setup: we do *not* set `HADOOP_CLASSPATH`. We use Flink’s
> shaded jars to access our files in S3. We do not use the
> `bundled-with-hadoop` distribution of Flink.
> >
> > Best,
> >
> > Aaron Levin
>
