Hey, thanks so much for the help! This is awesome. I'll start looking into all of this right away and report back.
Best,

Aaron Levin

On Mon, Jan 21, 2019 at 5:16 PM Ufuk Celebi <u...@apache.org> wrote:
> Hey Aaron,
>
> sorry for the late reply.
>
> (1) I think I was able to reproduce this issue using snappy-java. I've
> filed a ticket here: https://issues.apache.org/jira/browse/FLINK-11402.
> Can you check the ticket description whether it's in line with what
> you are experiencing? Most importantly, do you see the same Exception
> being reported after cancelling and re-starting the job?
>
> (2) I don't think it's caused by the environment options not being
> picked up. You can check the head of the log files of the JobManager
> or TaskManager to verify that your provided option is picked up as
> expected. You should see something similar to this:
>
> 2019-01-21 22:53:49,863 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - --------------------------------------------------------------------------------
> 2019-01-21 22:53:49,864 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -  Starting StandaloneSessionClusterEntrypoint (Version: 1.7.0, Rev:49da9f9, Date:28.11.2018 @ 17:59:06 UTC)
> ...
> 2019-01-21 22:53:49,865 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -  JVM Options:
> 2019-01-21 22:53:49,865 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -     -Xms1024m
> 2019-01-21 22:53:49,865 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -     -Xmx1024m
> You are looking for this line ----> 2019-01-21 22:53:49,865 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -     -Djava.library.path=/.../org/xerial/snappy/native/Mac/x86_64/ <----
> 2019-01-21 22:53:49,865 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -     -Dlog.file=/.../flink-1.7.0/log/flink-standalonesession-0.local.log
> ...
> 2019-01-21 22:53:49,866 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -  Program Arguments:
> 2019-01-21 22:53:49,866 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -     --configDir
> 2019-01-21 22:53:49,866 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -     /.../flink-1.7.0/conf
> 2019-01-21 22:53:49,866 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -     --executionMode
> 2019-01-21 22:53:49,866 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -     cluster
> ...
> 2019-01-21 22:53:49,866 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - --------------------------------------------------------------------------------
>
> Can you verify that you see the log messages as expected?
>
> (3) As noted in FLINK-11402, is it possible to package the snappy library
> as part of your user code instead of loading the library via
> java.library.path? In my example, that seems to work fine.
>
> – Ufuk
>
> On Thu, Jan 17, 2019 at 5:53 PM Aaron Levin <aaronle...@stripe.com> wrote:
> >
> > Hello!
> >
> > *tl;dr*: settings in `env.java.opts` seem to stop having impact when a
> > job is canceled or fails and then is restarted (with or without
> > savepoints/checkpoints). If I restart the task managers, the
> > `env.java.opts` seem to start having impact again and our job will run
> > without failure. More below.
> >
> > We consume Snappy-compressed sequence files in our Flink job. This
> > requires access to the Hadoop native libraries. In our `flink-conf.yaml`
> > for both the task manager and the job manager, we put:
> >
> > ```
> > env.java.opts: -Djava.library.path=/usr/local/hadoop/lib/native
> > ```
> >
> > If I launch our job on freshly restarted task managers, the job
> > operates fine.
> > If at some point I cancel the job, or if the job restarts for some
> > other reason, the job begins to crashloop because it tries to open a
> > Snappy-compressed file but doesn't have access to the codec from the
> > native Hadoop libraries in `/usr/local/hadoop/lib/native`. If I then
> > restart the task managers while the job is crashlooping, the job starts
> > running without any codec failures.
> >
> > The only explanation I can think of for the Snappy decompression
> > failing is that `env.java.opts` is not being passed through to the job
> > on restart for some reason.
> >
> > Does anyone know what's going on? Am I missing some additional
> > configuration? I really appreciate any help!
> >
> > About our setup:
> >
> > - Flink Version: 1.7.0
> > - Deployment: Standalone in HA
> > - Hadoop/S3 setup: we do *not* set `HADOOP_CLASSPATH`. We use Flink's
> > shaded jars to access our files in S3. We do not use the
> > `bundled-with-hadoop` distribution of Flink.
> >
> > Best,
> >
> > Aaron Levin
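[Editor's note: for readers following this thread, a minimal sketch of the verification step Ufuk describes in point (2), done from inside the JVM rather than from the log head. The class and method names here are hypothetical, not from the thread; the snippet only reads a standard system property, so it can be dropped into a job's `open()` method or run standalone to compare the effective `java.library.path` before and after a restart.]

```java
// Hypothetical debugging helper (not part of Flink or this thread):
// prints the JVM's effective java.library.path so you can check whether
// a value passed via env.java.opts actually reached this process.
public class LibraryPathCheck {

    // Returns the current java.library.path, or "" if the property is unset.
    public static String libraryPath() {
        return System.getProperty("java.library.path", "");
    }

    public static void main(String[] args) {
        System.out.println("java.library.path = " + libraryPath());
        // If the directory you set in flink-conf.yaml (e.g.
        // /usr/local/hadoop/lib/native) is missing from this output after a
        // job restart, the option was not applied to the restarted JVM.
    }
}
```

Logging this once from each TaskManager, before and after a cancel/restart cycle, would distinguish "the JVM lost the option" from "the option is set but the native codec still fails to load".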