Hi Ufuk,

I'm starting to believe the bug is much deeper than the originally reported error, because putting the libraries in `/usr/lib` or `/lib` does not work. This morning I dug into why putting `libhadoop.so` into `/usr/lib` didn't work, despite that directory being in the `java.library.path` at the call site of the error. I wrote a small standalone program to test the loading of native libraries, and it was able to load `libhadoop.so` successfully. I'm very perplexed. Could this be related to the way Flink shades its Hadoop dependencies?
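One thing I haven't tried yet but plan to (just a sketch, untested, and it assumes the Hadoop classes on my classpath keep their usual `org.apache.hadoop` package after shading) is asking Hadoop's own `NativeCodeLoader` whether it thinks the native library is loaded, since that's the class whose `System.loadLibrary("hadoop")` call is failing inside the job:

```
// Untested sketch of a follow-up check (class/package names are the stock
// Hadoop ones; this assumes Flink's shading does not relocate them).
// NativeCodeLoader calls System.loadLibrary("hadoop") in a static block,
// so this asks Hadoop itself whether that load ever succeeded.
package com.redacted.flink

import org.apache.hadoop.util.NativeCodeLoader

object NativeCheck {
  def main(args: Array[String]): Unit = {
    println(s"java.library.path=${System.getProperty("java.library.path")}")
    println(s"isNativeCodeLoaded=${NativeCodeLoader.isNativeCodeLoaded}")
    // buildSupportsSnappy is a native method: if the static load above
    // failed, this is the call that throws UnsatisfiedLinkError.
    println(s"buildSupportsSnappy=${NativeCodeLoader.buildSupportsSnappy()}")
  }
}
```

That's just an idea for later; what I actually ran today is below.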
Here is my program and its output:

```
$ cat LibTest.scala
package com.redacted.flink

object LibTest {
  def main(args: Array[String]): Unit = {
    val library = args(0)
    System.out.println(s"java.library.path=${System.getProperty("java.library.path")}")
    System.out.println(s"Attempting to load $library")
    System.out.flush()
    System.loadLibrary(library)
    System.out.println(s"Successfully loaded ")
    System.out.flush()
  }
}
```

I then tried running that on one of the task managers with `hadoop` as an argument:

```
$ java -jar lib_test_deploy.jar hadoop
java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
Attempting to load hadoop
Exception in thread "main" java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
        at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
        at java.lang.Runtime.loadLibrary0(Runtime.java:870)
        at java.lang.System.loadLibrary(System.java:1122)
        at com.stripe.flink.LibTest$.main(LibTest.scala:11)
        at com.stripe.flink.LibTest.main(LibTest.scala)
```

I then copied the native libraries into `/usr/lib/` and ran it again:

```
$ sudo cp /usr/local/hadoop/lib/native/libhadoop.so /usr/lib/
$ java -jar lib_test_deploy.jar hadoop
java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
Attempting to load hadoop
Successfully loaded
```

Any ideas?

On Wed, Jan 23, 2019 at 7:13 PM Aaron Levin <aaronle...@stripe.com> wrote:

> Hi Ufuk,
>
> One more update: I tried copying all the hadoop native `.so` files (mainly `libhadoop.so`) into `/lib` and I am still experiencing the issue I reported. I also tried naively adding the `.so` files to the jar with the flink application and am still experiencing the issue I reported (however, I'm going to investigate this further as I might not have done it correctly).
>
> Best,
>
> Aaron Levin
>
> On Wed, Jan 23, 2019 at 3:18 PM Aaron Levin <aaronle...@stripe.com> wrote:
>
>> Hi Ufuk,
>>
>> Two updates:
>>
>> 1. As suggested in the ticket, I naively copied every `.so` in `hadoop-3.0.0/lib/native/` into `/lib/` and this did not seem to help. My knowledge of how shared libs get picked up is hazy, so I'm not sure if blindly copying them like that should work. I did check what `System.getProperty("java.library.path")` returns at the call site and it's:
>> java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
>> 2. The exception I see comes from `hadoop.util.NativeCodeLoader.buildSupportsSnappy` (stack-trace below). This uses `System.loadLibrary("hadoop")`.
>>
>> [2019-01-23 19:52:33.081216] java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z
>> [2019-01-23 19:52:33.081376]   at org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy(Native Method)
>> [2019-01-23 19:52:33.081406]   at org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:63)
>> [2019-01-23 19:52:33.081429]   at org.apache.hadoop.io.compress.SnappyCodec.getDecompressorType(SnappyCodec.java:195)
>> [2019-01-23 19:52:33.081457]   at org.apache.hadoop.io.compress.CodecPool.getDecompressor(CodecPool.java:181)
>> [2019-01-23 19:52:33.081494]   at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:2037)
>> [2019-01-23 19:52:33.081517]   at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1923)
>> [2019-01-23 19:52:33.081549]   at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1872)
>> ... (redacted) ...
>> [2019-01-23 19:52:33.081728]   at scala.collection.immutable.List.foreach(List.scala:392)
>> ... (redacted) ...
>> [2019-01-23 19:52:33.081832]   at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:94)
>> [2019-01-23 19:52:33.081854]   at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:58)
>> [2019-01-23 19:52:33.081882]   at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:99)
>> [2019-01-23 19:52:33.081904]   at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
>> [2019-01-23 19:52:33.081946]   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
>> [2019-01-23 19:52:33.081967]   at java.lang.Thread.run(Thread.java:748)
>>
>> On Tue, Jan 22, 2019 at 2:31 PM Aaron Levin <aaronle...@stripe.com> wrote:
>>
>>> Hey Ufuk,
>>>
>>> So, I looked into this a little bit:
>>>
>>> 1. clarification: my issues are with the hadoop-related snappy libraries and not libsnappy itself (this is my bad for not being clearer, sorry!). I already have `libsnappy` on my classpath, but I am looking into including the hadoop snappy libraries.
>>> 2. exception: I don't see the class loading error. I'm going to try to put some more instrumentation in and see if I can get a clearer stacktrace (right now I get an NPE on closing a sequence file in a finalizer - when I last logged the exception it was something deep in hadoop's snappy libs - I'll get clarification soon).
>>> 3. I'm looking into including hadoop's snappy libs in my jar and we'll see if that resolves the problem.
>>>
>>> Thanks again for your help!
>>>
>>> Best,
>>>
>>> Aaron Levin
>>>
>>> On Tue, Jan 22, 2019 at 10:47 AM Aaron Levin <aaronle...@stripe.com> wrote:
>>>
>>>> Hey,
>>>>
>>>> Thanks so much for the help! This is awesome. I'll start looking into all of this right away and report back.
>>>>
>>>> Best,
>>>>
>>>> Aaron Levin
>>>>
>>>> On Mon, Jan 21, 2019 at 5:16 PM Ufuk Celebi <u...@apache.org> wrote:
>>>>
>>>>> Hey Aaron,
>>>>>
>>>>> sorry for the late reply.
>>>>>
>>>>> (1) I think I was able to reproduce this issue using snappy-java. I've filed a ticket here: https://issues.apache.org/jira/browse/FLINK-11402. Can you check whether the ticket description is in line with what you are experiencing? Most importantly, do you see the same Exception being reported after cancelling and re-starting the job?
>>>>>
>>>>> (2) I don't think it's caused by the environment options not being picked up.
>>>>> You can check the head of the log files of the JobManager or TaskManager to verify that your provided option is picked up as expected. You should see something similar to this:
>>>>>
>>>>> 2019-01-21 22:53:49,863 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - --------------------------------------------------------------------------------
>>>>> 2019-01-21 22:53:49,864 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Starting StandaloneSessionClusterEntrypoint (Version: 1.7.0, Rev:49da9f9, Date:28.11.2018 @ 17:59:06 UTC)
>>>>> ...
>>>>> 2019-01-21 22:53:49,865 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - JVM Options:
>>>>> 2019-01-21 22:53:49,865 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -     -Xms1024m
>>>>> 2019-01-21 22:53:49,865 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -     -Xmx1024m
>>>>> You are looking for this line ----> 2019-01-21 22:53:49,865 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -     -Djava.library.path=/.../org/xerial/snappy/native/Mac/x86_64/ <----
>>>>> 2019-01-21 22:53:49,865 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -     -Dlog.file=/.../flink-1.7.0/log/flink-standalonesession-0.local.log
>>>>> ...
>>>>> 2019-01-21 22:53:49,866 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Program Arguments:
>>>>> 2019-01-21 22:53:49,866 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -     --configDir
>>>>> 2019-01-21 22:53:49,866 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -     /.../flink-1.7.0/conf
>>>>> 2019-01-21 22:53:49,866 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -     --executionMode
>>>>> 2019-01-21 22:53:49,866 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -     cluster
>>>>> ...
>>>>> 2019-01-21 22:53:49,866 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - --------------------------------------------------------------------------------
>>>>>
>>>>> Can you verify that you see the log messages as expected?
>>>>>
>>>>> (3) As noted in FLINK-11402, is it possible to package the snappy library as part of your user code instead of loading the library via java.library.path? In my example, that seems to work fine.
>>>>>
>>>>> – Ufuk
>>>>>
>>>>> On Thu, Jan 17, 2019 at 5:53 PM Aaron Levin <aaronle...@stripe.com> wrote:
>>>>> >
>>>>> > Hello!
>>>>> >
>>>>> > *tl;dr*: settings in `env.java.opts` seem to stop having impact when a job is canceled or fails and then is restarted (with or without savepoint/checkpoints). If I restart the task-managers, the `env.java.opts` seem to start having impact again and our job will run without failure. More below.
>>>>> >
>>>>> > We consume Snappy-compressed sequence files in our flink job. This requires access to the hadoop native libraries. In our `flink-conf.yaml` for both the task manager and the job manager, we put:
>>>>> >
>>>>> > ```
>>>>> > env.java.opts: -Djava.library.path=/usr/local/hadoop/lib/native
>>>>> > ```
>>>>> >
>>>>> > If I launch our job on freshly-restarted task managers, the job operates fine.
>>>>> > If at some point I cancel the job, or if the job restarts for some other reason, the job will begin to crashloop because it tries to open a Snappy-compressed file but doesn't have access to the codec from the native hadoop libraries in `/usr/local/hadoop/lib/native`. If I then restart the task manager while the job is crashlooping, the job starts running without any codec failures.
>>>>> >
>>>>> > The only reason I can conjure that would cause the Snappy compression to fail is if the `env.java.opts` were not being passed through to the job on restart for some reason.
>>>>> >
>>>>> > Does anyone know what's going on? Am I missing some additional configuration? I really appreciate any help!
>>>>> >
>>>>> > About our setup:
>>>>> >
>>>>> > - Flink Version: 1.7.0
>>>>> > - Deployment: Standalone in HA
>>>>> > - Hadoop/S3 setup: we do *not* set `HADOOP_CLASSPATH`. We use Flink’s shaded jars to access our files in S3. We do not use the `bundled-with-hadoop` distribution of Flink.
>>>>> >
>>>>> > Best,
>>>>> >
>>>>> > Aaron Levin
>>>>>
>>>>
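P.S. Since I said above that I may not have added the `.so` files to the jar correctly, this is roughly the shape of what I tried (a sketch only; the resource path and helper name are made up, not what our job actually contains). `System.loadLibrary` can't read a library from inside a jar, so the file has to be copied out to disk and loaded with `System.load`:

```
// Sketch only: "/native/libhadoop.so" and this helper are illustrative.
import java.nio.file.{Files, StandardCopyOption}

object BundledNativeLoader {
  def loadFromJar(resource: String): Unit = {
    val in = getClass.getResourceAsStream(resource)
    require(in != null, s"$resource not found on the classpath")
    try {
      // System.loadLibrary can't see into the jar, so copy the .so out
      // to a temp file and load it by absolute path instead.
      val tmp = Files.createTempFile("libhadoop", ".so")
      Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING)
      System.load(tmp.toAbsolutePath.toString)
    } finally {
      in.close()
    }
  }
}

// e.g. early in the job: BundledNativeLoader.loadFromJar("/native/libhadoop.so")
```

Even if something like this works, my understanding is that Hadoop's `NativeCodeLoader` still runs its own `System.loadLibrary("hadoop")` in a static initializer, so it may not pick up a library loaded this way - which could be why this route didn't help for me.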