Hi Ufuk,

One more update: I tried copying all the Hadoop native `.so` files (mainly `libhadoop.so`) into `/lib` and I am still experiencing the issue I reported. I also tried naively adding the `.so` files to the jar with the flink application and am still experiencing the issue I reported (however, I'm going to investigate this further as I might not have done it correctly).
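For reference, the pattern I was attempting for the jar approach looks roughly like the sketch below: extract the bundled `.so` to a temp file and `System.load()` it by absolute path, which is more or less what snappy-java does internally. The resource path and class name here are made up for illustration, and I'm not yet sure how this interacts with Flink's user-code classloading:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public final class BundledNativeLoader {
    private BundledNativeLoader() {}

    /** Extracts a jar-bundled libhadoop.so and loads it by absolute path. */
    public static void loadHadoopNative() throws Exception {
        // Assumption: the jar was built with libhadoop.so under /native/.
        try (InputStream in = BundledNativeLoader.class
                .getResourceAsStream("/native/libhadoop.so")) {
            if (in == null) {
                throw new IllegalStateException("libhadoop.so not found in jar");
            }
            Path tmp = Files.createTempFile("libhadoop", ".so");
            tmp.toFile().deleteOnExit();
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            // System.load() takes an absolute path, so java.library.path
            // is not consulted at all here.
            System.load(tmp.toAbsolutePath().toString());
        }
    }
}
```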
Best,

Aaron Levin

On Wed, Jan 23, 2019 at 3:18 PM Aaron Levin <aaronle...@stripe.com> wrote:

> Hi Ufuk,
>
> Two updates:
>
> 1. As suggested in the ticket, I naively copied every `.so` in
> `hadoop-3.0.0/lib/native/` into `/lib/` and this did not seem to help. My
> knowledge of how shared libs get picked up is hazy, so I'm not sure whether
> blindly copying them like that should work. I did check what
> `System.getProperty("java.library.path")` returns at the call site, and it's:
> java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
> 2. The exception I see comes from
> `hadoop.util.NativeCodeLoader.buildSupportsSnappy` (stack trace below).
> This uses `System.loadLibrary("hadoop")`.
>
> [2019-01-23 19:52:33.081216] java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z
> [2019-01-23 19:52:33.081376] at org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy(Native Method)
> [2019-01-23 19:52:33.081406] at org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:63)
> [2019-01-23 19:52:33.081429] at org.apache.hadoop.io.compress.SnappyCodec.getDecompressorType(SnappyCodec.java:195)
> [2019-01-23 19:52:33.081457] at org.apache.hadoop.io.compress.CodecPool.getDecompressor(CodecPool.java:181)
> [2019-01-23 19:52:33.081494] at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:2037)
> [2019-01-23 19:52:33.081517] at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1923)
> [2019-01-23 19:52:33.081549] at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1872)
> ... (redacted) ...
> [2019-01-23 19:52:33.081728] at scala.collection.immutable.List.foreach(List.scala:392)
> ... (redacted) ...
> [2019-01-23 19:52:33.081832] at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:94)
> [2019-01-23 19:52:33.081854] at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:58)
> [2019-01-23 19:52:33.081882] at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:99)
> [2019-01-23 19:52:33.081904] at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
> [2019-01-23 19:52:33.081946] at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
> [2019-01-23 19:52:33.081967] at java.lang.Thread.run(Thread.java:748)
>
> On Tue, Jan 22, 2019 at 2:31 PM Aaron Levin <aaronle...@stripe.com> wrote:
>
>> Hey Ufuk,
>>
>> So, I looked into this a little bit:
>>
>> 1. Clarification: my issues are with the Hadoop-related snappy libraries
>> and not libsnappy itself (this is my bad for not being clearer, sorry!). I
>> already have `libsnappy` on my classpath, but I am looking into including
>> the Hadoop snappy libraries.
>> 2. Exception: I don't see the class-loading error. I'm going to add some
>> more instrumentation and see if I can get a clearer stack trace (right now
>> I get an NPE on closing a sequence file in a finalizer; when I last logged
>> the exception it was something deep in Hadoop's snappy libs. I'll get
>> clarification soon).
>> 3. I'm looking into including Hadoop's snappy libs in my jar and we'll
>> see if that resolves the problem.
>>
>> Thanks again for your help!
>>
>> Best,
>>
>> Aaron Levin
>>
>> On Tue, Jan 22, 2019 at 10:47 AM Aaron Levin <aaronle...@stripe.com> wrote:
>>
>>> Hey,
>>>
>>> Thanks so much for the help! This is awesome. I'll start looking into
>>> all of this right away and report back.
>>>
>>> Best,
>>>
>>> Aaron Levin
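For anyone following along, my reading of the loader behind that stack trace (simplified from the Hadoop sources, so treat it as a sketch): `NativeCodeLoader` binds `libhadoop` in a static initializer, so the binding is attempted at most once per classloader, and a failed bind is swallowed there and only surfaces later as exactly this `UnsatisfiedLinkError` on the native methods. The shape of that logic:

```java
// Simplified sketch of org.apache.hadoop.util.NativeCodeLoader's
// loading logic (see the Hadoop sources for the real thing).
public class NativeCodeLoaderSketch {
    private static boolean nativeCodeLoaded = false;

    static {
        try {
            // Resolved against java.library.path, at most once per
            // classloader, when this class is first initialized.
            System.loadLibrary("hadoop");
            nativeCodeLoaded = true;
        } catch (Throwable t) {
            // Swallowed: Hadoop falls back to pure-Java codecs, and any
            // later call into a native method such as
            // buildSupportsSnappy() surfaces as UnsatisfiedLinkError.
        }
    }

    public static boolean isNativeCodeLoaded() {
        return nativeCodeLoaded;
    }
}
```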
>>> On Mon, Jan 21, 2019 at 5:16 PM Ufuk Celebi <u...@apache.org> wrote:
>>>
>>>> Hey Aaron,
>>>>
>>>> sorry for the late reply.
>>>>
>>>> (1) I think I was able to reproduce this issue using snappy-java. I've
>>>> filed a ticket here: https://issues.apache.org/jira/browse/FLINK-11402.
>>>> Can you check the ticket description to see whether it's in line with
>>>> what you are experiencing? Most importantly, do you see the same
>>>> exception being reported after cancelling and re-starting the job?
>>>>
>>>> (2) I don't think it's caused by the environment options not being
>>>> picked up. You can check the head of the JobManager or TaskManager log
>>>> files to verify that your provided option is picked up as expected. You
>>>> should see something similar to this:
>>>>
>>>> 2019-01-21 22:53:49,863 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --------------------------------------------------------------------------------
>>>> 2019-01-21 22:53:49,864 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Starting StandaloneSessionClusterEntrypoint (Version: 1.7.0, Rev:49da9f9, Date:28.11.2018 @ 17:59:06 UTC)
>>>> ...
>>>> 2019-01-21 22:53:49,865 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JVM Options:
>>>> 2019-01-21 22:53:49,865 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Xms1024m
>>>> 2019-01-21 22:53:49,865 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Xmx1024m
>>>> You are looking for this line ----> 2019-01-21 22:53:49,865 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Djava.library.path=/.../org/xerial/snappy/native/Mac/x86_64/ <----
>>>> 2019-01-21 22:53:49,865 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dlog.file=/.../flink-1.7.0/log/flink-standalonesession-0.local.log
>>>> ...
>>>> 2019-01-21 22:53:49,866 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Program Arguments:
>>>> 2019-01-21 22:53:49,866 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --configDir
>>>> 2019-01-21 22:53:49,866 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - /.../flink-1.7.0/conf
>>>> 2019-01-21 22:53:49,866 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --executionMode
>>>> 2019-01-21 22:53:49,866 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - cluster
>>>> ...
>>>> 2019-01-21 22:53:49,866 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --------------------------------------------------------------------------------
>>>>
>>>> Can you verify that you see the log messages as expected?
>>>>
>>>> (3) As noted in FLINK-11402, is it possible to package the snappy
>>>> library as part of your user code instead of loading the library via
>>>> java.library.path? In my example, that seems to work fine.
>>>>
>>>> – Ufuk
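Side note on (1) and (3): if FLINK-11402 covers what I think it does (I haven't verified this against the ticket), the relevant JVM rule is that JNI will bind a given native library to at most one live classloader. That would explain why a cancel/restart, which gives the job a fresh user-code classloader in the same task manager JVM, trips this. A hypothetical reproduction sketch; the jar path is a placeholder, and it assumes the listed jar carries hadoop-common plus its dependencies:

```java
import java.net.URL;
import java.net.URLClassLoader;

public class TwoLoaders {
    public static void main(String[] args) throws Exception {
        // Placeholder path; assumes hadoop-common and its deps are inside.
        URL[] cp = { new URL("file:/path/to/user-code.jar") };
        try (URLClassLoader a = new URLClassLoader(cp, null);
             URLClassLoader b = new URLClassLoader(cp, null)) {
            // With a correct -Djava.library.path, the first loader binds
            // libhadoop; the second cannot, because JNI refuses to bind
            // one .so into two live classloaders. NativeCodeLoader
            // swallows that failure, so the second loader just reports
            // nativeCodeLoaded=false (and its native methods would throw
            // UnsatisfiedLinkError, as in the stack trace above).
            for (ClassLoader cl : new ClassLoader[] { a, b }) {
                Class<?> k = Class.forName(
                        "org.apache.hadoop.util.NativeCodeLoader", true, cl);
                System.out.println(cl + " -> nativeCodeLoaded="
                        + k.getMethod("isNativeCodeLoaded").invoke(null));
            }
        }
    }
}
```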
>>>> On Thu, Jan 17, 2019 at 5:53 PM Aaron Levin <aaronle...@stripe.com> wrote:
>>>> >
>>>> > Hello!
>>>> >
>>>> > *tl;dr*: settings in `env.java.opts` seem to stop having impact when
>>>> > a job is canceled or fails and is then restarted (with or without
>>>> > savepoints/checkpoints). If I restart the task managers, the
>>>> > `env.java.opts` settings take effect again and our job runs without
>>>> > failure. More below.
>>>> >
>>>> > We consume Snappy-compressed sequence files in our Flink job. This
>>>> > requires access to the Hadoop native libraries. In our
>>>> > `flink-conf.yaml` for both the task manager and the job manager, we put:
>>>> >
>>>> > ```
>>>> > env.java.opts: -Djava.library.path=/usr/local/hadoop/lib/native
>>>> > ```
>>>> >
>>>> > If I launch our job on freshly restarted task managers, the job
>>>> > operates fine. If at some point I cancel the job, or if the job
>>>> > restarts for some other reason, it begins to crashloop because it
>>>> > tries to open a Snappy-compressed file but doesn't have access to the
>>>> > codec from the native Hadoop libraries in
>>>> > `/usr/local/hadoop/lib/native`. If I then restart the task manager
>>>> > while the job is crashlooping, the job starts running without any
>>>> > codec failures.
>>>> >
>>>> > The only cause I can conjure for the Snappy decompression failing is
>>>> > that `env.java.opts` is not being passed through to the job on
>>>> > restart for some reason.
>>>> >
>>>> > Does anyone know what's going on? Am I missing some additional
>>>> > configuration? I really appreciate any help!
>>>> >
>>>> > About our setup:
>>>> >
>>>> > - Flink Version: 1.7.0
>>>> > - Deployment: Standalone in HA
>>>> > - Hadoop/S3 setup: we do *not* set `HADOOP_CLASSPATH`. We use Flink's
>>>> > shaded jars to access our files in S3. We do not use the
>>>> > `bundled-with-hadoop` distribution of Flink.
>>>> >
>>>> > Best,
>>>> >
>>>> > Aaron Levin
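P.S. In case it helps anyone who finds this thread later: a hypothetical diagnostic (not something from the thread above) that can be dropped into any rich function to log, on every task (re)start, whether the current classloader can actually use the Hadoop native loader:

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.hadoop.util.NativeCodeLoader;

// Pass-through mapper whose open() runs on every task (re)start, so it
// shows what the freshly created user-code classloader actually resolves.
public class NativeCheckMapper extends RichMapFunction<String, String> {
    @Override
    public void open(Configuration parameters) {
        System.out.println("java.library.path="
                + System.getProperty("java.library.path"));
        System.out.println("NativeCodeLoader loaded by "
                + NativeCodeLoader.class.getClassLoader()
                + ", nativeCodeLoaded="
                + NativeCodeLoader.isNativeCodeLoaded());
    }

    @Override
    public String map(String value) {
        return value;
    }
}
```

If `nativeCodeLoaded` flips from true on the first run to false after a cancel/restart while `java.library.path` stays the same, that would point at the classloader binding rather than at `env.java.opts` being dropped.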