Michal Danko created ARROW-2113:
-----------------------------------

             Summary: [Python] Connect to hdfs failing with "pyarrow.lib.ArrowIOError: HDFS connection failed"
                 Key: ARROW-2113
                 URL: https://issues.apache.org/jira/browse/ARROW-2113
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.8.0
         Environment: Linux Redhat 7.4, Anaconda 4.4.7, Python 2.7.12, CDH 5.13.1
            Reporter: Michal Danko

Steps to replicate the issue:

mkdir /tmp/test
cd /tmp/test
mkdir jars
cd jars
touch test1.jar
mkdir -p ../lib/zookeeper
cd ../lib/zookeeper
ln -s ../../jars/test1.jar ./test1.jar
ln -s test1.jar test.jar
mkdir -p ../hadoop/lib
cd ../hadoop/lib
ln -s ../../../lib/zookeeper/test.jar ./test.jar

export CLASSPATH="/tmp/test/lib/hadoop/lib/test.jar"
python
import pyarrow.hdfs as hdfs; fs = hdfs.connect(user="hdfs")

Ends with error:

------------
loadFileSystems error:
(unable to get root cause for java.lang.NoClassDefFoundError)
(unable to get stack trace for java.lang.NoClassDefFoundError)
hdfsBuilderConnect(forceNewInstance=0, nn=default, port=0, kerbTicketCachePath=(NULL), userName=pa) error:
(unable to get root cause for java.lang.NoClassDefFoundError)
(unable to get stack trace for java.lang.NoClassDefFoundError)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 170, in connect
    kerb_ticket=kerb_ticket, driver=driver)
  File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 37, in __init__
    self._connect(host, port, user, kerb_ticket, driver)
  File "pyarrow/io-hdfs.pxi", line 87, in pyarrow.lib.HadoopFileSystem._connect (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:61673)
  File "pyarrow/error.pxi", line 79, in pyarrow.lib.check_status (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:8345)
pyarrow.lib.ArrowIOError: HDFS connection failed
-------------

export CLASSPATH="/tmp/test/lib/zookeeper/test.jar"
python
import pyarrow.hdfs as hdfs; fs = hdfs.connect(user="hdfs")

Works properly.

I can't find a reason why the first CLASSPATH fails and the second one works: both point to the same .jar, and the first path just goes through one extra symlink. To me, it looks like pyarrow.lib.check_status has a problem with symlinks whose targets contain several ../ components.

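To make the comparison between the two CLASSPATH values concrete, here is a minimal Python sketch that rebuilds the same symlink layout in a temporary directory and resolves both paths with os.path.realpath. The helper names build_symlink_layout and resolve_classpath are hypothetical and are not part of pyarrow. The sketch only shows that both entries point at the same file on disk; it does not explain why the JVM fails to load classes through the first one.

```python
import os
import shutil
import tempfile


def build_symlink_layout(root):
    """Recreate the directories and symlinks from the steps above under `root`.

    Returns (failing_path, working_path), i.e. the two CLASSPATH entries
    from the report, relocated under `root` instead of /tmp/test.
    """
    jars = os.path.join(root, "jars")
    zookeeper = os.path.join(root, "lib", "zookeeper")
    hadoop_lib = os.path.join(root, "lib", "hadoop", "lib")
    for directory in (jars, zookeeper, hadoop_lib):
        os.makedirs(directory)

    # touch test1.jar
    open(os.path.join(jars, "test1.jar"), "w").close()

    # Same relative link targets as the `ln -s` commands in the report.
    os.symlink("../../jars/test1.jar", os.path.join(zookeeper, "test1.jar"))
    os.symlink("test1.jar", os.path.join(zookeeper, "test.jar"))
    os.symlink("../../../lib/zookeeper/test.jar",
               os.path.join(hadoop_lib, "test.jar"))

    return (os.path.join(hadoop_lib, "test.jar"),
            os.path.join(zookeeper, "test.jar"))


def resolve_classpath(classpath):
    """Hypothetical helper: replace each CLASSPATH entry by its real path."""
    entries = [entry for entry in classpath.split(os.pathsep) if entry]
    return os.pathsep.join(os.path.realpath(entry) for entry in entries)


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    try:
        failing, working = build_symlink_layout(root)
        print("failing entry resolves to:", resolve_classpath(failing))
        print("working entry resolves to:", resolve_classpath(working))
    finally:
        shutil.rmtree(root)
```

A possible workaround, untested against this bug, would be to set os.environ["CLASSPATH"] = resolve_classpath(os.environ["CLASSPATH"]) before calling hdfs.connect, so that the JVM only sees fully resolved paths. Whether that avoids the NoClassDefFoundError is an assumption.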
I would expect pyarrow to work with any valid path to the .jar.

Please note that these paths are not made up: they were copied from the Cloudera distribution of Hadoop (the original file was zookeeper.jar). Because of this issue, our customer currently can't use the pyarrow library in Oozie workflows.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)