Kamaraju created ARROW-5236:
-------------------------------

             Summary: hdfs.connect() is trying to load libjvm in windows
                 Key: ARROW-5236
                 URL: https://issues.apache.org/jira/browse/ARROW-5236
             Project: Apache Arrow
          Issue Type: Bug
         Environment: Windows 7 Enterprise, pyarrow 0.13.0
            Reporter: Kamaraju
This issue was originally reported at [https://github.com/apache/arrow/issues/4215]. Raising a Jira as per Wes McKinney's request.

Summary: The following script
{code}
$ cat expt2.py
import pyarrow as pa
fs = pa.hdfs.connect()
{code}
tries to load libjvm on Windows 7, which is not expected.
{noformat}
$ python ./expt2.py
Traceback (most recent call last):
  File "./expt2.py", line 3, in <module>
    fs = pa.hdfs.connect()
  File "C:\ProgramData\Continuum\Anaconda\envs\scratch_py36_pyarrow\lib\site-packages\pyarrow\hdfs.py", line 183, in connect
    extra_conf=extra_conf)
  File "C:\ProgramData\Continuum\Anaconda\envs\scratch_py36_pyarrow\lib\site-packages\pyarrow\hdfs.py", line 37, in __init__
    self._connect(host, port, user, kerb_ticket, driver, extra_conf)
  File "pyarrow\io-hdfs.pxi", line 89, in pyarrow.lib.HadoopFileSystem._connect
  File "pyarrow\error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Unable to load libjvm
{noformat}

There is no libjvm file in a Windows Java installation:
{noformat}
$ echo $JAVA_HOME
C:\Progra~1\Java\jdk1.8.0_141
$ find $JAVA_HOME -iname '*libjvm*'
<returns nothing.>
{noformat}

I see the libjvm error with both the 0.11.1 and 0.13.0 versions of pyarrow.

Steps to reproduce the issue (with more details):

Create the environment:
{noformat}
$ cat scratch_py36_pyarrow.yml
name: scratch_py36_pyarrow
channels:
  - defaults
dependencies:
  - python=3.6.8
  - pyarrow
{noformat}
{noformat}
$ conda env create -f scratch_py36_pyarrow.yml
{noformat}

Apply the following patch to lib/site-packages/pyarrow/hdfs.py. I had to do this because the Hadoop installation that comes with the MapR <[https://mapr.com/]> Windows client only has $HADOOP_HOME/bin/hadoop.cmd. There is no file named $HADOOP_HOME/bin/hadoop, so the subsequent subprocess.check_output call fails with FileNotFoundError if this patch is not applied.
{noformat}
$ cat ~/x/patch.txt
131c131
< hadoop_bin = '{0}/bin/hadoop'.format(os.environ['HADOOP_HOME'])
---
> hadoop_bin = '{0}/bin/hadoop.cmd'.format(os.environ['HADOOP_HOME'])
$ patch /c/ProgramData/Continuum/Anaconda/envs/scratch_py36_pyarrow/lib/site-packages/pyarrow/hdfs.py ~/x/patch.txt
patching file /c/ProgramData/Continuum/Anaconda/envs/scratch_py36_pyarrow/lib/site-packages/pyarrow/hdfs.py
{noformat}

Activate the environment:
{noformat}
$ source activate scratch_py36_pyarrow
{noformat}

Sample script:
{noformat}
$ cat expt2.py
import pyarrow as pa
fs = pa.hdfs.connect()
{noformat}

Execute the script:
{noformat}
$ python ./expt2.py
Traceback (most recent call last):
  File "./expt2.py", line 3, in <module>
    fs = pa.hdfs.connect()
  File "C:\ProgramData\Continuum\Anaconda\envs\scratch_py36_pyarrow\lib\site-packages\pyarrow\hdfs.py", line 183, in connect
    extra_conf=extra_conf)
  File "C:\ProgramData\Continuum\Anaconda\envs\scratch_py36_pyarrow\lib\site-packages\pyarrow\hdfs.py", line 37, in __init__
    self._connect(host, port, user, kerb_ticket, driver, extra_conf)
  File "pyarrow\io-hdfs.pxi", line 89, in pyarrow.lib.HadoopFileSystem._connect
  File "pyarrow\error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Unable to load libjvm
{noformat}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
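For context, the two Windows-specific differences this report runs into (Hadoop ships bin/hadoop.cmd rather than bin/hadoop, and the JVM shared library is jvm.dll rather than a libjvm file) could be sketched as below. This is a hypothetical illustration of the platform differences, not pyarrow's actual discovery logic; the function names hadoop_bin and jvm_library_name are invented for this sketch.

{code}
import os
import sys


def hadoop_bin(hadoop_home):
    """Pick the Hadoop launcher for the current platform.

    Windows Hadoop distributions (e.g. the MapR client described
    above) ship only bin/hadoop.cmd, not a bare bin/hadoop script.
    """
    name = "hadoop.cmd" if sys.platform == "win32" else "hadoop"
    return os.path.join(hadoop_home, "bin", name)


def jvm_library_name():
    """Return the platform-specific name of the JVM shared library.

    A Windows JDK has no file matching '*libjvm*' at all (hence the
    empty find above); its equivalent is jvm.dll.
    """
    if sys.platform == "win32":
        return "jvm.dll"
    if sys.platform == "darwin":
        return "libjvm.dylib"
    return "libjvm.so"
{code}

So on Windows any loading strategy that searches only for a literal "libjvm" file name is bound to fail with "Unable to load libjvm" regardless of whether Java is installed.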