Matthew Rocklin created ARROW-7486:
--------------------------------------

             Summary: Allow HDFS FileSystem to be created without Hadoop present
                 Key: ARROW-7486
                 URL: https://issues.apache.org/jira/browse/ARROW-7486
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
            Reporter: Matthew Rocklin


I would like to be able to construct an HDFS FileSystem object on a machine 
without Hadoop installed.  The object does not need to be able to do anything 
there.  I just need its creation not to fail.

This would enable Dask users to run computations on an HDFS-enabled cluster 
from outside of that cluster.  This almost works today.  We send a small 
computation to a worker (which has HDFS access) to generate the task graph for 
loading data, then we bring that task graph back to the local machine, 
continue building on it, and finally submit everything to the workers 
for execution.

The flaw appears when we bring the task graph back from the worker to the 
client.  It contains a reference to a PyArrow HDFSFileSystem object, which upon 
de-serialization calls _maybe_set_hadoop_classpath().  I suspect that if this 
call were allowed to fail, things would work out OK for us.
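
A minimal sketch of the requested behaviour, for illustration only.  The names 
LazyHDFS and _set_hadoop_classpath are hypothetical stand-ins and are not 
PyArrow's API; the sketch assumes the classpath setup raises an OSError on a 
machine without Hadoop.  It shows a handle whose de-serialization succeeds 
anyway and which only fails once it is actually used.

```python
import pickle
import warnings


def _set_hadoop_classpath():
    # Hypothetical stand-in for the real classpath setup. Here it always
    # fails, simulating a client machine with no Hadoop installation.
    raise FileNotFoundError("hadoop executable not found")


class LazyHDFS:
    """Hypothetical filesystem handle; not the PyArrow HDFS class."""

    def __init__(self, host, port):
        self.host = host
        self.port = port
        try:
            _set_hadoop_classpath()
            self.usable = True
        except OSError as exc:
            # Tolerate the failure so construction (and unpickling) succeeds.
            warnings.warn(f"Hadoop not available, handle is inert: {exc}")
            self.usable = False

    def __reduce__(self):
        # Unpickling re-runs __init__, which is where the classpath setup runs.
        return (LazyHDFS, (self.host, self.port))

    def ls(self, path):
        # Defer the hard failure to the first real filesystem operation.
        if not self.usable:
            raise RuntimeError("HDFS access requires Hadoop on this machine")
        raise NotImplementedError("real HDFS calls are out of scope here")


if __name__ == "__main__":
    # Round-trip through pickle, as happens when a task graph reaches the client.
    fs = pickle.loads(pickle.dumps(LazyHDFS("namenode", 8020)))
    print(fs.host, fs.port, fs.usable)
```

With this shape, the client can hold and re-serialize the object while building 
the task graph, and only the workers (where Hadoop is present) ever call 
methods on it.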

Downstream issue originally reported here: 
https://github.com/dask/dask/issues/5758



--
This message was sent by Atlassian Jira
(v8.3.4#803005)