jainpriyansh786 opened a new issue #2348:
URL: https://github.com/apache/hudi/issues/2348
Hi,
I am trying to read a Hudi dataset in PySpark with Hudi 0.5.3. To avoid
Spark listing the partitions in S3, I want to specify the partition paths
to load data from. In Spark, the `load` method accepts a list of paths for
other data formats such as CSV and Parquet.
However, when I use `hudi` as the format, I get the following exception:
```org.apache.hudi.exception.HoodieException: 'path' must be specified```
partition_paths = ['s3a://analytics/partition1', 's3a://analytics/partition2/']
df = spark.read.format('hudi').load(path=partition_paths)
I see the same exception in Scala as well.
Does Hudi not support loading from multiple paths on S3?
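One possible workaround, sketched here as an untested assumption rather than a confirmed fix: since the exception complains about a missing single `path`, the sibling partition paths could be collapsed into one Hadoop-style brace glob and passed to `load` as a single string. The helper name `to_brace_glob` is hypothetical, and whether Hudi 0.5.3 resolves such a glob correctly (or needs a trailing `/*`) has not been verified. The Spark call is left commented out so the snippet runs without a cluster.

```python
import posixpath


def to_brace_glob(paths):
    """Collapse sibling paths under one parent into a single brace glob.

    Hypothetical helper, not part of Hudi or Spark.
    """
    # Drop trailing slashes so dirname/basename split at the partition name.
    cleaned = [p.rstrip('/') for p in paths]
    parents = {posixpath.dirname(p) for p in cleaned}
    if len(parents) != 1:
        raise ValueError("all paths must share the same parent directory")
    names = [posixpath.basename(p) for p in cleaned]
    # Doubled braces produce literal '{' and '}' in the formatted string.
    return "{}/{{{}}}".format(parents.pop(), ",".join(names))


partition_paths = ['s3a://analytics/partition1', 's3a://analytics/partition2/']
glob_path = to_brace_glob(partition_paths)

# Assumption: a single string sets the 'path' option that the Hudi source checks.
# df = spark.read.format('hudi').load(glob_path)
```

Another option under the same caveat would be to load each path separately and union the resulting DataFrames, at the cost of one read per partition.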
**Environment Description**
* Hudi version : 0.5.3
* Spark version : 2.4.5
* Storage (HDFS/S3/GCS..) : AWS S3
**Stacktrace**
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 170, in load
    return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File "/usr/local/lib/python3.7/site-packages/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/local/lib/python3.7/site-packages/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o191.load.
: org.apache.hudi.exception.HoodieException: 'path' must be specified.
    at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:58)
    at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:47)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
```