jtmzheng opened a new issue #2878: URL: https://github.com/apache/hudi/issues/2878
**Describe the problem you faced**

This is the same issue as https://github.com/apache/hudi/issues/2566#issuecomment-821583643 in 0.8; it seems the latest version did not fix it (unless I'm doing something wrong here).

**To Reproduce**

I was able to reproduce this issue with the Dockerfile below, building with `docker build -f hudi.Dockerfile -t test_hudi .` and running `py.test -s --verbose test_hudi.py` in the container.

Steps to reproduce the behavior:

Dockerfile:

```
# NB: We use this base image for leveraging Docker support on EMR 6.x
FROM amazoncorretto:8

RUN yum -y update
RUN yum -y install yum-utils
RUN yum -y groupinstall development
RUN yum -y install python3 python3-dev python3-pip python3-virtualenv
RUN yum -y install lzo-devel lzo

ENV PYSPARK_DRIVER_PYTHON python3
ENV PYSPARK_PYTHON python3

RUN ln -sf /usr/bin/python3 /usr/bin/python && \
    ln -sf /usr/bin/pip3 /usr/bin/pip

RUN pip install pyspark==3.0.0
RUN pip install pytest==6.1.1

COPY ./test_hudi.py .
# RUN py.test -s --verbose test_hudi.py
```

test_hudi.py:

```
import pytest
from pyspark import SparkConf
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import Row


def test_hudi(tmp_path):
    SparkContext.getOrCreate(
        conf=SparkConf()
        .setAppName("testing")
        .setMaster("local[6]")
        .set(
            "spark.jars.packages",
            "org.apache.hudi:hudi-spark-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.0,org.apache.spark:spark-sql_2.12:3.0.0",
        )
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.sql.hive.convertMetastoreParquet", "false")
    )
    spark = SparkSession.builder.getOrCreate()

    hudi_options = {
        "hoodie.table.name": "test",
        "hoodie.datasource.write.recordkey.field": "id",
        "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
        "hoodie.datasource.write.partitionpath.field": "year,month,day",
        "hoodie.datasource.write.table.name": "test",
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.precombine.field": "ts",
    }

    df = spark.createDataFrame(
        [
            Row(id=1, year=2020, month=7, day=5, ts=1),
        ]
    )
    df.write.format("hudi").options(**hudi_options).mode("append").save(str(tmp_path))

    read_df = spark.read.format("parquet").load(str(tmp_path) + "/*/*/*")
    # This works
    print(read_df.collect())

    read_df = spark.read.format("hudi").load(str(tmp_path) + "/*/*/*")
    # This does not
    print(read_df.collect())
```

**Expected behavior**

The test above passes.

**Additional context**

See https://issues.apache.org/jira/browse/HUDI-1568

**Stacktrace**

Same as https://github.com/apache/hudi/issues/2566#issue-805978132

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
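As background on why the repro appends a three-level `/*/*/*` suffix to the path: the table is partitioned on `year,month,day`, so the glob resolves one path component per partition column. A minimal plain-Python sketch of that resolution (directory names are hypothetical and illustrative only; depending on configuration Hudi may instead write hive-style names like `year=2020`):

```python
import glob
import os
import tempfile

# Hypothetical layout: a single year/month/day partition under a temp base path.
base = tempfile.mkdtemp()
part = os.path.join(base, "2020", "7", "5")
os.makedirs(part)
open(os.path.join(part, "data.parquet"), "w").close()

# The three-level glob matches the leaf partition directories (one "*" per
# partition column); Spark then reads the files beneath each matched directory.
matches = glob.glob(os.path.join(base, "*", "*", "*"))
print(matches)
```

This is only meant to show what paths the glob expands to; the reported bug is that the `hudi` datasource, unlike the `parquet` one, fails when given such a globbed path.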
