p-powell opened a new issue, #5351: URL: https://github.com/apache/hudi/issues/5351

**Describe the problem you faced**

Concerned about performance. How long should the following mocked-up sample take to write to S3? There are 1,369,765 records and 308 columns. It is taking ~10.5 min running in a Docker container on a t2.xlarge EC2 instance using the datamechanics/spark:3.2.0-hadoop-3.3.1-java-11-scala-2.12-python-3.8-latest image. Any suggestions on how to increase performance? The sample file generated below is just to illustrate our issue.

**To Reproduce**

Steps to reproduce the behavior:

1. Start the Docker container:

```
docker run -it datamechanics/spark:3.2.0-hadoop-3.3.1-java-11-scala-2.12-python-3.8-latest /bin/bash
```

2. Download the sample file:

```
cd /tmp
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv
```

3. Start the Spark shell:

```
/opt/spark/bin/spark-shell --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1,org.apache.hadoop:hadoop-aws:2.7.3 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --driver-memory 16g
```

4. Run the following code (replace {__bucket___} with a valid bucket):

```scala
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.hive.MultiPartKeysValueExtractor
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

var df = spark.read.option("header", "true").csv("file:///tmp/yellow_tripdata_2021-01.csv")

var a = 0
var b = 0

// Just constructing a table for testing.
var cols = df.columns
var num_cols = cols.length

// Duplicate the columns to make a wider dataset.
for (a <- 1 to 16; b <- 0 to num_cols - 1) {
  var col_name = cols(b)
  var new_col_name = col_name + "_" + a
  df = df.withColumn(new_col_name, col(col_name))
}

// Everything is going to be written to one partition.
val w = Window.partitionBy(lit("A")).orderBy(lit("A"))
var df_id = df.withColumn("_id", row_number().over(w)).withColumn("partpath", lit("N"))

val tableName = "hudi_test"
val basePath = "s3a://{__bucket___}/hudi_test_table"

val starttime = System.nanoTime
df_id.write.format("hudi").
  option(PRECOMBINE_FIELD_OPT_KEY, "_id").
  option(RECORDKEY_FIELD_OPT_KEY, "_id").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partpath").
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
  option(TABLE_NAME, tableName).
  mode(Overwrite).
  save(basePath)
val duration = (System.nanoTime - starttime) / 1e9d
print("write time: " + duration)
```

**Expected behavior**

Not sure if this is the performance we should expect with this instance size, or if there are suggestions on how to increase it.

**Environment Description**

* Hudi version : 0.8.0 / 0.9.0
* Spark version : 3.0.1
* Hive version :
* Hadoop version : 3.3.1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : yes
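For reference, one variant we could also time is the same write issued as a `bulk_insert` instead of an `upsert`, since this is an initial load into an empty table. This is only a sketch for comparison, not something benchmarked above; the `hoodie.bulkinsert.shuffle.parallelism` value of 4 is an arbitrary placeholder, not a tuned setting.

```scala
// Illustrative variant only: bulk_insert for the initial load, instead of upsert.
// The parallelism value below is a placeholder assumption, not a tuned setting.
df_id.write.format("hudi").
  option(PRECOMBINE_FIELD_OPT_KEY, "_id").
  option(RECORDKEY_FIELD_OPT_KEY, "_id").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partpath").
  option("hoodie.datasource.write.operation", "bulk_insert").
  option("hoodie.bulkinsert.shuffle.parallelism", "4").
  option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
  option(TABLE_NAME, tableName).
  mode(Overwrite).
  save(basePath)
```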
Concerned about performance. How long should the following mocked-up sample take to write to s3? There are 1,369,765 records and 308 columns. It is taking ~10.5min running in docker container on an t2.xlarge ec2 instance using the datamechanics/spark:3.2.0-hadoop-3.3.1-java-11-scala-2.12-python-3.8-latest image. Any suggestions how to increate performance. The sample file generated below is just to illustrate our issue. Steps to reproduce the behavior: 1. Start docker container docker run -it datamechanics/spark:3.2.0-hadoop-3.3.1-java-11-scala-2.12-python-3.8-latest /bin/bash 2. Download sample file cd /tmp wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv 3. Start spark shell /opt/spark/bin/spark-shell --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1,org.apache.hadoop:hadoop-aws:2.7.3 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --driver-memory 16g 4. run the following code(replace {__bucket___} with a valid bucket): mport org.apache.spark.sql.SaveMode import org.apache.spark.sql.functions._ import org.apache.hudi.DataSourceWriteOptions import org.apache.hudi.config.HoodieWriteConfig import org.apache.hudi.hive.MultiPartKeysValueExtractor import org.apache.hudi.QuickstartUtils._ import scala.collection.JavaConversions._ import org.apache.spark.sql.SaveMode._ import org.apache.hudi.DataSourceReadOptions._ import org.apache.hudi.DataSourceWriteOptions._ import org.apache.hudi.config.HoodieWriteConfig._ import org.apache.spark.sql.expressions.Window import org.apache.spark.sql.functions.row_number var df = spark.read.option("header","true").csv(file:///tmp/yellow_tripdata_2021-01.csv); var a = 0; var b = 0; // Just constructing a table for testing. var cols = df.columns; var num_cols = cols.length; // duplicating colums to make a larger dataset for( a <- 1 to 16; b <- 0 to num_cols-1){ var col_name = cols(b); var new_col_name = col_name + "_" + a; df = df.withColumn(new_col_name, col(col_name)); }; // going to written to one partition val w = Window.partitionBy(lit('A')).orderBy(lit('A')) var df_id = df.withColumn("_id", row_number().over(w)).withColumn("partpath", lit('N')) val tableName = "hudi_test" val basePath = "s3a://{__bucket___}/hudi_test_table" val starttime = System.nanoTime df_id.write.format("hudi"). option(PRECOMBINE_FIELD_OPT_KEY, "_id"). option(RECORDKEY_FIELD_OPT_KEY, "_id"). option(PARTITIONPATH_FIELD_OPT_KEY, "partpath"). option("hoodie.datasource.write.operation","upsert"). option("hoodie.datasource.write.table.type","COPY_ON_WRITE"). option(TABLE_NAME, tableName). mode(Overwrite). save(basePath) val duration = (System.nanoTime - starttime) / 1e9d print("write time:" + duration ) **Expected behavior** Not sure if this is the expected performance that we can expect with this instance size. Or if there are suggestions on how to increase the performance. **Environment Description** * Hudi version : 0.8.0 / 0.9.0 * Spark version : 3.0.1 * Hive version : * Hadoop version : 3.3.1 * Storage (HDFS/S3/GCS..) : * Running on Docker? (yes/no) : yes -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org