Hi Spark community,

Any resolution would be highly appreciated.

A few additional observations from my side:

The lag in writing Parquet exists in Spark 3.5.0, but there is no lag in Spark 3.1.2
or 2.4.5.

Also, I found that the WholeStageCodegen(1) --> ColumnarToRow task is the
one taking the most time (almost 3 minutes for a simple 3 MB file) in
Spark 3.5.0. The input batch size of this stage is 10 and the output
record count is 30,000. The same ColumnarToRow task in Spark 3.1.2
finishes in 10 seconds.
Further, with Spark 3.5.0, if I cache the DataFrame, materialise it using
df.count(), and then write the DataFrame to a Parquet file, ColumnarToRow
gets called twice: the first call takes 10 seconds and the second one 3 minutes.
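
For anyone who wants to narrow this down further, here is a minimal sketch
of a diagnostic one could try (the paths are placeholders taken from the
original mail, and spark.sql.parquet.enableVectorizedReader is a standard
Spark SQL setting): disabling the vectorized Parquet reader makes the scan
emit rows directly instead of columnar batches, so comparing the write time
with and without it helps isolate whether the ColumnarToRow conversion is
the bottleneck.

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("columnar_to_row_check").getOrCreate()

# Placeholder paths; substitute your own input/output folders.
src = "gs://input_folder/input1/key=20240610"
dst = "gs://output_folder/output/key=20240610"

# Baseline: vectorized Parquet reader (default); the plan shows ColumnarToRow.
df = spark.read.parquet(src)
df.explain()  # look for WholeStageCodegen / ColumnarToRow in the physical plan

start = time.time()
df.write.mode("overwrite").parquet(dst)
print(f"write with vectorized reader: {time.time() - start:.1f}s")

# Re-read with the vectorized reader disabled so the scan produces rows
# directly and no ColumnarToRow step appears in the plan.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
df_rows = spark.read.parquet(src)

start = time.time()
df_rows.write.mode("overwrite").parquet(dst)
print(f"write without vectorized reader: {time.time() - start:.1f}s")

spark.stop()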

On Wed, 31 Jul, 2024, 10:14 PM Bijoy Deb, <bijoy.comput...@gmail.com> wrote:

> Hi,
>
> We are using Spark on-premise to simply read a parquet file from
> GCS(Google Cloud storage) into the DataFrame and write the DataFrame into
> another folder in parquet format in GCS, using below code:
>
> ____________________________________________
>
> import time
>
> from pyspark.sql import SparkSession
>
> DFS_BLOCKSIZE = 512 * 1024 * 1024
>
> spark = SparkSession.builder \
>     .appName("test_app_parquet_load") \
>     .config("spark.master", "spark://spark-master-svc:7077") \
>     .config("spark.driver.maxResultSize", "1g") \
>     .config("spark.driver.memory", "1g") \
>     .config("spark.executor.cores", 4) \
>     .config("spark.sql.shuffle.partitions", 16) \
>     .config("spark.sql.files.maxPartitionBytes", DFS_BLOCKSIZE) \
>     .getOrCreate()
>
>
> folder="gs://input_folder/input1/key=20240610"
> print(f"reading parquet from {folder}")
>
> start_time1 = time.time()
>
> data_df = spark.read.parquet(folder)
>
> end_time1 = time.time()
> print(f"Time duration for reading parquet t1: {end_time1 - start_time1}")
>
>
> start_time2 = time.time()
>
> data_df.write.mode("overwrite").parquet("gs://output_folder/output/key=20240610")
>
> end_time2 = time.time()
> print(f"Time duration for writing parquet t3: {end_time2 - start_time2}")
>
> spark.stop()
>
> ______________________________
>
>
> However, we observed a drastic time difference between Spark 2.4.5 and
> 3.5.0 in the writing step. Even with the local filesystem instead of
> GCS, Spark 3.5.0 takes a long time.
>
> In Spark 2.4.5, the above code takes about 10 seconds for the Parquet read
> and 20 seconds for the write, while in Spark 3.5.0 the read takes a similar
> time but the write takes nearly 3 minutes. The size of the file is just 3 MB.
> Further, we have noticed that if we read a CSV file instead of Parquet into a
> DataFrame and write it to another folder in Parquet format, Spark 3.5.0 takes
> relatively less time to write, about 30-40 seconds.
>
> So, it looks like only the combination of reading a Parquet file into a
> DataFrame and writing that DataFrame to another Parquet file takes too long
> in the case of Spark 3.5.0.
>
> We are seeing no slowness even with Spark 3.1.2. So, it seems that the
> issue of the Spark job taking too long to write a Parquet-based DataFrame
> into another Parquet file (in GCS or the local filesystem) is specific to
> Spark 3.5.0. It looks like either a potential bug in Spark 3.5.0 or some
> Parquet-related configuration that is not clearly documented.
> Any help in this regard would be highly appreciated.
>
>
> Thanks,
>
> Bijoy
>
