bvaradar commented on pull request #4910: URL: https://github.com/apache/hudi/pull/4910#issuecomment-1079991479
> [#4910 (comment)](https://github.com/apache/hudi/pull/4910#discussion_r834319573) @bvaradar @YannByron this is the test result.
>
> **Test case:**
>
> dataSize: 100 GB with 101 columns
>
> testQuery:
>
> ```scala
> spark.time { spark.sql("select col99, col98 from hudicow_100g where col99 > '77'").count }
> ```
>
> **Test resources:**
>
> ```shell
> spark-shell --master yarn --driver-memory 20g --executor-memory 8g \
>   --executor-cores 3 --num-executors 10 \
>   --conf spark.sql.parquet.enableVectorizedReader=true \
>   --jars /opt/hudi-spark3.1.2-bundle_2.12-0.11.0-SNAPSHOT.jar \
>   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> ```
>
> **Test result:**
>
> |               | 1st      | 2nd      | 3rd      | 4th      |
> |---------------|----------|----------|----------|----------|
> | Before_change | 23051 ms | 4919 ms  | 4777 ms  | 4728 ms  |
> | After_change  | 23486 ms | 14705 ms | 11089 ms | 11564 ms |
>
> Let's ignore the first query result, since Spark is still distributing and initializing the jar packages.
>
> We can see that the performance degradation is very obvious.

@xiarixiaoyao: I am not sure what "before change" and "after change" mean here. If I understand correctly, this schema-evolution change results in vectorized reading not working in `BaseFileOnlyViewRelation`. Is that correct? If so, can we fix it in this PR itself instead of a follow-up PR? Also, is vectorized reading broken only for tables with schema evolution enabled, or are all tables (even those without schema evolution) affected by the change in `BaseFileOnlyViewRelation`? You also mentioned that you have removed the old Hudi relation. Is that the fix for restoring vectorized reading? In that case, does this change have vectorized reading working?

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
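As a side note on the vectorized-reading question: one way to check whether the vectorized Parquet reader is actually engaged is to toggle `spark.sql.parquet.enableVectorizedReader` and inspect the physical plan. The sketch below is illustrative only, assuming a spark-shell session set up as in the benchmark above; the table and column names (`hudicow_100g`, `col99`, `col98`) are taken from that benchmark, and `supportsColumnar` is the Spark 3.x `SparkPlan` flag indicating columnar (vectorized) execution.

```scala
// Sketch: run the same query with the vectorized Parquet reader on and off,
// and report whether each physical-plan node supports columnar execution.
// Assumes a spark-shell session with the Hudi bundle on the classpath.
val query = "select col99, col98 from hudicow_100g where col99 > '77'"

for (enabled <- Seq("true", "false")) {
  spark.conf.set("spark.sql.parquet.enableVectorizedReader", enabled)
  println(s"enableVectorizedReader=$enabled")
  spark.time { spark.sql(query).count() }
}

// If the scan node reports supportsColumnar=false even with the flag enabled,
// the relation is not using the vectorized reader.
spark.sql(query).queryExecution.executedPlan.foreach { node =>
  println(s"${node.nodeName}: supportsColumnar=${node.supportsColumnar}")
}
```

Comparing this output on a table with and without schema evolution enabled would help answer whether the regression is limited to evolved tables.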
