bvaradar commented on pull request #4910: URL: https://github.com/apache/hudi/pull/4910#issuecomment-1079991479
> [#4910 (comment)](https://github.com/apache/hudi/pull/4910#discussion_r834319573) @bvaradar @YannByron this is the test result.
>
> **Test case:**
>
> dataSize: 100 GB with 101 columns
>
> testQuery:
>
> ```scala
> spark.time { spark.sql("select col99, col98 from hudicow_100g where col99 > '77'").count }
> ```
>
> **Test resources:**
>
> ```shell
> spark-shell --master yarn --driver-memory 20g --executor-memory 8g \
>   --executor-cores 3 --num-executors 10 \
>   --conf spark.sql.parquet.enableVectorizedReader=true \
>   --jars /opt/hudi-spark3.1.2-bundle_2.12-0.11.0-SNAPSHOT.jar \
>   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> ```
>
> **Test result:**
>
> |               | 1st      | 2nd      | 3rd      | 4th      |
> |---------------|----------|----------|----------|----------|
> | Before_change | 23051 ms | 4919 ms  | 4777 ms  | 4728 ms  |
> | After_change  | 23486 ms | 14705 ms | 11089 ms | 11564 ms |
>
> Let's ignore the first query result, since Spark is still distributing and initializing the jar packages.
>
> We can see that the performance degradation is very obvious.

@xiarixiaoyao: I am not sure what "before change" and "after change" mean here. If I understand correctly, this schema-evolution change results in vectorized reading not working in `BaseFileOnlyViewRelation`. Is that correct? If so, can we fix it in this PR itself instead of a follow-up PR? Also, is vectorized reading broken only for tables with schema evolution enabled, or are all tables (even those without schema evolution) affected by the change in `BaseFileOnlyViewRelation`? You also mentioned that you have removed the old Hudi relation. Is that the fix for restoring vectorized reading? In that case, does this change have vectorized reading working?

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
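As a side note on the vectorized-reading question: one way to check whether the vectorized Parquet reader is actually engaged is to toggle `spark.sql.parquet.enableVectorizedReader` and inspect the physical plan. The sketch below is illustrative only, assuming a spark-shell session set up as in the benchmark above; the table and column names (`hudicow_100g`, `col99`, `col98`) are taken from that benchmark, and `supportsColumnar` is the Spark 3.x `SparkPlan` flag indicating columnar (vectorized) execution.

```scala
// Sketch: run the same query with the vectorized Parquet reader on and off,
// and report whether each physical-plan node supports columnar execution.
// Assumes a spark-shell session with the Hudi bundle on the classpath.
val query = "select col99, col98 from hudicow_100g where col99 > '77'"

for (enabled <- Seq("true", "false")) {
  spark.conf.set("spark.sql.parquet.enableVectorizedReader", enabled)
  println(s"enableVectorizedReader=$enabled")
  spark.time { spark.sql(query).count() }
}

// If the scan node reports supportsColumnar=false even with the flag enabled,
// the relation is not using the vectorized reader.
spark.sql(query).queryExecution.executedPlan.foreach { node =>
  println(s"${node.nodeName}: supportsColumnar=${node.supportsColumnar}")
}
```

Comparing this output on a table with and without schema evolution enabled would help answer whether the regression is limited to evolved tables.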
