[GitHub] [hudi] xiarixiaoyao edited a comment on pull request #4910: [RFC-33] [HUDI-2429][Stacked on HUDI-2560] Support full Schema evolution for Spark

GitBox Sun, 27 Mar 2022 18:56:10 -0700


xiarixiaoyao edited a comment on pull request #4910:
URL: https://github.com/apache/hudi/pull/4910#issuecomment-1080095336



   > > [#4910 (comment)](https://github.com/apache/hudi/pull/4910#discussion_r834319573) @bvaradar @YannByron these are the test results:
   > > Test case:
   > > dataSize: 100 GB of data with 101 columns
   > > testQuery: spark.time {spark.sql("select col99, col98 from hudicow_100g where col99 > '77'").count}
   > > Test resources:
   > > spark-shell --master yarn --driver-memory 20g --executor-memory 8g --executor-cores 3 --num-executors 10 --conf spark.sql.parquet.enableVectorizedReader=true --jars /opt/hudi-spark3.1.2-bundle_2.12-0.11.0-SNAPSHOT.jar --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
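   > > Test result:
   > >
   > > | Run           | 1st      | 2nd      | 3rd      | 4th      |
   > > |---------------|----------|----------|----------|----------|
   > > | Before_change | 23051 ms | 4919 ms  | 4777 ms  | 4728 ms  |
   > > | After_change  | 23486 ms | 14705 ms | 11089 ms | 11564 ms |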
   > >  
   > > Let's ignore the first run, since Spark is still distributing and initializing the jar packages during it.
   > > On the remaining runs the query takes roughly 11 to 15 seconds after the change versus under 5 seconds before, so the performance degradation is obvious.
   > 
   > @xiarixiaoyao : I'm not sure what "before change" and "after change" mean here. If I understand correctly, this schema evolution change results in vectorized reading not working in BaseFileOnlyViewRelation. Is that correct? If so, can we fix it in this PR itself instead of in the next PR? Is vectorized reading broken only for tables with schema evolution enabled, or are all tables (even those with no schema evolution) affected by the change in BaseFileOnlyViewRelation?
   > 
   > Also, you mentioned that you have removed the old Hudi relation. Is this the fix for improving vectorized reading? If so, does vectorized reading work with this change?
   
   1) This problem has nothing to do with schema evolution. It is a problem in Hudi itself.
   2) We have already fixed this problem in this PR.
   3) This problem has been discussed with xushiyan, alexeykudinkin, and YannByron. It can be fixed in this PR or in another PR; I prefer the latter.
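
For reference, here is a rough sketch of the arithmetic behind the degradation in the quoted test result. It uses only the timings from the table, excluding the first (cold) run. The object and method names are illustrative and are not part of Hudi or Spark.

```scala
// Warm-run timings in milliseconds, copied from the quoted test result
// (2nd, 3rd and 4th runs; the cold 1st run is excluded).
object WarmRunSlowdown {
  val beforeMs: Seq[Double] = Seq(4919.0, 4777.0, 4728.0)
  val afterMs: Seq[Double] = Seq(14705.0, 11089.0, 11564.0)

  // Arithmetic mean of a non-empty sequence of timings.
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Ratio of the mean "after" time to the mean "before" time.
  def slowdown(before: Seq[Double], after: Seq[Double]): Double =
    mean(after) / mean(before)

  def main(args: Array[String]): Unit = {
    println(f"before: ${mean(beforeMs)}%.0f ms, after: ${mean(afterMs)}%.0f ms")
    println(f"slowdown: ${slowdown(beforeMs, afterMs)}%.2fx")
  }
}
```

On these numbers the warm runs average about 4.8 s before the change and about 12.5 s after it, a slowdown of roughly 2.6x.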


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
