[ https://issues.apache.org/jira/browse/HIVE-28650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17903922#comment-17903922 ]
Sungwoo Park edited comment on HIVE-28650 at 12/8/24 3:26 PM:
--------------------------------------------------------------

I ran a simple experiment in the same small cluster, changing the parameters fs.s3a.vectored.read.min.seek.size and fs.s3a.vectored.read.max.merged.size. The experiment used ORC 2.0.3 and ran either TPC-DS queries 1 to 4 or the entire 99 queries. The data size is TPC-DS 1TB, but the running time is not significant because our cluster uses just a single MinIO server.

=== Experiment 1 (default configuration)
fs.s3a.vectored.read.min.seek.size=4K
fs.s3a.vectored.read.max.merged.size=1M
TPC-DS query 1 to 4: 225.15s
TPC-DS 99 queries: 7454s, 7442s, 7372s
number of s3.ListObjectsV2 = 10039
number of s3.HeadObject = 11275
number of s3.GetObject = 100665
Average data size in s3.GetObject: 664215.97

=== Experiment 2
fs.s3a.vectored.read.min.seek.size=256K
fs.s3a.vectored.read.max.merged.size=2M
TPC-DS query 1 to 4: 230.172s
number of s3.ListObjectsV2 = 10036
number of s3.HeadObject = 11263
number of s3.GetObject = 93936
Average data size in s3.GetObject: 711811.89

=== Experiment 3
fs.s3a.vectored.read.min.seek.size=512K
fs.s3a.vectored.read.max.merged.size=4M
TPC-DS query 1 to 4: 222.783s
TPC-DS 99 queries: 7649.588s, 7333.503s
number of s3.ListObjectsV2 = 10036
number of s3.HeadObject = 11266
number of s3.GetObject = 76013
Average data size in s3.GetObject: 880055.84

As expected, increasing fs.s3a.vectored.read.min.seek.size and fs.s3a.vectored.read.max.merged.size reduces the number of s3.GetObject operations while increasing the average data size of each s3.GetObject operation. So what I can confirm from the experiment is that Vectored IO seems to work correctly.

> Upgrade Apache ORC version to 2.0.3
> -----------------------------------
>
>          Key: HIVE-28650
>          URL: https://issues.apache.org/jira/browse/HIVE-28650
>      Project: Hive
>   Issue Type: Improvement
>     Reporter: Butao Zhang
>     Priority: Major
>
> ORC 2.0.x added the Hadoop Vectored IO feature in ORC-1251.
> We can try to upgrade ORC to the latest 2.0.x version to make this feature
> work in Hive.
> But ORC 2.0.x is built on JDK17+, so we need to upgrade Hive's JDK to 17+
> first. This depends on ticket HIVE-26473, which upgrades to JDK17.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
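The s3.GetObject figures reported for the three experiments can be compared against the default configuration with a little arithmetic. The following is only a sketch: the `Run` class and the `percent_change` and `compare` helpers are hypothetical names introduced here, the numbers are copied from the comment, and the unit of the average size is assumed to be bytes since the comment does not state it. The estimated total is simply the call count multiplied by the average size, which holds only if the average was taken over all counted calls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One experiment from the comment (hypothetical container for illustration)."""
    name: str
    get_object_calls: int        # "number of s3.GetObject"
    avg_get_object_size: float   # "Average data size in s3.GetObject" (bytes assumed)


# Figures copied verbatim from the three experiments in the comment.
RUNS = [
    Run("Experiment 1 (min.seek=4K, max.merged=1M)", 100665, 664215.97),
    Run("Experiment 2 (min.seek=256K, max.merged=2M)", 93936, 711811.89),
    Run("Experiment 3 (min.seek=512K, max.merged=4M)", 76013, 880055.84),
]


def percent_change(new: float, base: float) -> float:
    """Relative change of `new` with respect to `base`, in percent."""
    return (new - base) / base * 100.0


def compare(runs: list[Run]) -> list[dict]:
    """Compare every run against the first one, which is treated as the baseline."""
    base = runs[0]
    rows = []
    for run in runs:
        rows.append({
            "name": run.name,
            "calls_change_pct": percent_change(run.get_object_calls,
                                               base.get_object_calls),
            "size_change_pct": percent_change(run.avg_get_object_size,
                                              base.avg_get_object_size),
            # Derived estimate, not a measured value: calls * average size.
            "estimated_total": run.get_object_calls * run.avg_get_object_size,
        })
    return rows


if __name__ == "__main__":
    for row in compare(RUNS):
        print(f"{row['name']}: "
              f"GetObject calls {row['calls_change_pct']:+.1f}%, "
              f"average size {row['size_change_pct']:+.1f}%, "
              f"estimated total {row['estimated_total']:.3e}")
```

With the reported figures, Experiments 2 and 3 issue roughly 6.7% and 24.5% fewer s3.GetObject calls than the default configuration, while the average size per call grows by roughly 7.2% and 32.5%. Under the stated assumption, the estimated total volume stays within about 0.1% across the three runs, which is consistent with ranges being merged into fewer, larger requests.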