[ https://issues.apache.org/jira/browse/HIVE-28650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17902254#comment-17902254 ]
Steve Loughran commented on HIVE-28650:
---------------------------------------

Those slides were done by [~mthakur]; I just reused them.

We would love to know what you're seeing here. I suspect that while you are seeing more GET requests, each one is for a small amount of data, and we are issuing them in parallel.

If minio can collect access logs, please look at how s3a adds call info into the S3 server logs:
https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/auditing

We put call info into the HTTP referrer header, including something the normal S3 logs do not collect: those ranges.

There are a couple of options you can also use to tune this, and again, we would love to know what you get from tuning it.
* fs.s3a.vectored.read.min.seek.size : the size of the gap between two requests below which we merge them (default: 4k)
* fs.s3a.vectored.read.max.merged.size : max size of a merged GET request before we stop merging (default 1M)

I would recommend you make both of them bigger, trying something like 256K as that seek size, 2M for the max merged size.

Anything you can do in benchmarking to give us better numbers through ORC on Minio will be really useful.

That Facebook Velox paper discusses how they used 500K as their merge threshold. This paper came out after our work, otherwise I think we'd have lifted them:
https://research.facebook.com/publications/velox-metas-unified-execution-engine/

bq. IO reads for nearby columns are typically coalesced (merged) if the gap between them is small enough (currently about 20K for SSD and 500K for disaggregated storage), aiming to serve neighboring reads in as few IO reads as possible.

The more information we get the better, and it is something we could document as well as use to tune those defaults. I do think that min seek size is something we could really increase, with a default value everywhere of something like 16k, except on azure and s3 where we go up to 128k.

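As a rough illustration of how those two options interact, here is a simplified Python model of range coalescing. It is a sketch, not the s3a code: the function name `coalesce`, the `(offset, length)` tuples and the sample ranges are all made up for the example, and it assumes a merge happens when the gap is below the min seek size and the merged request stays within the max merged size.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (offset, length) in bytes; hypothetical representation

K = 1024


def coalesce(ranges: List[Range], min_seek: int, max_merged: int) -> List[Range]:
    """Merge sorted ranges whose gap is under min_seek, capping the merged size.

    Simplified model only: min_seek and max_merged stand in for
    fs.s3a.vectored.read.min.seek.size and fs.s3a.vectored.read.max.merged.size.
    """
    merged: List[Range] = []
    for offset, length in sorted(ranges):
        if merged:
            prev_off, prev_len = merged[-1]
            prev_end = prev_off + prev_len
            gap = offset - prev_end
            new_end = max(prev_end, offset + length)
            # Merge only if the gap is small and the combined read is not too big.
            if gap < min_seek and new_end - prev_off <= max_merged:
                merged[-1] = (prev_off, new_end - prev_off)
                continue
        merged.append((offset, length))
    return merged


# Four small reads, as a columnar reader might issue for one stripe (made-up offsets).
ranges = [(0, 8 * K), (16 * K, 8 * K), (200 * K, 8 * K), (1000 * K, 8 * K)]

small = coalesce(ranges, min_seek=4 * K, max_merged=1024 * K)    # 4k / 1M
large = coalesce(ranges, min_seek=256 * K, max_merged=2048 * K)  # 256K / 2M

print(len(small), "GETs with 4k/1M:", small)
print(len(large), "GETs with 256K/2M:", large)
```

In this model the small settings leave all four reads as separate GETs, while the larger ones fold the first three into a single request that also fetches the bytes in the gaps. That is the trade-off to measure: fewer requests against more data read and discarded.
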
I'd welcome any update to our performance doc performance.md on this too, as well as a minio-specific section:
https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/performance.html

Final points
* for any report on performance, can you give us the storediag output from https://github.com/steveloughran/cloudstore ; a "before" one is something you could do today.
* if you can do this ASAP we could change the values for the 3.4.2 release, which will be kicked off in January

Thanks

> Upgrade Apache ORC version to 2.0.3
> -----------------------------------
>
>                 Key: HIVE-28650
>                 URL: https://issues.apache.org/jira/browse/HIVE-28650
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Butao Zhang
>            Priority: Major
>
> ORC 2.0.x version added the Hadoop Vectored IO feature in ORC-1251.
> We can try to upgrade ORC to latest version 2.0.x to make this feature work
> in Hive.
> But ORC 2.0.x is built on JDK17+, so we need to upgrade Hive jdk to 17+
> first. This depends on this ticket HIVE-26473 upgrading jdk17.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)