[ https://issues.apache.org/jira/browse/HIVE-23947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HIVE-23947:
----------------------------------
    Labels: pull-request-available  (was: )

> Cache affinity is unset for text files read by LLAP
> ---------------------------------------------------
>
>                 Key: HIVE-23947
>                 URL: https://issues.apache.org/jira/browse/HIVE-23947
>             Project: Hive
>          Issue Type: Bug
>          Components: llap
>            Reporter: Ádám Szita
>            Assignee: Ádám Szita
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> LLAP relies on HostAffinitySplitLocationProvider to always route the same
> splits to the same LLAP daemons. With such a consistent distribution of data
> among the nodes we get a good cache hit ratio and thus good performance.
> For text files this is almost never the case:
> HostAffinitySplitLocationProvider is never used, because HS2 does not set the
> cache affinity flag in the job conf for text inputformat content during
> compile. The launched Tez AM then has to rely on HDFS location information
> to route the splits (and therefore tasks) to the executor nodes. This
> location information might not overlap well with where the actual daemons
> are, and in the S3 case the Tez AM will mostly choose executors at random.
> As a result, the hit ratio hardly ever reaches 100%: each time we re-run the
> same query, some disk/S3 reads still occur, causing poor performance. This
> continues until the same content has been populated into all the daemons
> (after running the query tens or hundreds of times).

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
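
For readers unfamiliar with affinity-based routing, the idea in the description (the same split always lands on the same daemon, so its cached data is reused) can be illustrated with a small sketch. This is not Hive's HostAffinitySplitLocationProvider; it swaps in rendezvous (highest-random-weight) hashing with an FNV-1a hash purely for illustration. The class name `SplitAffinitySketch`, the method `pickDaemon`, and the `path:offset` split key are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Illustrative only: NOT Hive's HostAffinitySplitLocationProvider. */
final class SplitAffinitySketch {
    private SplitAffinitySketch() {}

    /** 64-bit FNV-1a hash over the UTF-8 bytes of a string. */
    static long fnv1a64(String s) {
        long hash = 0xcbf29ce484222325L;          // FNV offset basis
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xffL);
            hash *= 0x100000001b3L;               // FNV prime
        }
        return hash;
    }

    /**
     * Picks a daemon for a split using rendezvous hashing: every daemon gets a
     * score derived from (split, daemon) and the highest score wins. The result
     * depends only on the split identity and the set of daemons, not on the
     * order of the list or on where the underlying blocks are stored.
     */
    static String pickDaemon(String path, long startOffset, List<String> daemons) {
        if (daemons.isEmpty()) {
            throw new IllegalArgumentException("no daemons available");
        }
        // Hypothetical split identity: file path plus start offset.
        String splitKey = path + ":" + startOffset;
        String best = null;
        long bestScore = 0L;
        for (String daemon : daemons) {
            long score = fnv1a64(splitKey + "@" + daemon);
            int cmp = (best == null) ? 1 : Long.compareUnsigned(score, bestScore);
            // Ties (extremely unlikely) are broken by daemon name for determinism.
            if (cmp > 0 || (cmp == 0 && daemon.compareTo(best) > 0)) {
                best = daemon;
                bestScore = score;
            }
        }
        return best;
    }
}
```

Calling `SplitAffinitySketch.pickDaemon("s3a://bucket/t/part-0.txt", 0L, daemons)` repeatedly with the same set of daemons returns the same daemon every time, which is the property that lets a cache warm up after a single run. Routing by storage location, or at random as in the S3 case, offers no such guarantee.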