[ 
https://issues.apache.org/jira/browse/HIVE-23947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-23947:
----------------------------------
    Labels: pull-request-available  (was: )

> Cache affinity is unset for text files read by LLAP
> ---------------------------------------------------
>
>                 Key: HIVE-23947
>                 URL: https://issues.apache.org/jira/browse/HIVE-23947
>             Project: Hive
>          Issue Type: Bug
>          Components: llap
>            Reporter: Ádám Szita
>            Assignee: Ádám Szita
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> LLAP relies on HostAffinitySplitLocationProvider to always route the same 
> splits to the same LLAP daemons. Such a consistent distribution of data among 
> the nodes yields a good cache hit ratio and thus good performance.
> For text files this is almost never the case: 
> HostAffinitySplitLocationProvider is never used, because HS2 does not set the 
> cache affinity flag in the job conf for text input format content during 
> compilation. The launched Tez AM then has to rely on HDFS location 
> information to route the splits (and therefore tasks) to the executor nodes. 
> This location information might not overlap well with where the daemons 
> actually run, and in the S3 case the Tez AM will choose executors mostly at 
> random.
> As a result, the hit ratio hardly ever reaches 100%: each time we re-run the 
> same query, some disk/S3 reads still occur, causing poor performance. This 
> persists until the same content has been populated into all the daemons 
> (after running the query tens or hundreds of times).
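
To illustrate why affinity-based routing keeps the cache warm, here is a minimal sketch of deterministic split-to-daemon assignment. It is not the actual HostAffinitySplitLocationProvider code: the class name, the CRC32-based split key, and the use of jump consistent hashing are all assumptions made for illustration. The only point is that the same (path, offset) pair always maps to the same daemon, regardless of where the underlying blocks live.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.CRC32;

// Hypothetical sketch, not Hive code: every name here is illustrative.
public class SplitAffinitySketch {

    // Assumed split identity: file path plus start offset, hashed with CRC32.
    static long splitKey(String path, long startOffset) {
        CRC32 crc = new CRC32();
        crc.update((path + ":" + startOffset).getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    // Jump consistent hash: maps a key to a bucket in [0, numBuckets).
    // When the bucket count changes, only a small share of keys move.
    static int jumpConsistentHash(long key, int numBuckets) {
        long b = -1;
        long j = 0;
        while (j < numBuckets) {
            b = j;
            key = key * 2862933555777941757L + 1;
            j = (long) ((b + 1) * ((double) (1L << 31) / (double) ((key >>> 33) + 1)));
        }
        return (int) b;
    }

    // Picks the daemon for a split; the result depends only on the split
    // identity and the daemon list, not on HDFS or S3 block locations.
    static String chooseDaemon(String path, long startOffset, List<String> daemons) {
        int bucket = jumpConsistentHash(splitKey(path, startOffset), daemons.size());
        return daemons.get(bucket);
    }

    public static void main(String[] args) {
        List<String> daemons = List.of("llap-0", "llap-1", "llap-2");
        String first = chooseDaemon("s3a://bucket/t/part-0000.txt", 0L, daemons);
        String second = chooseDaemon("s3a://bucket/t/part-0000.txt", 0L, daemons);
        System.out.println(first.equals(second)); // prints true
    }
}
```

Without such a mapping, repeated runs of the same query can land a given split on a different daemon each time, so each daemon has to read and cache the data separately.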


