Zoltán Borók-Nagy created IMPALA-14138:
------------------------------------------

             Summary: Detect if filesystem is not co-located in which case we 
shouldn't collect block location information
                 Key: IMPALA-14138
                 URL: https://issues.apache.org/jira/browse/IMPALA-14138
             Project: IMPALA
          Issue Type: Bug
            Reporter: Zoltán Borók-Nagy


For storage systems that support block location information (HDFS, Ozone) we 
always retrieve it with the assumption that we can use it for scheduling, to do 
local reads:
[https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java#L154]

But it's also typical that Impala is not co-located with the storage system, 
even in on-prem deployments. E.g. in PVC DS Impala runs in containers, and 
even if they are co-located with the storage nodes, I don't think we try to 
figure out which container runs on which machine. Short-circuit reads are also 
off the table. In such cases we should not reach out to the storage system to 
collect file information, because that can be very expensive for large tables 
and we won't benefit from it at all.
 
We could construct the file descriptors based on the information we have in the 
Iceberg manifests, and this is what we already do on cloud storage systems.
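
A rough sketch of that idea, not Impala's actual classes: Iceberg manifests 
already record each data file's path and size, so a descriptor can be assembled 
without any filesystem call. ManifestEntry and SimpleFileDescriptor are 
hypothetical stand-ins made up for this example.

```java
import java.util.List;

// Hypothetical stand-in for the per-file metadata stored in an Iceberg manifest.
final class ManifestEntry {
  final String path;
  final long fileSizeInBytes;

  ManifestEntry(String path, long fileSizeInBytes) {
    this.path = path;
    this.fileSizeInBytes = fileSizeInBytes;
  }
}

// Hypothetical descriptor. An empty host list means "no block locations known",
// so a scheduler would have to treat every read of this file as remote.
final class SimpleFileDescriptor {
  final String path;
  final long length;
  final List<String> blockHosts;

  SimpleFileDescriptor(String path, long length, List<String> blockHosts) {
    this.path = path;
    this.length = length;
    this.blockHosts = blockHosts;
  }

  // Builds the descriptor purely from manifest data; no storage RPC is issued.
  static SimpleFileDescriptor fromManifest(ManifestEntry entry) {
    return new SimpleFileDescriptor(entry.path, entry.fileSizeInBytes, List.of());
  }
}
```
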
Is there a good indicator that Impala is not co-located with the configured 
filesystems? E.g. whether the data cache is enabled? But of course there could 
be multiple filesystems configured, some of them co-located, some not.
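
One possible shape for a per-filesystem decision, as a sketch only: it assumes 
the operator lists the co-located filesystem authorities explicitly (a 
hypothetical setting, not an existing Impala flag) rather than inferring 
co-location from the data cache. BlockLocationPolicy and its method are 
invented names for illustration.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; not part of Impala.
final class BlockLocationPolicy {
  // URI schemes of the filesystems mentioned above that can report block
  // locations: HDFS ("hdfs") and Ozone ("ofs", "o3fs").
  private static final Set<String> LOCATION_AWARE_SCHEMES = Set.of("hdfs", "ofs", "o3fs");

  private BlockLocationPolicy() {}

  // Returns true only if the filesystem can report block locations AND its
  // authority (host:port or service id) is in the operator-supplied set of
  // co-located filesystems. Everything else skips the block location lookup.
  static boolean shouldLoadBlockLocations(URI tableLocation, Set<String> colocatedAuthorities) {
    String scheme = tableLocation.getScheme();
    if (scheme == null
        || !LOCATION_AWARE_SCHEMES.contains(scheme.toLowerCase(Locale.ROOT))) {
      return false;
    }
    String authority = tableLocation.getAuthority();
    return authority != null && colocatedAuthorities.contains(authority);
  }
}
```

With colocatedAuthorities = {"nn1:8020"}, a table under hdfs://nn1:8020/ would 
still get block locations, while tables under hdfs://nn2:8020/ or s3a://bucket/ 
would not, which covers the mixed case of several configured filesystems.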



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
