[ https://issues.apache.org/jira/browse/HIVE-16999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bing Li reassigned HIVE-16999:
------------------------------

    Assignee: Bing Li

> Performance bottleneck in the ADD FILE/ARCHIVE commands for an HDFS resource
> ----------------------------------------------------------------------------
>
>                 Key: HIVE-16999
>                 URL: https://issues.apache.org/jira/browse/HIVE-16999
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>            Reporter: Sailee Jain
>            Assignee: Bing Li
>            Priority: Critical
>
> A performance bottleneck was found when adding a resource that resides on HDFS to the distributed cache.
> The commands used are:
> {code:java}
> 1. ADD ARCHIVE "hdfs://some_dir/archive.tar"
> 2. ADD FILE "hdfs://some_dir/file.txt"
> {code}
> Here is the log corresponding to the ADD ARCHIVE operation:
> {noformat}
> converting to local hdfs://some_dir/archive.tar
> Added resources: [hdfs://some_dir/archive.tar]
> {noformat}
> Hive downloads the resource to the local filesystem (shown in the log by "converting to local").
> {color:#d04437}Ideally there is no need to bring the file to the local filesystem when this operation is simply copying the file from one HDFS location to another HDFS location [the distributed cache].{color}
> This adds a significant performance overhead when the resource is a large file and many commands need the same resource.
> After debugging, the impacted piece of code was found to be:
> {code:java}
> public List<String> add_resources(ResourceType t, Collection<String> values, boolean convertToUnix)
>     throws RuntimeException {
>   Set<String> resourceSet = resourceMaps.getResourceSet(t);
>   Map<String, Set<String>> resourcePathMap = resourceMaps.getResourcePathMap(t);
>   Map<String, Set<String>> reverseResourcePathMap = resourceMaps.getReverseResourcePathMap(t);
>   List<String> localized = new ArrayList<String>();
>   try {
>     for (String value : values) {
>       String key;
>       // get the local path of the downloaded jars
>       List<URI> downloadedURLs = resolveAndDownload(t, value, convertToUnix);
>       ...
> {code}
> {code:java}
> List<URI> resolveAndDownload(ResourceType t, String value, boolean convertToUnix)
>     throws URISyntaxException, IOException {
>   URI uri = createURI(value);
>   if (getURLType(value).equals("file")) {
>     return Arrays.asList(uri);
>   } else if (getURLType(value).equals("ivy")) {
>     return dependencyResolver.downloadDependencies(uri);
>   } else { // goes here for HDFS
>     // When the resource is not local, it is downloaded to the local machine.
>     return Arrays.asList(createURI(downloadResource(value, convertToUnix)));
>   }
> }
> {code}
> Here, resolveAndDownload() always calls the downloadResource() API when the resource is on an external filesystem. It should take into account that when the resource is already on the same HDFS, bringing it to the local machine is an unnecessary step and can be skipped for better performance.
> Thanks,
> Sailee



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
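
For illustration only, here is a minimal sketch of the kind of check the reporter is suggesting: compare the resource URI against the session's default filesystem before deciding to download. The class and method names (ResourceFsCheck, sameFileSystem) and the way a Hadoop Configuration is passed in are assumptions for this sketch, not part of the Hive code base or of any committed fix.

{code:java}
// Illustrative sketch, not the Hive implementation: a standalone helper that
// tells whether a resource URI already lives on the default filesystem
// (e.g. the cluster's HDFS) of the given Hadoop Configuration.
import java.io.IOException;
import java.net.URI;
import java.util.Objects;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public final class ResourceFsCheck {

  private ResourceFsCheck() {
  }

  /**
   * Returns true when the resource URI points at the same filesystem as the
   * configuration's default FS, comparing scheme and authority. In that case
   * the local download step could in principle be skipped and the remote URI
   * handed to the distributed cache directly.
   */
  public static boolean sameFileSystem(URI resourceUri, Configuration conf) throws IOException {
    URI defaultFs = FileSystem.get(conf).getUri();
    return Objects.equals(resourceUri.getScheme(), defaultFs.getScheme())
        && Objects.equals(resourceUri.getAuthority(), defaultFs.getAuthority());
  }
}
{code}

If resolveAndDownload() consulted such a check in its final else branch, it could return the original hdfs:// URI instead of calling downloadResource(). Downstream code that currently assumes a localized path would also need to handle remote URIs, which is why this is only a sketch of the idea rather than a complete fix.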