Improve Hive Query split calculation performance over large partitions (potentially 5x speedup) -----------------------------------------------------------------------------------------------
Key: HIVE-2088
URL: https://issues.apache.org/jira/browse/HIVE-2088
Project: Hive
Issue Type: Improvement
Reporter: Vaibhav Aggarwal

While working on improving start-up time for queries spanning a large number of partitions, I found a particular inefficiency in CombineHiveInputFormat. For each input partition we create a CombineFilter to make sure that files are combined only within that partition.

CombineHiveInputFormat:

    Path filterPath = path;
    if (!poolSet.contains(filterPath)) {
      LOG.info("CombineHiveInputSplit creating pool for " + path
          + "; using filter path " + filterPath);
      combine.createPool(job, new CombineFilter(filterPath));
      poolSet.add(filterPath);
    }

These filters are then passed to CombineFileInputFormat along with all the input paths. CombineFileInputFormat computes a list of all the files in the input paths. It then loops over each filter and checks whether a particular file belongs to that filter.

CombineFileInputFormat:

    for (MultiPathFilter onepool : pools) {
      ArrayList<Path> myPaths = new ArrayList<Path>();
      // pick one input path. If it matches all the filters in a pool,
      // add it to the output set
      for (int i = 0; i < paths.length; i++) {
        ...

This results in O(p*f) work, where p is the number of partitions and f is the total number of files across all partitions. For 10,000 partitions with 10 files each (f = 100,000), this comes to 1,000,000,000 iterations.

We can replace this code by processing splits for one input path at a time, without passing any filters at all. Do you happen to see a case where this approach will not work?
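To illustrate the difference, here is a minimal, self-contained sketch (plain Java, not actual Hive/Hadoop code; the class and method names are hypothetical). The first method mirrors the current pattern of testing every file against every partition's filter; the second derives each file's partition in a single pass, which is the effect of handling one input path at a time.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch contrasting the two grouping strategies.
public class SplitGrouping {

    // O(p * f): for every partition "pool", scan the whole file list and
    // test each file against that pool's filter -- analogous to the loop
    // in CombineFileInputFormat quoted above.
    static Map<String, List<String>> groupWithFilters(List<String> partitions,
                                                      List<String> files) {
        Map<String, List<String>> groups = new HashMap<>();
        for (String partition : partitions) {            // p iterations
            List<String> matched = new ArrayList<>();
            for (String file : files) {                  // f iterations each
                if (file.startsWith(partition + "/")) {
                    matched.add(file);
                }
            }
            groups.put(partition, matched);
        }
        return groups;
    }

    // O(f): a single pass that maps each file to its parent directory,
    // so no per-partition filters are needed at all.
    static Map<String, List<String>> groupByParent(List<String> files) {
        Map<String, List<String>> groups = new HashMap<>();
        for (String file : files) {
            String parent = file.substring(0, file.lastIndexOf('/'));
            groups.computeIfAbsent(parent, k -> new ArrayList<>()).add(file);
        }
        return groups;
    }
}
```

Both methods produce the same partition-to-files grouping, but the second avoids the p*f cross product entirely, which is where the large start-up cost comes from.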