[ https://issues.apache.org/jira/browse/HIVE-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13133720#comment-13133720 ]
Ashutosh Chauhan commented on HIVE-2088:
----------------------------------------

@Vaibhav, this looks like it could be very useful if we can make it work. If you get a chance to experiment with it, it would be good to post your observations here.

> Improve Hive Query split calculation performance over large partitions
> (potentially 5x speedup)
> -----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2088
>                 URL: https://issues.apache.org/jira/browse/HIVE-2088
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Vaibhav Aggarwal
>
> While working on improving start-up time for queries spanning a large number
> of partitions, I found a particular inefficiency in CombineHiveInputFormat.
> It seems that for each input partition we create a CombineFilter to make sure
> that files are combined within a particular partition.
> CombineHiveInputFormat:
>     Path filterPath = path;
>     if (!poolSet.contains(filterPath)) {
>       LOG.info("CombineHiveInputSplit creating pool for " + path +
>           "; using filter path " + filterPath);
>       combine.createPool(job, new CombineFilter(filterPath));
>       poolSet.add(filterPath);
>     }
> These filters are then passed to CombineFileInputFormat along with all
> the input paths.
> CombineFileInputFormat computes a list of all the files in the input paths.
> It then loops over each filter and checks whether a particular file
> belongs to a particular filter.
> CombineFileInputFormat:
>     for (MultiPathFilter onepool : pools) {
>       ArrayList<Path> myPaths = new ArrayList<Path>();
>
>       // pick one input path. If it matches all the filters in a pool,
>       // add it to the output set
>       for (int i = 0; i < paths.length; i++) {
>       ...
> This results in computation of the order O(p*f), where p is the number of
> partitions and f is the total number of files in all partitions.
> For a case of 10,000 partitions with 10 files in each partition, this results
> in 1,000,000,000 iterations.
> We can replace this code with processing splits for one input path at a time
> without passing any filters at all.
> Do you happen to see a case where this approach will not work?
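
For illustration, the cost described in the issue and the proposed alternative can be modelled in a few lines of plain Java. This is only a sketch under stated assumptions, not the Hive or Hadoop code: paths are plain strings, each pool is a single partition directory, and the names SplitGroupingSketch, groupWithFilters, groupByParent and filterChecks are hypothetical. With p = 10,000 partitions and f = 10,000 * 10 = 100,000 files, the filter-style loop makes p * f = 1,000,000,000 checks, while grouping by parent directory visits each file once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model: paths are plain strings and each "pool"
// is one partition directory. This is not the Hive or Hadoop implementation.
final class SplitGroupingSketch {

    // Filter-style grouping: every pool scans every file, so the work is p * f.
    static Map<String, List<String>> groupWithFilters(List<String> partitionDirs,
                                                      List<String> files) {
        Map<String, List<String>> pools = new LinkedHashMap<>();
        for (String dir : partitionDirs) {
            List<String> matched = new ArrayList<>();
            for (String file : files) {
                // Append "/" so that "/t/ds=1" does not also match "/t/ds=10/...".
                if (file.startsWith(dir + "/")) {
                    matched.add(file);
                }
            }
            pools.put(dir, matched);
        }
        return pools;
    }

    // Per-path grouping: a single pass over the files, keyed by parent directory.
    // Assumes every file path contains at least one '/'.
    static Map<String, List<String>> groupByParent(List<String> files) {
        Map<String, List<String>> pools = new LinkedHashMap<>();
        for (String file : files) {
            String parent = file.substring(0, file.lastIndexOf('/'));
            pools.computeIfAbsent(parent, k -> new ArrayList<>()).add(file);
        }
        return pools;
    }

    // Number of (pool, file) checks performed by groupWithFilters.
    static long filterChecks(int partitions, int filesPerPartition) {
        long totalFiles = (long) partitions * filesPerPartition;
        return partitions * totalFiles;
    }

    public static void main(String[] args) {
        List<String> dirs = Arrays.asList("/t/ds=1", "/t/ds=2");
        List<String> files = Arrays.asList("/t/ds=1/a", "/t/ds=1/b", "/t/ds=2/c");
        // Both strategies produce the same pools for this flat layout.
        System.out.println(groupWithFilters(dirs, files).equals(groupByParent(files))); // true
        System.out.println(filterChecks(10_000, 10)); // 1000000000
    }
}
```

The two groupings agree here only because every file sits directly inside a non-empty partition directory. With nested subdirectories, or with filters that match something other than the immediate parent, the results could differ, which is the kind of case the question above asks about.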