[ https://issues.apache.org/jira/browse/HIVE-15879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15882294#comment-15882294 ]
Vihang Karajgaonkar commented on HIVE-15879:
--------------------------------------------

Thanks [~rajesh.balamohan] for taking the time to look into this.

bq. I agree that ThreadPoolExecutor.getActiveCount() is approximate. It is approximate because, by the time getActiveCount() iterates over the running threads in the worker list, it is possible that some of the threads which were executing are complete. http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/util/concurrent/ThreadPoolExecutor.java#l1818. So the reported numbers could be slightly higher than the actually running threads. But it would never be less, as a new Worker in ThreadPoolExecutor is added with mainLock.

Isn't this an implementation detail of ThreadPoolExecutor? Should we rely on behavior which can change in future versions of Java? Even if we agree that it is acceptable to rely on, iterative parallel BFS is still much faster than the current implementation. To demonstrate this, I created a benchmark comparing the performance of msck before and after the patch.

One difficulty in benchmarking is that the minimum pool size is set to {{Runtime.getRuntime().availableProcessors() * 2}}, which on my machine is 20, while I wanted to test the existing implementation with a pool size less than or equal to the nesting depth of the directories. To make this easier to test, I commented out {{poolSize = poolSize == 0 ? poolSize : Math.max(poolSize, Runtime.getRuntime().availableProcessors() * 2);}} and relied only on the config value {{hive.mv.files.thread}} for the pool size. This is only to simplify testing; the problem is reproducible whenever the nesting depth is greater than the pool size.

Here are the results:

Table specifications: a 3-level nested directory structure where each directory contains 5 sub-directories, i.e. 5*5*5 = 125 partitions on S3
Table schema: msck_test(col1 string) partitioned by (part1 string, part2 string, part3 string)

I ran {{msck repair table msck_test}} and noted the execution times.

Current implementation:
||Pool size||0||2||3||4||8||15 (default)||
||Time taken (sec)|446|1161|855|636|375|217|

With the patch:
||Pool size||0||2||4||8||15 (default)||
||Time taken (sec)|470|16|11|9|9|

So with the patch even a pool size of 2 beats the current implementation at its default pool size of 15. With the default configs this patch improves performance by ~22x. The reason is that the iterative implementation uses the thread pool optimally, since no thread is ever blocked waiting on another thread. In the recursive implementation, one thread blocks on other threads while working on sub-directories; with the additional overhead of locking, there is actually a performance degradation compared to even the single-threaded implementation when the nesting depth exceeds the pool size.
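For illustration, here is a minimal sketch of the iterative parallel BFS idea. This is not the actual patch (which works against Hive's FileSystem/Path APIs); the class and method names are made up for the example and it walks a plain {{java.io.File}} tree:

{noformat}
import java.io.File;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

public class ParallelBfsSketch {

  // Walks the directory tree one level at a time. Each level's listings run
  // in parallel on the pool; only the caller thread ever waits on a Future,
  // so pooled threads never block on each other and the pool cannot deadlock
  // no matter how deep the partition tree is.
  public static List<File> listLeafDirs(File root, ExecutorService pool)
      throws InterruptedException, ExecutionException {
    List<File> leaves = new ArrayList<>();
    Queue<File> level = new ArrayDeque<>();
    level.add(root);
    while (!level.isEmpty()) {
      // submit one listing task per directory in the current level
      List<Future<File[]>> futures = new ArrayList<>();
      for (File dir : level) {
        futures.add(pool.submit(() -> dir.listFiles(File::isDirectory)));
      }
      Queue<File> next = new ArrayDeque<>();
      Iterator<File> dirs = level.iterator();
      for (Future<File[]> f : futures) {
        File parent = dirs.next();
        File[] children = f.get(); // joined on the caller thread only
        if (children == null || children.length == 0) {
          leaves.add(parent);      // no sub-directories: a partition leaf
        } else {
          next.addAll(Arrays.asList(children));
        }
      }
      level = next;
    }
    return leaves;
  }
}
{noformat}

A worker here only ever lists a single directory and returns; all the waiting happens on the caller thread, so the depth of the tree can never exhaust the pool, and the breadth of each level (5, then 25, then 125 directories in the benchmark above) is what keeps every pool thread busy.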
I am open to suggestions, but in my opinion this is still a good approach unless there is something I am missing completely. Please let me know what you think. Thanks!

> Fix HiveMetaStoreChecker.checkPartitionDirs method
> --------------------------------------------------
>
>                 Key: HIVE-15879
>                 URL: https://issues.apache.org/jira/browse/HIVE-15879
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Vihang Karajgaonkar
>            Assignee: Vihang Karajgaonkar
>         Attachments: HIVE-15879.01.patch
>
>
> HIVE-15803 fixes the msck hang issue in HiveMetaStoreChecker.checkPartitionDirs method by adding a check to see if the Threadpool has any spare threads. If not, it falls back to single-threaded listing of the files.
> {noformat}
> if (pool != null) {
>   synchronized (pool) {
>     // In case of recursive calls, it is possible to deadlock with TP. Check TP usage here.
>     if (pool.getActiveCount() < pool.getMaximumPoolSize()) {
>       useThreadPool = true;
>     }
>     if (!useThreadPool) {
>       if (LOG.isDebugEnabled()) {
>         LOG.debug("Not using threadPool as active count:" + pool.getActiveCount()
>             + ", max:" + pool.getMaximumPoolSize());
>       }
>     }
>   }
> }
> {noformat}
> Based on the javadoc of getActiveCount() below
> bq. Returns the approximate number of threads that are actively executing tasks.
> it returns only an approximate number of threads, and it cannot be guaranteed to always return the exact number of active threads. This still exposes the method to the msck hang bug in rare corner cases.
> We could either:
> 1. Use an atomic counter to track exactly how many threads are actively running, or
> 2. Rework the method itself to make it much simpler, e.g. look into the possibility of changing the recursive implementation to an iterative implementation where worker threads pick tasks from a queue until the queue is empty.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)