[ https://issues.apache.org/jira/browse/HIVE-15879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15882294#comment-15882294 ]

Vihang Karajgaonkar commented on HIVE-15879:
--------------------------------------------

Thanks [~rajesh.balamohan] for taking the time to look into this.

bq. I agree that ThreadPoolExecutor.getActiveCount() is approximate. It is 
approximate because, by the time getActiveCount() iterates over the running 
threads in the worker list, it is possible that some of the threads which were 
executing are complete. 
http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/util/concurrent/ThreadPoolExecutor.java#l1818.
 So the reported numbers could be slightly higher than the actually running 
threads. But it would never be less, as new Worker in ThreadPoolExecutor is 
added with mainLock.

Isn't this an implementation detail of ThreadPoolExecutor? Should we rely on 
implementation behavior that can change in future versions of Java?

Even if we agree that this is still okay to rely on, an iterative parallel BFS 
is still much faster than the current implementation. To demonstrate this, I 
created a benchmark to compare the performance of msck before and after the 
patch. One issue with benchmarking is that the minimum pool size is determined 
by {{Runtime.getRuntime().availableProcessors() * 2}}, which on my machine is 
20. I wanted to test the performance of the existing implementation when the 
pool size is <= the nesting level of the directories. To make this easier to 
test I commented out {{poolSize = poolSize == 0 ? poolSize : 
Math.max(poolSize, Runtime.getRuntime().availableProcessors() * 2);}} and 
relied only on the config value {{hive.mv.files.thread}} for the pool size. 
This is just to make testing easier; I am confident the problem is 
reproducible whenever the nesting level is greater than the pool size.
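For reference, here is a rough sketch of the pool sizing logic described above. The surrounding method context is paraphrased and {{createPool}} is a hypothetical name, not the exact HiveMetaStoreChecker code:

{noformat}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.hadoop.conf.Configuration;

// Paraphrased sketch, not the exact HiveMetaStoreChecker code.
static ExecutorService createPool(Configuration conf) {
  int poolSize = conf.getInt("hive.mv.files.thread", 15);
  // The line below clamps a non-zero pool size to at least 2x the core count.
  // This is the line I commented out for the benchmark so that small pool
  // sizes could be tested directly from the config value:
  poolSize = poolSize == 0 ? poolSize
      : Math.max(poolSize, Runtime.getRuntime().availableProcessors() * 2);
  // poolSize == 0 keeps the single-threaded code path (no pool at all).
  return poolSize > 0 ? Executors.newFixedThreadPool(poolSize) : null;
}
{noformat}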

Here are the results:
Table specification: a 3-level nested directory structure where each directory 
contains 5 sub-directories, so 5*5*5 = 125 partitions on S3.
Table schema: msck_test(col1 string) partitioned by (part1 string, part2 
string, part3 string)
Ran {{msck repair table msck_test}} and noted the execution times.

Current implementation:
||Pool size || 0 || 2 || 3  || 4 || 8 || 15 (default)||
|| Time taken (sec) |446|1161|855|636|375|217|

With the patch:
||Pool size || 0 || 2 || 4 || 8 || 15 (default)||
|| Time taken (sec) |470|16|11|9|9|

So with the patch, even a pool size of 2 outperforms the current 
implementation at its default pool size of 15. With default configs this patch 
improves performance by ~22x. The reason is that the iterative implementation 
uses the thread pool optimally, since no thread is ever blocked on another 
thread. In the recursive implementation, one thread blocks on other threads 
while they work on its sub-directories. With the additional overhead of 
locking, you can see that there is actually a performance degradation, 
compared even to the single-threaded implementation, when the nesting level 
exceeds the pool size.
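To make the difference concrete, below is a minimal, self-contained sketch of 
a level-by-level parallel BFS of the kind described above. This is only an 
illustration, not the actual patch: {{IterativeBfsSketch}}, {{findLeafDirs}} 
and {{listSubDirs}} are hypothetical names, and {{listSubDirs}} stands in for 
the real FileSystem listing call.

{noformat}
import java.util.*;
import java.util.concurrent.*;

public class IterativeBfsSketch {
  // Hypothetical stand-in for the real FileSystem directory listing.
  static List<String> listSubDirs(String dir) {
    return Collections.emptyList();
  }

  static Set<String> findLeafDirs(String root, int poolSize)
      throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(poolSize);
    Set<String> leaves = ConcurrentHashMap.newKeySet();
    List<String> level = Collections.singletonList(root);
    // Walk the tree one level at a time. Workers only list a directory and
    // return its children; they never wait on other workers, so the pool
    // cannot deadlock regardless of nesting depth vs. pool size.
    while (!level.isEmpty()) {
      List<Future<List<String>>> futures = new ArrayList<>();
      for (String dir : level) {
        futures.add(pool.submit(() -> {
          List<String> subs = listSubDirs(dir);
          if (subs.isEmpty()) {
            leaves.add(dir); // leaf directory == partition location
          }
          return subs;
        }));
      }
      List<String> next = new ArrayList<>();
      for (Future<List<String>> f : futures) {
        next.addAll(f.get()); // only this coordinator thread blocks, never a worker
      }
      level = next;
    }
    pool.shutdown();
    return leaves;
  }
}
{noformat}

The key property is that the coordinator alone blocks on futures, so the pool 
stays fully utilized as long as there is work in the current level, which 
matches the numbers above.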

I am open to suggestions, but in my opinion this is still a good approach 
unless there is something I am missing completely. Please let me know what you 
think. Thanks!

> Fix HiveMetaStoreChecker.checkPartitionDirs method
> --------------------------------------------------
>
>                 Key: HIVE-15879
>                 URL: https://issues.apache.org/jira/browse/HIVE-15879
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Vihang Karajgaonkar
>            Assignee: Vihang Karajgaonkar
>         Attachments: HIVE-15879.01.patch
>
>
> HIVE-15803 fixes the msck hang issue in the 
> HiveMetaStoreChecker.checkPartitionDirs method by adding a check to see if 
> the thread pool has any spare threads. If not, it uses single-threaded 
> listing of the files.
> {noformat}
>     if (pool != null) {
>       synchronized (pool) {
>         // In case of recursive calls, it is possible to deadlock with TP. Check TP usage here.
>         if (pool.getActiveCount() < pool.getMaximumPoolSize()) {
>           useThreadPool = true;
>         }
>         if (!useThreadPool) {
>           if (LOG.isDebugEnabled()) {
>             LOG.debug("Not using threadPool as active count:" + pool.getActiveCount()
>                 + ", max:" + pool.getMaximumPoolSize());
>           }
>         }
>       }
>     }
> {noformat}
> Based on the javadoc of getActiveCount() below 
> bq. Returns the approximate number of threads that are actively executing 
> tasks.
> it returns only an approximate number of threads, and it cannot be 
> guaranteed to always return the exact number of active threads. This still 
> exposes the method to the msck hang bug in rare corner cases.
> We could either:
> 1. Use an atomic counter to track exactly how many threads are actively 
> running, or
> 2. Re-examine the method itself to make it much simpler, e.g. look into 
> the possibility of changing the recursive implementation to an iterative 
> one where worker threads pick tasks from a queue until the queue is empty.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
