Hi All,

I'm trying to track down the root cause of some terrible insert performance I'm seeing on Hive 0.8.1 in Amazon EMR. When I run an identical insert with a static partition I get around 15 times the throughput (across the cluster) that I do when I use dynamic partitions. Very few partitions are generated during the insert, definitely fewer than 15 (the table is partitioned by a string of the form 'YYYY-MM').
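For reference, the two inserts I'm comparing are roughly this shape (table and column names below are placeholders, not my real schema):

    -- static partition version
    INSERT OVERWRITE TABLE events PARTITION (month = '2012-04')
    SELECT col_a, col_b FROM events_staging WHERE month_str = '2012-04';

    -- dynamic partition version
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;
    INSERT OVERWRITE TABLE events PARTITION (month)
    SELECT col_a, col_b, month_str AS month FROM events_staging;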
During my experimentation I found that turning on the "hive.task.progress" configuration variable made the static partition version very slow as well. This seems to explain the dynamic partition performance, because dynamic partition inserts force it on in SemanticAnalyzer.java:

    // turn on hive.task.progress to update # of partitions created to the JT
    HiveConf.setBoolVar(conf, HiveConf.ConfVars.HIVEJOBPROGRESS, true);

Does anyone on the list know why this has to be forced on? I can't find where in the code this '# of partitions' is generated, sent to the JT, or even read by any other part of the Hive code. I'm also not 100% sure why this would cause this type of performance problem. Might it be System.currentTimeMillis() being called at least 30 times for each record because of preProcessCounter() and postProcessCounter()?

    /**
     * this is called after operator process to buffer some counters.
     */
    private void postProcessCounter() {
      if (counterNameToEnum != null) {
        totalTime += (System.currentTimeMillis() - beginTime);
      }
    }

Thanks,
Shaun