Hi,

I'm a bit stuck optimizing the Hive/Tez config parameters to speed up Hive query execution on Tez.

The cluster consists of 6 worker nodes (the hardware proportions are not ideal for Hadoop, but that's a given), each with 48 cores, 384 GB RAM and 10 HDDs.

The Hive table is configured as:
- partitioned by day
- 12 buckets (bucketed on a smallint column)
- transactional=true
- Snappy-compressed ORC format

It contains about 200 TB of data. Every 5 minutes, newly arrived data (if any) is inserted, which of course can lead to a high number of delta files.

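To put a number on the delta-file concern, here is a rough back-of-the-envelope sketch. It assumes the worst case: every 5-minute insert writes to the same day partition, touches all 12 buckets, and no compaction has run yet. The function name is made up for illustration; the real count depends on how much data actually arrives and on the compactor.

```python
# Rough upper bound on uncompacted delta files in one day partition.
# Assumptions (not measured): every insert hits the same partition,
# writes one file per bucket, and no compaction has run.
MINUTES_PER_DAY = 24 * 60


def max_delta_files_per_day(insert_interval_min: int, buckets: int) -> int:
    """Worst-case number of delta bucket files created per day."""
    inserts_per_day = MINUTES_PER_DAY // insert_interval_min
    return inserts_per_day * buckets


if __name__ == "__main__":
    # 5-minute ingestion interval, 12 buckets (values from this post)
    print(max_delta_files_per_day(5, 12))
```

Under these assumptions a single day could hold several thousand small files, which is why ingestion batching and compaction look like the first things to check.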
A query like "select name,count(id) from table where date='2015-01-01' or date='2015-01-02' group by (name)" takes almost forever and has to be cancelled after ~30 min. Of course, Hive will never be a performance beast, but by executing with Tez I had hoped to get much better performance.

Some current settings:

yarn.nodemanager.resource.memory-mb : 304640
yarn.scheduler.minimum-allocation-mb : 15360
mapreduce.map.memory.mb : 20480
mapreduce.reduce.memory.mb : 25600
mapreduce.map.java.opts : -Xmx12288m
mapreduce.reduce.java.opts : -Xmx15360m

set hive.execution.engine=tez;
set tez.queue.name=highresourcequeue;
set tez.am.grouping.min-size=268435456;
set hive.exec.reducers.max=6;
set mapreduce.job.reduces=6;

My thoughts are:
- improve the data ingestion to reduce the number of delta files and thereby the number of mappers required
- improve the settings for automatic compaction to further reduce the number of files and, in turn, the number of mappers
- the YARN config should be o.k., see the properties above

What are the main Tez/Hive properties to check or adjust that could improve performance in the given environment?

Many thanks in advance,
G.
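As a sanity check on the YARN numbers above, here is a small sketch of how many task containers fit on one node. It rests on two assumptions that should be verified for the cluster in question: that the scheduler rounds each memory request up to a multiple of yarn.scheduler.minimum-allocation-mb, and that the Tez task containers end up with the mapreduce.map.memory.mb size (the fallback when hive.tez.container.size is not set). The helper names are invented for illustration.

```python
import math

# Values taken from the settings listed in this post (all in MB).
NODE_MB = 304640       # yarn.nodemanager.resource.memory-mb
MIN_ALLOC_MB = 15360   # yarn.scheduler.minimum-allocation-mb
MAP_MB = 20480         # mapreduce.map.memory.mb
WORKER_NODES = 6


def normalized_container_mb(request_mb: int, min_alloc_mb: int) -> int:
    """Round a request up to the next multiple of the minimum allocation.

    Assumption: the scheduler normalizes requests this way.
    """
    return math.ceil(request_mb / min_alloc_mb) * min_alloc_mb


def containers_per_node(node_mb: int, request_mb: int, min_alloc_mb: int) -> int:
    """Number of normalized containers that fit into one NodeManager."""
    return node_mb // normalized_container_mb(request_mb, min_alloc_mb)


if __name__ == "__main__":
    size = normalized_container_mb(MAP_MB, MIN_ALLOC_MB)
    per_node = containers_per_node(NODE_MB, MAP_MB, MIN_ALLOC_MB)
    print(size, per_node, per_node * WORKER_NODES)
```

If those assumptions hold, each 20480 MB request occupies 30720 MB, so only 9 tasks run concurrently per node on 48 cores, and the YARN config may deserve a second look after all.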