>> A query like "select name,count(id) from table where date='2015-01-01' or date='2015-01-02' group by (name)" takes almost forever and needs to be cancelled after ~30min.

Ideally, it should have scanned only the 2 partitions. Do you see any container launches before you have to kill the job? Or is the split computation itself taking most of the time?

~Rajesh.B

On Wed, Feb 25, 2015 at 1:35 PM, Gerd König <koenig.boden...@googlemail.com> wrote:

> Hi,
>
> I'm a bit stuck in optimizing the hive/tez config parameters to speed up
> Hive/Tez query execution.
> The cluster consists of 6 worker nodes (with rather hadoop-non-ideal
> component proportion, but that's given) including: 48 cores / 384 GB RAM /
> 10 HDDs.
> The Hive table is configured as:
> - partitioned by day
> - 12 buckets (bucketed on a smallint column)
> - transactional=true
> - snappy compressed ORC format
> and it contains about 200TB of data.
> Every 5 minutes newly arrived data will be inserted (if any); this, of
> course, leads to a potentially high number of delta-files.
>
> A query like "select name,count(id) from table where date='2015-01-01' or
> date='2015-01-02' group by (name)" takes almost forever and needs to be
> cancelled after ~30min.
>
> Of course, Hive will never be a performance beast, but by executing with
> Tez I hoped to get much better performance...
>
> Some current settings:
> yarn.nodemanager.resource.memory-mb : 304640
> yarn.scheduler.minimum-allocation-mb : 15360
> mapreduce.map.memory.mb : 20480
> mapreduce.reduce.memory.mb : 25600
> mapreduce.map.java.opts : -Xmx12288m
> mapreduce.reduce.java.opts : -Xmx15360m
> set hive.execution.engine=tez;
> set tez.queue.name=highresourcequeue;
> set tez.am.grouping.min-size=268435456;
> set hive.exec.reducers.max=6;
> set mapreduce.job.reduces=6;
>
> My thoughts are:
> - improve the data ingestion to reduce the number of delta-files and
> thereby reduce the number of mappers being required
> - improve the settings for the automatic compaction to further reduce the
> number of files and the number of mappers, respectively
> - YARN config should be o.k., see properties above
>
> What are the main Tez/Hive properties to check/adjust that could improve
> the performance in the given environment?
>
> Many thanks in advance, G.
>

--
~Rajesh.B
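
As a rough cross-check of the memory settings quoted above, the sketch below estimates how many task containers could run at once. It rests on two assumptions that the thread does not confirm: that YARN rounds each request up to the next multiple of yarn.scheduler.minimum-allocation-mb, and that Tez task containers are sized like mappers (mapreduce.map.memory.mb), since no hive.tez.container.size is listed. The application master container is ignored, and the function names are made up for illustration.

```python
import math

# Values quoted from the settings in the message above (all in MB).
NODE_MEMORY_MB = 304640     # yarn.nodemanager.resource.memory-mb
MIN_ALLOCATION_MB = 15360   # yarn.scheduler.minimum-allocation-mb
TASK_REQUEST_MB = 20480     # mapreduce.map.memory.mb (assumed Tez task size)
WORKER_NODES = 6


def rounded_request_mb(request_mb, min_allocation_mb):
    """Round a request up to the next multiple of the minimum allocation.

    Assumption: the scheduler normalizes requests this way.
    """
    return math.ceil(request_mb / min_allocation_mb) * min_allocation_mb


def containers_per_node(node_memory_mb, container_mb):
    """Whole containers of the given size that fit into one node's YARN memory."""
    return node_memory_mb // container_mb


granted_mb = rounded_request_mb(TASK_REQUEST_MB, MIN_ALLOCATION_MB)
per_node = containers_per_node(NODE_MEMORY_MB, granted_mb)
cluster_wide = per_node * WORKER_NODES

print(granted_mb, per_node, cluster_wide)  # prints: 30720 9 54
```

If these assumptions hold, each node would run at most 9 tasks at a time despite having 48 cores, so the container sizing may be worth revisiting alongside the delta-file count.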