Hi Rajesh, thanks for your quick response. After quitting the job, no further containers are being launched. Unfortunately I have no execution plan (EXPLAIN output) to dive into that execution in detail.
Do you have general recommendations for Tez/Hive parameters that influence the processing of TBs of data on a small number of worker nodes (== a small number of mappers in parallel)? What should be checked in any case? Are my initial thoughts of reducing the number of files in order to reduce the number of mappers going in the right direction?

thanks, G.

Rajesh Balamohan <rajesh.balamo...@gmail.com> wrote on Wed Feb 25 2015 at 11:45:07 AM:

>> A query like "select name,count(id) from table where date='2015-01-01' or
>> date='2015-01-02' group by (name)" takes almost forever and needs to be
>> cancelled after ~30min.
>
> It should have ideally scanned only the 2 partitions. Do you see any
> container launches after which you had to kill the job? Or is the split
> computation itself taking more time?
>
> ~Rajesh.B
>
> On Wed, Feb 25, 2015 at 1:35 PM, Gerd König <koenig.boden...@googlemail.com> wrote:
>
>> Hi,
>>
>> I'm a bit stuck in optimizing the hive/tez config parameters to speed up
>> Hive/Tez query execution.
>> The cluster consists of 6 worker nodes (with rather hadoop-non-ideal
>> component proportion, but that's given) including: 48Cores/384GB Ram/10HDDs.
>> The Hive table is configured as:
>> - partitioned by day
>> - 12 buckets (bucketed on a smallint column)
>> - transactional=true
>> - snappy compressed ORC format
>> and it contains about 200TB of data.
>> Every 5 minutes new arrived data will be inserted (if any), this, of
>> course, leads to a potential high number of delta-files.
>>
>> A query like "select name,count(id) from table where date='2015-01-01' or
>> date='2015-01-02' group by (name)" takes almost forever and needs to be
>> cancelled after ~30min.
>>
>> Of course, Hive will never be a performance beast, but by executing with
>> Tez I hoped to get much better performance...
>>
>> Some current settings:
>> yarn.nodemanager.resource.memory-mb : 304640
>> yarn.scheduler.minimum-allocation-mb : 15360
>> mapreduce.map.memory.mb : 20480
>> mapreduce.reduce.memory.mb : 25600
>> mapreduce.map.java.opts : -Xmx12288m
>> mapreduce.reduce.java.opts : -Xmx15360m
>> Set hive.execution.engine=tez;
>> set tez.queue.name=highresourcequeue;
>> set tez.am.grouping.min-size= 268435456;
>> set hive.exec.reducers.max=6;
>> set mapreduce.job.reduces=6;
>>
>>
>> My thoughts are:
>> - improve the data ingestion to reduce the number of delta-files and
>> thereby reduce the number of mappers being required
>> - improve the settings for the automatic compaction to further reduce the
>> number of files, no. of mappers resp.
>> - YARN config should be o.k., see properties above
>>
>> What are the main Tez/Hive properties to check/adjust that could improve
>> the performance in the given environment ?!?!
>>
>> Many thanks in advance, G.
>>
>
> --
> ~Rajesh.B
>
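
A rough back-of-the-envelope check of the parallelism that the YARN settings quoted above allow, meant as a sketch rather than a definitive answer. It assumes that YARN rounds each container request up to a multiple of yarn.scheduler.minimum-allocation-mb (the Capacity Scheduler's default behaviour), that the Tez task containers are sized from mapreduce.map.memory.mb (Hive's fallback when hive.tez.container.size is not set), and that memory is the only limit (vcores and the application master's container are ignored). The function names are purely illustrative.

```python
import math


def normalized_container_mb(requested_mb: int, min_alloc_mb: int) -> int:
    """Round a request up to the next multiple of the scheduler minimum.

    Assumption: the scheduler normalizes requests this way (Capacity
    Scheduler default); other schedulers may use a different increment.
    """
    return math.ceil(requested_mb / min_alloc_mb) * min_alloc_mb


def max_parallel_containers(node_mb: int, container_mb: int, nodes: int) -> int:
    """Memory-only upper bound on concurrently running containers."""
    return (node_mb // container_mb) * nodes


# Values taken from the settings quoted in the thread.
NODE_MB = 304640       # yarn.nodemanager.resource.memory-mb
MIN_ALLOC_MB = 15360   # yarn.scheduler.minimum-allocation-mb
MAP_MB = 20480         # mapreduce.map.memory.mb
WORKERS = 6            # worker nodes in the cluster

container_mb = normalized_container_mb(MAP_MB, MIN_ALLOC_MB)
parallel = max_parallel_containers(NODE_MB, container_mb, WORKERS)

print(f"effective container size: {container_mb} MB")
print(f"max containers in parallel: {parallel}")
```

Under these assumptions the 20480 MB request is rounded up to 30720 MB, so 9 containers fit on each node and 54 across the cluster. A request of exactly 15360 MB would fit 19 per node, so the container size is worth checking alongside the number of files.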