Hi Mapred Learn
Please find my replies inline.

> What are ways to reduce stress on our cluster for running many such big
> queries (include joins too) in parallel?

In some queries the generated MapReduce jobs can run in parallel; for that you need to set 'hive.exec.parallel' to 'true'.

> How to enable compression etc. for intermediate Hive output?

You can enable compression between MapReduce jobs using 'hive.exec.compress.intermediate'. Compression for the generated MapReduce jobs can be enabled with the following properties:

// final output compression
hive.exec.compress.output
mapred.output.compress
mapred.output.compression.type
mapred.output.compression.codec

// map output compression
mapred.compress.map.output
mapred.map.output.compression.codec

> How to make job cache does not go too high etc.?

Hive determines the number of mappers intelligently, but in some cases you need to specify a suitable number of reducers for your data set. If you have sufficient memory allocated for your child JVMs and the slots are properly configured, there is little chance of OOMs. Also, for processing large volumes of data you may need to increase the Hive server heap size, since the number of splits could be immensely large, and to avoid a resource crunch when multiple queries execute in parallel.

> In short, best practices for huge queries on Hive?

You can use Hive merge, if required, to avoid the small-files issue generated by queries. Beyond that, optimization depends entirely on what you use in your queries: you can apply join optimizations, group-by optimizations, etc. based on your queries.

Regards,
Bejoy KS

----- Original Message -----
From: MiaoMiao <[email protected]>
To: [email protected]
Cc:
Sent: Friday, September 21, 2012 8:10 AM
Subject: Re: How to run big queries in optimized way ?

Hive implements a format named RCFILE, which could give better performance, but in my project it just ties with the plain-text format.
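The properties above can be set per session in the Hive CLI. A minimal sketch follows; the codec (SnappyCodec), compression type (BLOCK), and reducer count (64) are illustrative assumptions, not recommendations from this thread:

```sql
-- Run independent MapReduce stages of a query in parallel
SET hive.exec.parallel = true;

-- Compress intermediate data passed between chained MR jobs
SET hive.exec.compress.intermediate = true;

-- Compress the final job output
SET hive.exec.compress.output = true;
SET mapred.output.compress = true;
SET mapred.output.compression.type = BLOCK;           -- assumed choice
SET mapred.output.compression.codec = org.apache.hadoop.io.compress.SnappyCodec;  -- assumed codec

-- Compress map output within a job
SET mapred.compress.map.output = true;
SET mapred.map.output.compression.codec = org.apache.hadoop.io.compress.SnappyCodec;  -- assumed codec

-- Explicitly set the number of reducers for a large data set (value is illustrative)
SET mapred.reduce.tasks = 64;
```

These can also be placed in hive-site.xml if they should apply cluster-wide rather than per session.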
Hive also has an index feature, but it is not so convenient or practical. I think the best way to optimize is still reusing the same source tables, avoiding sub-queries, and merging HiveQL queries as much as possible.

On Fri, Sep 21, 2012 at 10:30 AM, Mapred Learn <[email protected]> wrote:
> Hi,
> We have datasets which are about 10-15 TB in size.
>
> We want to run hive queries on top of this input data.
>
> What are ways to reduce stress on our cluster for running many such big
> queries (include joins too) in parallel?
> How to enable compression etc. for intermediate hive output?
> How to make job cache does not go too high etc.?
> In short, best practices for huge queries on hive?
>
> Any inputs are really appreciated!
>
> Thanks,
> JJ
>
> Sent from my iPhone
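For reference, the RCFILE format mentioned by MiaoMiao can be tried with a table definition sketch like the one below; the table and column names are hypothetical, and the plain-text source table (page_views) is assumed to already exist:

```sql
-- Store a table in RCFile (columnar) format instead of plain text
CREATE TABLE page_views_rc (
  user_id BIGINT,
  url     STRING,
  ts      STRING
)
STORED AS RCFILE;

-- Populate it from an existing plain-text table so queries can be compared
INSERT OVERWRITE TABLE page_views_rc
SELECT user_id, url, ts FROM page_views;
```

Whether this beats plain text depends on the workload; as noted above, in some projects the two formats perform about the same.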
