Hi Mapred Learn

Please find my replies inline


> What are ways to reduce stress on our cluster for running many such big 
> queries( include joins too) in parallel ?
Some queries generate multiple map reduce jobs whose stages are independent of each other; those stages can run in parallel if you set 'hive.exec.parallel' to 'true'.
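
As a minimal sketch, in the Hive CLI that would look like the following (the thread-number value of 8 is only illustrative; tune it for your cluster):

```sql
-- Enable parallel execution of independent job stages within one query
SET hive.exec.parallel=true;
-- Cap the number of stages run concurrently (illustrative value)
SET hive.exec.parallel.thread.number=8;
```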


> How to enable compression etc for intermediate hive output ?

You can enable compression of the data passed between the map reduce jobs of a query by setting 'hive.exec.compress.intermediate' to 'true'.
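
For example (SnappyCodec is just an illustrative choice; any codec installed on your cluster works):

```sql
-- Compress intermediate data between the MR jobs of a multi-stage query
SET hive.exec.compress.intermediate=true;
SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
```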

Compression for the generated map reduce jobs can be enabled by the following 
properties

//final output compression
hive.exec.compress.output
mapred.output.compress
mapred.output.compression.type
mapred.output.compression.codec

//map output compression
mapred.compress.map.output
mapred.map.output.compression.codec
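
Putting those properties together, a sketch of a session setup could be (the codec choices here are illustrative, not prescriptive):

```sql
-- Final output compression
SET hive.exec.compress.output=true;
SET mapred.output.compress=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

-- Map output compression (lighter codec, since it is decompressed right away)
SET mapred.compress.map.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
```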


> How to make job cache does not go to high etc ?
Hive determines the number of mappers intelligently, but in some cases you need 
to specify a suitable number of reducers for your data set. If you have 
sufficient memory allocated for your child JVMs and the slots are properly 
configured, then there is little chance of OOMs. Also, for processing large 
data volumes you may need to increase the Hive server heap size, as the number 
of splits can be immensely large, and to avoid a resource crunch when multiple 
queries execute in parallel.
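
A sketch of reducer tuning for one session (all values are illustrative and should be tuned per data set):

```sql
-- Either fix the reducer count explicitly for this query...
SET mapred.reduce.tasks=200;

-- ...or let Hive derive it from input size, with an upper bound
SET hive.exec.reducers.bytes.per.reducer=256000000;
SET hive.exec.reducers.max=500;
```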


> In short , best practices for huge queries on hive ?
You can enable Hive's merge step, if required, to avoid the small-files issue 
with the output generated by queries. Beyond that, optimization depends 
entirely on what your queries do; you can go in for join optimizations, 
group-by optimizations etc. based on your queries.
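
For the merge step, a sketch of the relevant properties (the size thresholds are illustrative):

```sql
-- Merge small output files of map-only and map-reduce jobs
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
-- Target size of merged files, and the avg size below which merging kicks in
SET hive.merge.size.per.task=256000000;
SET hive.merge.smallfiles.avgsize=16000000;
```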


 
Regards,
Bejoy KS


----- Original Message -----
From: MiaoMiao <[email protected]>
To: [email protected]
Cc: 
Sent: Friday, September 21, 2012 8:10 AM
Subject: Re: How to run big queries in optimized way ?

Hive implements a format named RCFILE, which could give better
performance, but in my project it just ties with the plain-text
format.

Hive also has an index feature, but it is not so convenient or practical.

I think the best way to optimize is still to reuse the same source
tables, avoid sub-queries, and merge HiveQL statements as much as possible.
On Fri, Sep 21, 2012 at 10:30 AM, Mapred Learn <[email protected]> wrote:
> Hi,
> We have datasets which are about 10-15 TB in size.
>
> We want to run hive queries on top of this input data.
>
> What are ways to reduce stress on our cluster for running many such big 
> queries( include joins too) in parallel ?
> How to enable compression etc for intermediate hive output ?
> How to make job cache does not go to high etc ?
> In short , best practices for huge queries on hive ?
>
> Any inputs are really appreciated !
>
> Thanks,
> JJ
>
> Sent from my iPhone
