I think choosing a cluster capacity is really challenging because the
desired capacity depends not only on the size of the dataset but also on
the complexity of the queries. For example, the execution times of TPC-DS
queries on the same dataset can range from under 10 seconds to thousands
of seconds. Moreover, the desired capacity may fluctuate over time: one
may want a large cluster during busy hours but a small cluster at night.
So I think it is case-by-case and depends on the size of the dataset, the
types of queries executed, the number of concurrent queries, and so on.
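If it helps, here is a very rough back-of-envelope sketch of the kind of
arithmetic I mean. Every constant in it (scan fraction per query, target
latency, per-node S3 scan throughput, concurrency) is an assumption you
would need to measure for your own workload, not a recommendation:

    # Illustrative sizing estimate -- all constants are assumptions to be
    # replaced with measurements from your own queries and data.
    import math

    def estimate_nodes(dataset_tb, avg_scan_fraction, concurrent_queries,
                       target_latency_s, per_node_scan_mb_s):
        """Estimate worker nodes needed so the expected concurrent scan
        volume can be read within the target latency."""
        # Data each query is expected to read, in MB (partition pruning and
        # columnar formats usually make this far smaller than the dataset).
        per_query_mb = dataset_tb * 1024 * 1024 * avg_scan_fraction
        # Total MB that must be scanned per second at peak concurrency.
        required_mb_s = per_query_mb * concurrent_queries / target_latency_s
        return math.ceil(required_mb_s / per_node_scan_mb_s)

    # Example: 10 TB dataset, queries touch ~5% of it, 8 concurrent queries,
    # 60 s target latency, ~500 MB/s effective scan throughput per node.
    print(estimate_nodes(10, 0.05, 8, 60, 500))

The point of the sketch is only that the answer moves a lot when you change
the query complexity or the concurrency, which is why a single fixed
capacity rarely fits.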

Because of the difficulty of choosing the right cluster capacity and the
unpredictability of workloads, I think people are looking for solutions
with autoscaling on public clouds, where the cluster capacity grows and
shrinks automatically. I guess most of the commercial solutions offered on
public clouds support autoscaling in one way or another.
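As one concrete illustration (purely an example, not a recommendation from
this thread): on AWS EMR, a managed scaling policy lets the cluster resize
itself between bounds you choose. A minimal sketch with boto3, where the
cluster id and instance counts are made-up placeholders:

    # Illustrative only: attach an EMR managed scaling policy so the
    # cluster resizes itself between a minimum and maximum instance count.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    emr.put_managed_scaling_policy(
        ClusterId="j-XXXXXXXXXXXXX",        # placeholder cluster id
        ManagedScalingPolicy={
            "ComputeLimits": {
                "UnitType": "Instances",
                "MinimumCapacityUnits": 2,   # e.g. quiet hours at night
                "MaximumCapacityUnits": 20,  # e.g. peak business hours
            }
        },
    )

Other vendors expose the same idea through their own APIs; the common
theme is that you set bounds and let the service track the workload.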

--- Sungwoo


On Wed, Dec 18, 2019 at 2:40 AM Sai Teja Desu <
saiteja.d...@globalfoundries.com> wrote:

> Hello All,
>
> I'm looking for a methodology on what basis we should decide the cluster
> capacity for Hive.
>
> Can anyone recommend best practices for choosing a cluster capacity for
> querying data efficiently in Hive? Please note that we have external
> tables in Hive pointing to S3, so we just use Hive for querying the data.
>
> *Thanks,*
> *Sai.*
>
