I think this problem of choosing a cluster capacity is really challenging because the desired capacity depends not only on the size of the dataset but also on the complexity of the queries. For example, the execution times of the TPC-DS queries on the same dataset range from under 10 seconds to thousands of seconds. Moreover, the desired capacity may fluctuate over time: one may want a large cluster during busy hours but a small cluster at night. So I think it is case by case, depending on the size of the dataset, the types of queries executed, the amount of workload in terms of the number of concurrent queries, and so on.
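Just to make the "it depends" part a bit more concrete, here is a toy back-of-envelope sketch. All the numbers and the memory multiplier are made up for illustration, not taken from any documentation, and you would have to tune them per workload:

    # Toy sizing sketch with made-up numbers: estimate how many worker nodes
    # are needed to hold one wave of concurrent queries in memory.
    import math

    def estimate_worker_nodes(scanned_gb_per_query: float,
                              concurrent_queries: int,
                              usable_memory_gb_per_node: float,
                              memory_multiplier: float = 3.0) -> int:
        # memory_multiplier is a rough allowance for joins/aggregations
        # needing more memory than the raw bytes scanned (a pure guess).
        needed_gb = scanned_gb_per_query * memory_multiplier * concurrent_queries
        return max(1, math.ceil(needed_gb / usable_memory_gb_per_node))

    # e.g. 50 GB scanned per query, 10 concurrent queries, 100 GB usable per node
    print(estimate_worker_nodes(50, 10, 100))   # -> 15 worker nodes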
Because of the difficulty of choosing the right cluster capacity and the unpredictability of the workload, I think people are looking for solutions with autoscaling on public clouds, where the cluster capacity increases and decreases automatically. I guess most of the commercial solutions offered on public clouds support autoscaling in one way or another. (A rough sketch of what such a policy can look like is included below the quoted message.)

--- Sungwoo

On Wed, Dec 18, 2019 at 2:40 AM Sai Teja Desu <
saiteja.d...@globalfoundries.com> wrote:

> Hello All,
>
> I'm looking for a methodology on what basis we should decide the cluster
> capacity for Hive.
>
> Can anyone recommend best practices to choose a cluster capacity for
> querying data efficiently in Hive. Please note that, we have external
> tables in Hive pointing to S3, so we just use Hive for querying the data.
>
> *Thanks,*
> *Sai.*
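P.S. As an illustration of the autoscaling idea: if the cluster happens to run on AWS EMR (I am only guessing AWS because the tables point to S3), a custom scaling rule can be attached to a task instance group roughly like the sketch below. The cluster id, instance group id, and thresholds are placeholders, not recommendations.

    # Minimal sketch, assuming an AWS EMR cluster with a task instance group:
    # add nodes when available YARN memory drops below a threshold.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    emr.put_auto_scaling_policy(
        ClusterId="j-XXXXXXXXXXXXX",          # placeholder cluster id
        InstanceGroupId="ig-XXXXXXXXXXXXX",   # placeholder task instance group
        AutoScalingPolicy={
            "Constraints": {"MinCapacity": 2, "MaxCapacity": 20},
            "Rules": [
                {
                    "Name": "ScaleOutOnLowYarnMemory",
                    "Description": "Add nodes when available YARN memory is low",
                    "Action": {
                        "SimpleScalingPolicyConfiguration": {
                            "AdjustmentType": "CHANGE_IN_CAPACITY",
                            "ScalingAdjustment": 2,
                            "CoolDown": 300,
                        }
                    },
                    "Trigger": {
                        "CloudWatchAlarmDefinition": {
                            "ComparisonOperator": "LESS_THAN",
                            "EvaluationPeriods": 1,
                            "MetricName": "YARNMemoryAvailablePercentage",
                            "Namespace": "AWS/ElasticMapReduce",
                            "Period": 300,
                            "Statistic": "AVERAGE",
                            "Threshold": 15.0,
                            "Unit": "PERCENT",
                        }
                    },
                }
            ],
        },
    )

A similar rule with GREATER_THAN on the same metric (or a scheduled policy for the night hours) can scale the cluster back in.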