Hi all,

This email is an attempt to converge on which Hive/Tez/MR properties someone should use to schedule compactions on specific queues. For those who are not familiar with how queues are used, the YARN capacity scheduler documentation [1] gives the general idea.

Using specific queues for compaction jobs is necessary to allocate resources efficiently between maintenance tasks (compaction) and production workloads. Hive provides various ways to control the queues used by the compactor, and there have been various tickets with improvements and fixes in this area (see the list below).

The granularity at which we can select queues for compactions (all tables vs. per table) currently depends on which compactor is in use (MR vs. query-based) and boils down to the following properties:

Global configuration:
* hive.compactor.job.queue
* mapred.job.queue.name
* tez.queue.name

Per table/statement configuration (table properties):
* compactor.mapred.job.queue.name (before HIVE-20723)
* compactor.hive.compactor.job.queue (after HIVE-20723)

Things are a bit blurred with respect to which properties someone should use to achieve the desired result. Some changes, such as HIVE-20723, raise backward compatibility concerns, and other changes seem to have a larger impact than the one they were specifically designed for. For example, after HIVE-25595, MapReduce queue properties can affect the compactor queues even when Tez is in use.

In order to avoid confusion and ensure long-term support of these queue selection features, we should clarify which of the above properties should be used. Given the current situation, I would propose to officially support only the following:
* hive.compactor.job.queue
* compactor.hive.compactor.job.queue

and align the implementation based on these (if necessary). In other words, Hive users should not set mapred.job.queue.name and tez.queue.name explicitly, at least when it comes to the compactor. Hive should set them transparently (as it happens now in various places) based on [compactor.]hive.compactor.job.queue.

What do people think? Are there other ideas?
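
As an illustration only, the proposal could be modeled roughly as below. This is a sketch in Python, not Hive code: the function names and the queue names used with it are hypothetical, and the rule that a per-table value takes precedence over the global one is an assumption rather than something the current implementation is confirmed to do. It shows the two supported properties being resolved to one queue, with the engine-level properties derived from that queue instead of being read from user configuration.

```python
from typing import Dict, Mapping, Optional

# The two properties the proposal would officially support.
GLOBAL_KEY = "hive.compactor.job.queue"
TABLE_KEY = "compactor.hive.compactor.job.queue"

# Engine-level properties that Hive would set on the user's behalf.
ENGINE_KEYS = ("mapred.job.queue.name", "tez.queue.name")


def resolve_compaction_queue(
    conf: Mapping[str, str], table_props: Mapping[str, str]
) -> Optional[str]:
    """Return the compaction queue, or None if neither property is set.

    Assumption: the per-table property wins over the global one, and
    empty or whitespace-only values count as unset.
    """
    for source, key in ((table_props, TABLE_KEY), (conf, GLOBAL_KEY)):
        value = (source.get(key) or "").strip()
        if value:
            return value
    return None


def engine_overrides(
    conf: Mapping[str, str], table_props: Mapping[str, str]
) -> Dict[str, str]:
    """Derive the engine queue properties from the resolved queue.

    User-supplied mapred.job.queue.name / tez.queue.name values are
    deliberately ignored here, mirroring the proposal that users should
    not set them for the compactor.
    """
    queue = resolve_compaction_queue(conf, table_props)
    if queue is None:
        return {}
    return {key: queue for key in ENGINE_KEYS}
```

With a global setting of "maintenance" and a table property of "etl" (both made-up queue names), this sketch would resolve the table's compactions to "etl" and every other table's to "maintenance".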

Best,
Stamatis

[1] https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html

HIVE-11997: Add ability to send Compaction Jobs to specific queue
HIVE-13354: Add ability to specify Compaction options per table and per request
HIVE-20723: Allow per table specification of compaction yarn queue
HIVE-24781: Allow to use custom queue for query based compaction
HIVE-25801: Custom queue settings is not honoured by Query based compaction StatsUpdater
HIVE-25595: Custom queue settings is not honoured by compaction StatsUpdater