[DISCUSS] Properties for scheduling compactions on specific queues

Stamatis Zampetakis Mon, 31 Jan 2022 01:51:44 -0800

Hi all,

This email is an attempt to converge on which Hive/Tez/MR properties
should be used to schedule a compaction on a specific queue. For those
who are not familiar with how queues are used, the YARN capacity
scheduler documentation [1] gives the general idea.


Using specific queues for compaction jobs is necessary to be able to
efficiently allocate resources for maintenance tasks (compaction) and
production workloads. Hive provides various ways to control the queues used
by the compactor and there have been various tickets with improvements and
fixes in this area (see list below).

The granularity at which we can select queues for compactions (all tables
vs. per table) currently depends on which compactor is in use (MR vs.
query-based) and boils down to the following properties:

Global configuration:
* hive.compactor.job.queue
* mapred.job.queue.name
* tez.queue.name

Per table/statement configuration (table properties):
* compactor.mapred.job.queue.name (before HIVE-20723)
* compactor.hive.compactor.job.queue (after HIVE-20723)

Things are a bit blurred with respect to which properties someone should
use to achieve the desired result. Some changes, such as HIVE-20723, raise
backward compatibility concerns, and other changes seem to have a larger
impact than the one they were designed for. For example, after HIVE-25595,
MapReduce queue properties can affect the compactor queues even when Tez
is in use.

In order to avoid confusion and ensure long term support of these queue
selection features we should clarify which of the above properties should
be used.

Given the current situation, I would propose to officially support only the
following:
* hive.compactor.job.queue
* compactor.hive.compactor.job.queue
and align the implementation based on these (if necessary). In other words,
Hive users should not use mapred.job.queue.name and tez.queue.name
explicitly at least when it comes to the compactor. Hive should set them
transparently (as it happens now in various places) based on
[compactor.]hive.compactor.job.queue.
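
To make the proposal a bit more concrete, below is a minimal Python sketch
of the lookup it implies. This is not Hive code: the function names are
hypothetical, and it assumes that the table-level property takes precedence
over the global one and that the same queue name is pushed to both engine
properties.

```python
from typing import Dict, Mapping, Optional

# Property names taken from the proposal above.
GLOBAL_KEY = "hive.compactor.job.queue"
TABLE_KEY = "compactor.hive.compactor.job.queue"


def resolve_compaction_queue(
    global_conf: Mapping[str, str],
    table_props: Mapping[str, str],
) -> Optional[str]:
    """Pick the queue for a compaction (hypothetical helper).

    Assumption: a non-empty table-level value overrides the global one.
    Returns None when neither property is set.
    """
    for source, key in ((table_props, TABLE_KEY), (global_conf, GLOBAL_KEY)):
        value = source.get(key, "").strip()
        if value:
            return value
    return None


def engine_queue_settings(queue: Optional[str]) -> Dict[str, str]:
    """Derive the engine-level properties that Hive would set transparently.

    Assumption: the same queue name is applied to both MR and Tez.
    """
    if queue is None:
        return {}
    return {"mapred.job.queue.name": queue, "tez.queue.name": queue}
```

With this shape, users only ever touch the two hive.compactor.job.queue
variants, and mapred.job.queue.name / tez.queue.name become derived values.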

What do people think? Are there other ideas?

Best,
Stamatis

[1]
https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html

HIVE-11997: Add ability to send Compaction Jobs to specific queue
HIVE-13354: Add ability to specify Compaction options per table and per
request
HIVE-20723: Allow per table specification of compaction yarn queue
HIVE-24781: Allow to use custom queue for query based compaction
HIVE-25801: Custom queue settings is not honoured by Query based compaction
StatsUpdater
HIVE-25595: Custom queue settings is not honoured by compaction StatsUpdater
