Each AsterixDB cluster today consists of one or more Node Controllers (NC)
where the data is stored and processed. Each NC has a predefined set of
storage partitions (iodevices). When data is ingested into the system, the
data is hash-partitioned across the total number of storage partitions in
the cluster. Similarly, when the data is queried, each NC will start as
many threads as the number of storage partitions it has to read and process
the data in parallel. While this shared-nothing architecture has its
advantages, it has its drawbacks too. One major drawback is the time needed
to scale the cluster. Adding a new NC to an existing cluster of (n) nodes
means writing a completely new copy of the data, which will now be
hash-partitioned across the new total number of storage partitions of the
(n + 1) nodes. This operation could take several hours or even days, which
is unacceptable in the cloud age.
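To make that cost concrete, here is a minimal Java sketch (not AsterixDB code) of how much data simple modulo hash-partitioning relocates when the partition count grows; the 4-to-5 partition counts and the key range are illustrative assumptions only:

```java
import java.util.stream.IntStream;

public class RehashCost {
    // Count how many of `totalKeys` keys change partitions when the partition
    // count grows from `oldN` to `newN` under simple modulo hash-partitioning.
    static long movedKeys(int oldN, int newN, int totalKeys) {
        return IntStream.range(0, totalKeys)
                .filter(k -> Math.floorMod(k, oldN) != Math.floorMod(k, newN))
                .count();
    }

    public static void main(String[] args) {
        // Growing 4 partitions to 5 relocates 80% of one million keys,
        // i.e. most of the dataset must be rewritten on disk.
        System.out.println("keys moved: " + movedKeys(4, 5, 1_000_000));
    }
}
```

In this toy model, 800,000 of 1,000,000 keys land on a different partition after adding a single partition, which is why repartitioning cost scales with dataset size rather than with the size of the change.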

This APE is about adding a new deployment (cloud) mode to AsterixDB by
implementing compute-storage separation to take advantage of the elasticity
of the cloud. This will require the following:

1. Moving from the dynamic data partitioning described earlier to static
data partitioning based on a number of storage partitions that is
configurable but fixed for the lifetime of a cluster.
2. Introducing the concept of a "compute partition," where each NC will
have a fixed number of compute partitions. This number could be based on
the number of CPU cores the NC has.

This decouples the number of storage partitions being processed on an NC
from the number of its compute partitions.
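As an illustration of the idea, here is a hypothetical Java sketch of such a static assignment; the fixed count of 128 storage partitions, the "nc:compute" slot naming, and the round-robin policy are assumptions made for the example, not part of the proposal:

```java
import java.util.*;

public class PartitionAssignment {
    // Fixed for the cluster's lifetime (a hypothetical value; the APE
    // leaves the actual count configurable).
    static final int STORAGE_PARTITIONS = 128;

    // Spread the fixed storage partitions round-robin across all compute
    // partitions of all NCs. Returns storagePartition -> "nc:compute" slot.
    static Map<Integer, String> assign(Map<String, Integer> computePartitionsPerNc) {
        List<String> computeSlots = new ArrayList<>();
        computePartitionsPerNc.forEach((nc, count) -> {
            for (int c = 0; c < count; c++) {
                computeSlots.add(nc + ":" + c);
            }
        });
        Map<Integer, String> assignment = new TreeMap<>();
        for (int sp = 0; sp < STORAGE_PARTITIONS; sp++) {
            assignment.put(sp, computeSlots.get(sp % computeSlots.size()));
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Two NCs with 8 compute partitions each (e.g. 8 CPU cores per NC).
        Map<String, Integer> cluster = new LinkedHashMap<>();
        cluster.put("nc1", 8);
        cluster.put("nc2", 8);
        Map<Integer, String> map = assign(cluster);
        System.out.println("partition 0 -> " + map.get(0));
        System.out.println("partition 127 -> " + map.get(127));
    }
}
```

Note that 128 storage partitions spread over 16 compute partitions gives each compute partition 8 storage partitions to process, independent of how many NCs exist.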

When an AsterixDB cluster is deployed using the cloud mode, we will do the
following:

- The Cluster Controller will maintain a map containing the assignment of
storage partitions to compute partitions.
- New writes will go to the NC's local storage and be uploaded to an
object store (e.g., AWS S3), which will serve as a highly available shared
filesystem between NCs.
- On queries, each NC will start as many threads as its compute partitions
to process its currently assigned storage partitions.
- On scaling operations, we will simply update the assignment map and NCs
will lazily cache any data of newly assigned storage partitions from the
object store.
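The scaling step above can be sketched as follows; again, the 128-partition count and round-robin policy are illustrative assumptions, and one compute slot per NC is used for brevity:

```java
import java.util.*;

public class ElasticScaling {
    static final int STORAGE_PARTITIONS = 128; // fixed for the cluster's life

    // Round-robin storage partitions over the given NCs (one compute slot
    // per NC, for brevity). Returns storagePartition -> NC name.
    static Map<Integer, String> assign(List<String> ncs) {
        Map<Integer, String> m = new TreeMap<>();
        for (int sp = 0; sp < STORAGE_PARTITIONS; sp++) {
            m.put(sp, ncs.get(sp % ncs.size()));
        }
        return m;
    }

    // Scaling only rewrites entries in the assignment map; the data itself
    // stays in the object store, and a newly assigned partition is cached
    // lazily by its new owner NC on first access.
    static long reassignedCount(List<String> oldNcs, List<String> newNcs) {
        Map<Integer, String> before = assign(oldNcs);
        Map<Integer, String> after = assign(newNcs);
        return before.keySet().stream()
                .filter(sp -> !before.get(sp).equals(after.get(sp)))
                .count();
    }

    public static void main(String[] args) {
        long moved = reassignedCount(List.of("nc1", "nc2"),
                                     List.of("nc1", "nc2", "nc3"));
        System.out.println("reassigned partitions: " + moved
                + " of " + STORAGE_PARTITIONS);
    }
}
```

The key contrast with the earlier rehashing example: a reassignment here is a map update plus a lazy cache fill from the object store, not a rewrite of the partition's data.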

Improvement tickets:
- Static data partitioning:
https://issues.apache.org/jira/browse/ASTERIXDB-3144
- Compute-Storage Separation:
https://issues.apache.org/jira/browse/ASTERIXDB-3196

Please vote on this APE. We'll keep the vote open for 72 hours; it will
pass with either three positive votes or a majority of positive votes.
