A problem that I have run into repeatedly when doing schema design is how
to control partition size while still allowing for efficient multi-row
queries.

We want to limit partition size to some number between 10 and 100 megabytes
to avoid operational issues. The standard way to do that is to figure out
the maximum number of rows that your "natural partition key" will ever need
to support and then add an additional artificial partition key that
segments the rows sufficiently to keep the partition size under the
maximum. In the case of time series data, this is often done by bucketing
by time period, i.e. creating a new partition every minute, hour or day.
For non-time-series data it is usually done with something like
Hash(clustering-key) mod desired-number-of-partitions.
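
For example, here is a minimal sketch of both flavors of artificial key
(the bucket count, the choice of hash, and the day granularity are all
arbitrary choices for illustration, not anything I'm committed to):

    import hashlib
    from datetime import datetime, timezone

    NUM_BUCKETS = 50  # pick so the largest natural partition stays under the cap

    def time_bucket(ts):
        # Time-series case: one partition per day.
        return ts.strftime("%Y-%m-%d")

    def hash_bucket(clustering_key):
        # Non-time-series case: stable hash mod desired-number-of-partitions,
        # so every writer and reader derives the same bucket.
        digest = hashlib.md5(str(clustering_key).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % NUM_BUCKETS

    # The partition key then becomes (natural_key, bucket).
    print(time_bucket(datetime.now(timezone.utc)))  # day bucket string
    print(hash_bucket("order-12345"))               # bucket number 0..49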

In my case, multi-row queries to support a REST API typically return a page
of results, where the page size might be anywhere from a few dozen up to
thousands. For query efficiency I want the average number of rows per
partition to be large enough that a query can be satisfied by reading a
small number of partitions--ideally one.

So I want to simultaneously limit the maximum number of rows per partition
and yet maintain a large enough average number of rows per partition to
make my queries efficient. But with my data the ratio between maximum and
average can be very large (up to four orders of magnitude).

Here is an example:


Statistic          Rows per Partition    Partition Size
Mode                                1              1 KB
Median                            500            500 KB
90th percentile                 5,000              5 MB
99th percentile                50,000             50 MB
Maximum                     2,500,000            2.5 GB

In this case, 99% of my data could fit in a single 50 MB partition. But if
I use the standard approach, I have to split every partition into 50 pieces
to accommodate the 2.5 GB maximum. That means that to query the 500 rows in
my median case, I have to read 50 partitions instead of one.
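
A quick check of that arithmetic (just restating the table's numbers,
taking 1 KB as 1000 bytes and a 50 MB cap):

    import math

    ROW_BYTES = 1000                  # ~1 KB per row, per the table above
    MAX_PARTITION_BYTES = 50 * 10**6  # 50 MB cap
    MAX_ROWS = 2_500_000              # worst-case natural partition (2.5 GB)

    splits = math.ceil(MAX_ROWS * ROW_BYTES / MAX_PARTITION_BYTES)
    print(splits)  # -> 50 sub-partitions needed for the biggest key

    # With hash-mod-50 bucketing, the median case's 500 rows end up spread
    # across all 50 buckets, so one page query touches all 50 partitions.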

If you try to deal with this by starting a new partition when an old one
fills up, you end up with a nasty distributed consensus problem, along with
a read-before-write on every insert. Cassandra's lightweight transactions
(LWT) weren't available the last time I dealt with this, and they might
help with the consensus part today, but there are still some ugly corner
cases.
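
For reference, the LWT variant I have in mind is roughly the sketch below.
This is only my own sketch: the bucket_counter table, MAX_ROWS_PER_BUCKET,
and the Python driver usage are assumptions, and it still does a read
before every write:

    from cassandra.cluster import Cluster  # DataStax cassandra-driver

    # assumes: CREATE TABLE bucket_counter (
    #     natural_key text PRIMARY KEY, bucket int, row_count int)
    session = Cluster(["127.0.0.1"]).connect("my_keyspace")

    MAX_ROWS_PER_BUCKET = 50_000  # ~50 MB at ~1 KB/row

    def next_bucket(natural_key):
        """Pick the bucket for the next row, advancing a per-key counter
        with a conditional (LWT) update. Note the read-before-write."""
        while True:
            row = session.execute(
                "SELECT bucket, row_count FROM bucket_counter "
                "WHERE natural_key = %s", (natural_key,)).one()

            if row is None:
                # First row ever for this key: claim bucket 0 atomically.
                rs = session.execute(
                    "INSERT INTO bucket_counter (natural_key, bucket, row_count) "
                    "VALUES (%s, 0, 1) IF NOT EXISTS", (natural_key,))
                if rs.was_applied:
                    return 0
                continue  # someone else got there first; re-read

            bucket, count = row.bucket, row.row_count
            if count >= MAX_ROWS_PER_BUCKET:
                bucket, count = bucket + 1, 0  # current bucket is full

            # Conditional update only applies if nobody else moved the counter.
            rs = session.execute(
                "UPDATE bucket_counter SET bucket = %s, row_count = %s "
                "WHERE natural_key = %s IF bucket = %s AND row_count = %s",
                (bucket, count + 1, natural_key, row.bucket, row.row_count))
            if rs.was_applied:
                return bucket  # safe to INSERT the row into (natural_key, bucket)
            # Lost the race; retry with fresh state.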

I have some thoughts on other ways to solve this, but they all have
drawbacks. So I thought I'd ask here and hope that someone has a better
approach.

Thanks in advance,

Jim
