Hi Dan,

On Friday, October 16, 2020 4:12 PM, Dan Williams <dan.j.willi...@intel.com> wrote:
> 
> On Fri, Oct 16, 2020 at 2:59 PM Nabeel Meeramohideen Mohamed (nmeeramohide) <nmeeramoh...@micron.com> wrote:
> >
> > On Thursday, October 15, 2020 2:03 AM, Christoph Hellwig <h...@infradead.org> wrote:
> > > I don't think this belongs into the kernel.  It is a classic case for
> > > infrastructure that should be built in userspace.  If anything is
> > > missing to implement it in userspace with equivalent performance we
> > > need to improve out interfaces, although io_uring should cover pretty
> > > much everything you need.
> >
> > Hi Christoph,
> >
> > We previously considered moving the mpool object store code to user-space.
> > However, by implementing mpool as a device driver, we get several benefits
> > in terms of scalability, performance, and functionality. In doing so, we relied
> > only on standard interfaces and did not make any changes to the kernel.
> >
> > (1) mpool's "mcache map" facility allows us to memory-map (and later unmap)
> > a collection of logically related objects with a single system call. The objects in
> > such a collection are created at different times, physically disparate, and may
> > even reside on different media class volumes.
> >
> > For our HSE storage engine application, there are commonly 10's to 100's of
> > objects in a given mcache map, and 75,000 total objects mapped at a given time.
> >
> > Compared to memory-mapping objects individually, the mcache map facility
> > scales well because it requires only a single system call and single vm_area_struct
> > to memory-map a complete collection of objects.

> Why can't that be a batch of mmap calls on io_uring?

Agreed, we could add the capability to invoke mmap via io_uring to help mitigate the
system call overhead of memory-mapping individual objects, versus our mcache map
mechanism. However, there is still the scalability issue of having a vm_area_struct
for each object (versus one for each mcache map).

We ran YCSB workload C in two different configurations -
Config 1: memory-mapping each individual object
Config 2: memory-mapping a collection of related objects using mcache map

- Config 1 incurred ~3.3x additional kernel memory for the vm_area_struct slab:
24.8 MB (127,188 objects) for config 1, versus 7.3 MB (37,482 objects) for config 2.

- Workload C exhibited around 10-25% better tail latencies (4-nines) for config 2;
we are not sure whether this is due to the reduced complexity of searching VMAs
during page faults.

> > (2) The mcache map reaper mechanism proactively evicts object data from the page
> > cache based on object-level metrics. This provides significant performance benefit
> > for many workloads.
> >
> > For example, we ran YCSB workloads B (95/5 read/write mix) and C (100% read)
> > against our HSE storage engine using the mpool driver in a 5.9 kernel.
> > For each workload, we ran with the reaper turned-on and turned-off.
> >
> > For workload B, the reaper increased throughput 1.77x, while reducing 99.99% tail
> > latency for reads by 39% and updates by 99%. For workload C, the reaper increased
> > throughput by 1.84x, while reducing the 99.99% read tail latency by 63%. These
> > improvements are even more dramatic with earlier kernels.

> What metrics proved useful and can the vanilla page cache / page
> reclaim mechanism be augmented with those metrics?

The mcache map facility is designed to cache a collection of related immutable
objects with similar lifetimes. It is best suited for storage applications that run
queries against organized collections of immutable objects, such as storage engines
and DBs based on SSTables.

Each mcache map is associated with a temperature (pinned, hot, warm, cold), and it
is left to the application to tag each map appropriately. In our HSE storage engine
application, the SSTables in the root/intermediate levels act as a routing table
that redirects queries to an appropriate leaf-level SSTable; the mcache maps
corresponding to the root/intermediate-level SSTables can therefore be tagged as
pinned/hot.

The mcache reaper tracks the access time of each object in an mcache map. Under
memory pressure, the access time is compared to a time-to-live (TTL) metric that is
set based on the map's temperature, how close free memory is to the low and high
watermarks, etc. If an object was last accessed outside its TTL window, its pages
are evicted from the page cache.

We also apply a few other techniques, such as throttling readahead and adding a
delay in the page fault handler, to avoid overwhelming the page cache during memory
pressure.

In the workloads that we run, we have noticed stalls when kswapd performs reclaim,
which impacts throughput and tail latencies as described in our last email. The
mcache reaper runs proactively and can make better reclaim decisions because it is
designed to address a specific class of workloads.

We doubt the same mechanisms can be employed in the vanilla page cache, as it is
designed to work well across a wide variety of workloads.

> > (4) mpool's immutable object model allows the driver to support concurrent reading
> > of object data directly and memory-mapped without a performance penalty to verify
> > coherence. This allows background operations, such as LSM-tree compaction, to
> > operate efficiently and without polluting the page cache.

> How is this different than existing background operations / defrag
> that filesystems perform today? Where are the opportunities to improve
> those operations?

We haven’t measured the benefit of eliminating the coherence check, which isn’t
needed in our case because objects are immutable. However, the open(2) documentation
states that “applications should avoid mixing mmap(2) of files with direct I/O to
the same files”, which is effectively what we are doing when we directly read from
an object that is also in an mcache map.

> > (5) Representing an mpool as a /dev/mpool/<mpool-name> device file provides a
> > convenient mechanism for controlling access to and managing the multiple storage
> > volumes, and in the future pmem devices, that may comprise a logical mpool.

> Christoph and I have talked about replacing the pmem driver's
> dependence on device-mapper for pooling. What extensions would be
> needed for the existing driver arch?

mpool doesn’t extend any of the existing driver architecture to manage multiple
storage volumes.

Mpool implements the concept of media classes, where each media class corresponds
to a different storage volume. Clients specify a media class when creating an object
in an mpool. mpool currently supports only two media classes: “capacity”, for
storing the bulk of the objects, backed by, for instance, QLC SSDs; and “staging”,
for storing objects requiring lower latency/higher throughput, backed by, for
instance, 3DXP SSDs.

An mpool is accessed via the /dev/mpool/<mpool-name> device file, and the mpool
descriptor attached to this device file instance tracks all its associated media
class volumes. mpool relies on device mapper to provide physical device aggregation
within a media class volume.
