Hi Guozhang,

Do you mean compression or compaction?

Regarding compression, I agree with you except for the merge-sorting part. The docs describing what is stored in the block cache can be found under https://github.com/facebook/rocksdb/wiki/Block-Cache.

Regarding compaction, my statement in my previous e-mail about compaction not using the block cache was a guess. To get to the bottom of it, I asked somebody from RocksDB, and compaction does indeed not use the block cache. Compaction uses the OS to read in the data to compact, but it also uses fadvise to tell the kernel not to cache that data in the OS buffer cache.

Hope that clears up the conflicts! ;-)

Best,
Bruno

On 24.07.20 19:37, Guozhang Wang wrote:
Ack, thanks for the clarification folks! Yeah I agree that from the JVM's
point of view all RocksDB memory is basically off-heap (which makes
operations harder, sigh..)

Regarding the block cache, my understanding is that by default compressed
blocks are in the OS page cache and uncompressed blocks are in the RocksDB
block cache. In Streams, we do not use compression by default, so these
data blocks would be read into the block cache for merge-sorting, while
index / bloom filter / other compressed dictionary blocks are read into the
OS cache by default. Obviously that conflicts with yours, maybe you can
point me to the related docs?

Guozhang

On Fri, Jul 24, 2020 at 2:15 AM Bruno Cadonna <br...@confluent.io> wrote:

Hi Guozhang and Sophie,

1)
My understanding is also that the memtables are off-heap (as is almost
every data structure in RocksDB).

According to the docs, if after a write the size of the memtable exceeds
the option write_buffer_size, the memtable is flushed. I would not call it
hard-bounded since it seems the memtable can exceed this size (a small
sketch of the related options follows below).

Guozhang's other statements about memtables seem correct to me.
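
As a small illustration, here is a sketch of how the related memtable
options could be set through a RocksDBConfigSetter. The class name and
sizes are made up for the example, not a recommendation:

import java.util.Map;
import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.Options;

public class MemtableConfig implements RocksDBConfigSetter {

    @Override
    public void setConfig(final String storeName, final Options options,
                          final Map<String, Object> configs) {
        // a memtable is flushed once a write lets it exceed write_buffer_size
        options.setWriteBufferSize(16 * 1024L * 1024L);
        // limit the number of memtables (active + immutable) kept per store
        options.setMaxWriteBufferNumber(2);
    }

    @Override
    public void close(final String storeName, final Options options) {}
}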

2)
According to the docs, the block cache caches data in memory for reads.
I do not think RocksDB uses the block cache for compaction, because that
would mean each compaction would interfere with the cache replacement
policy in use (LRU is the default in Streams). I suppose RocksDB uses the
OS cache during compactions. So the block cache usage contains data
blocks for reading, and it can also contain index blocks and filter blocks
if configured accordingly, e.g. by using the BoundedMemoryRocksDBConfig
described under

https://kafka.apache.org/25/documentation/streams/developer-guide/memory-mgmt.html

(a condensed sketch follows below).

The pinned usage consists of blocks pinned by table readers such as iterators.
The block cache can be soft or hard bounded. However, there is currently
an open bug in RocksDB regarding hard-bounded block caches.
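
Along the lines of the BoundedMemoryRocksDBConfig from the docs linked
above, a condensed sketch (cache and write buffer sizes are placeholders;
the real example in the docs sets a few more options):

import java.util.Map;
import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.Cache;
import org.rocksdb.LRUCache;
import org.rocksdb.Options;
import org.rocksdb.WriteBufferManager;

public class BoundedMemoryConfig implements RocksDBConfigSetter {

    // shared across all store instances to bound the total off-heap memory
    private static final Cache CACHE = new LRUCache(100 * 1024L * 1024L);
    private static final WriteBufferManager WRITE_BUFFER_MANAGER =
        new WriteBufferManager(50 * 1024L * 1024L, CACHE);

    @Override
    public void setConfig(final String storeName, final Options options,
                          final Map<String, Object> configs) {
        final BlockBasedTableConfig tableConfig =
            (BlockBasedTableConfig) options.tableFormatConfig();
        // account memtable memory against the shared cache
        options.setWriteBufferManager(WRITE_BUFFER_MANAGER);
        // share one block cache across all stores and keep index and
        // filter blocks inside it, so they count against the bound
        tableConfig.setBlockCache(CACHE);
        tableConfig.setCacheIndexAndFilterBlocks(true);
        options.setTableFormatConfig(tableConfig);
    }

    @Override
    public void close(final String storeName, final Options options) {
        // CACHE and WRITE_BUFFER_MANAGER are shared, do not close them here
    }
}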

3)
The statements seem correct.

The total memory usage seems also correct.
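
Just as an illustration of how such a roll-up could look on the
application side, here is a sketch of a hypothetical helper that sums
those three properties over all stores via KafkaStreams#metrics(). It
assumes the metrics end up in the stream-state-metrics group as proposed
in the KIP; note that summing block-cache-usage over-counts if the block
cache is shared across stores:

import java.util.Arrays;
import java.util.List;
import org.apache.kafka.streams.KafkaStreams;

public final class RocksDBMemoryEstimate {

    private static final List<String> MEMORY_METRICS = Arrays.asList(
        "size-all-mem-tables", "block-cache-usage", "estimate-table-readers-mem");

    // hypothetical helper: rough total RocksDB memory of one Streams instance
    public static double estimatedRocksDBMemoryBytes(final KafkaStreams streams) {
        return streams.metrics().values().stream()
            .filter(m -> "stream-state-metrics".equals(m.metricName().group()))
            .filter(m -> MEMORY_METRICS.contains(m.metricName().name()))
            .mapToDouble(m -> ((Number) m.metricValue()).doubleValue())
            .sum();
    }
}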

Best,
Bruno

On 23.07.20 20:46, Sophie Blee-Goldman wrote:
Guozhang,

Just to clarify, the "heap" for all these objects is actually the C++
heap,
not the JVM heap. So from the point of view of a Java application these
would all be considered "off-heap", right?

(Of course there are some pointers in the Java heap to the off-heap
objects but that size is trivial compared to the actual objects)

Sorry for being pedantic. I just happen to know this is a question that
gets frequently asked so it's probably good to be as clear as possible
in the KIP/metrics description.

Also, can you clarify a bit what you mean by hard bounded vs soft bounded?
For example, my impression is that the memtables are actually not hard
bounded at all, while the block cache is soft bounded by default but can
be configured to be hard bounded. And obviously the OS cache is not
exactly bounded, but it shouldn't cause you to run out of usable memory
(like the memtables, for example, might, and have). But I think maybe
you're using a different definition of hard/soft bounded than I'm thinking of.
On Thu, Jul 23, 2020 at 8:07 AM Guozhang Wang <wangg...@gmail.com>
wrote:

Thanks Bruno, they made sense to me.

Regarding the last comment, my main reasoning is that it's better to
explain the RocksDB memory usage to users and link the metrics to the
different categories.

Just to kick this off (and also asking for correction of my own
understanding :) here's what I read from the metrics:

1. Memtables (aka writer buffers, AND read buffers for iterators which
would pin the immutable memtables from flushing). It is allocated
on-heap
and hard-bounded (via memtable_size * max_num_memtables).

     - cur-size-active-mem-table: active
     - cur-size-all-mem-tables: active + unflushed write
     - size-all-mem-tables: active + unflushed write + pinned read

2. Block cache (used for merging / compaction, reads). Allocated on-heap
and soft-bounded.

     - block-cache-usage: compaction + read
     - block-cache-pinned-usage: read

3. OS cache (read buffer), which is the memory usage for filters /
indices
that are outside block cache. Allocated off-heap and not bounded at all.

     - estimate-table-readers-mem


The total memory usage (on-heap and off-heap) is "size-all-mem-tables" +
"block-cache-usage" + "estimate-table-readers-mem".

Is that right?


On Wed, Jul 22, 2020 at 4:28 AM Bruno Cadonna <br...@confluent.io>
wrote:

Hi Guozhang,

Thank you for your feedback!

I answered inline.

Best,
Bruno


On 21.07.20 00:39, Guozhang Wang wrote:
Hello Bruno,

Thanks for the updated KIP. I made a pass and here are some comments:

1) What's the motivation of keeping it as INFO while KIP-471 metrics
are
defined in DEBUG?


The motivation was that the metrics in this KIP do not incur any
performance overhead other than reading out the properties from
RocksDB. For these metrics, RocksDB does not need to maintain anything
additional. In contrast, for the metrics in KIP-471 RocksDB needs to
maintain the statistics object we pass to it, and we also need to switch
on a certain statistics level. So, I thought the metrics in this KIP are
suited to be used in production and can therefore be reported on INFO
level.
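
To illustrate the difference on the RocksDB side, here is a small sketch
against the RocksJava API (not the actual Streams recorder code; exception
handling is reduced to the throws clause):

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.Statistics;
import org.rocksdb.StatsLevel;

public final class PropertiesVsStatistics {

    // property-based metrics (this KIP): read a property from an open DB,
    // RocksDB does not have to maintain anything additional for this
    public static long memtableBytes(final RocksDB db) throws RocksDBException {
        return db.getLongProperty("rocksdb.cur-size-all-mem-tables");
    }

    // statistics-based metrics (KIP-471): a Statistics object has to be
    // registered on the Options before the DB is opened and is then
    // continuously maintained by RocksDB at the configured stats level
    public static Options withStatistics(final Options options) {
        final Statistics statistics = new Statistics();
        statistics.setStatsLevel(StatsLevel.EXCEPT_DETAILED_TIMERS);
        return options.setStatistics(statistics);
    }
}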

2) Some namings are a bit inconsistent with others and with KIP-471,
for
example:

I am aware of the inconsistencies. I took the names from this list in
the RocksDB repo



https://github.com/facebook/rocksdb/blob/b9a4a10659969c71e6f6eab4e4bae8c36ede919f/include/rocksdb/db.h#L654-L686
(with prefix "rocksdb." ripped off). In this way users do not need to
look up or memorize a mapping between our metrics and the RocksDB
properties. To be clear, those are public RocksDB properties.


2.a) KIP-471 uses "memtable" while in this KIP we use "mem-table",
also
the
"memtable" is prefixed and then the metric name. I'd suggest we keep
them
consistent. e.g. "num-immutable-mem-table" =>
"immutable-memtable-count",
"cur-size-active-mem-table" => "active-memable-bytes"

2.b) "immutable" are abbreviated as "imm" in some names but not in
others,
I'd suggest we do not use abbreviations across all names,
e.g. "num-entries-imm-mem-tables" => "immutable-memtable-num-entries".

2.c) "-size" "-num" semantics is usually a bit unclear, and I'd
suggest
we
just more concrete terms, e.g. "total-sst-files-size" =>
"total-sst-files-bytes", "num-live-versions" => "live-versions-count",
"background-errors" => "background-errors-count".

3) Some metrics are a bit confusing, e.g.

3.a) What's the difference between "cur-size-all-mem-tables" and
"size-all-mem-tables"?


cur-size-all-mem-tables records the approximate size of the active and
unflushed immutable memtables. Unflushed immutable memtables are
memtables that have not yet been flushed by the asynchronous flushing
mechanism in RocksDB.

size-all-mem-tables records the sizes recorded in
cur-size-all-mem-tables but additionally also includes pinned immutable
memtables that are kept in memory to maintain the write history in
memory.

As far as I understood those are memtables that are flushed but there
are still table readers (e.g. iterators) that use those memtables.

I added a sentence to explain the difference.

I guess it is worthwhile to have both of these metrics because if
size-all-mem-tables keeps increasing and cur-size-all-mem-tables does not,
there may be an issue with the clean-up of table readers.

3.b) And the explanation of "estimate-table-readers-mem" does not read
very
clear to me either, does it refer to
"estimate-sst-file-read-buffer-bytes"?


No, this metric records the memory used by iterators as well as filters
and indices if the filters and indices are not maintained in the block
cache. Basically this metric reports the memory used outside the block
cache to read data. I modified the description to make it clearer.

3.c) How does "estimate-oldest-key-time" help with memory usage
debugging?

I do not consider this KIP to only help with monitoring of memory
usage. My intention was to expose all RocksDB properties that return an
integer and that make sense for Kafka Streams.
Admittedly, I did a bad job of explaining this in the motivation of the
current KIP.



4) For my own education, does "estimate-pending-compaction-bytes"
capture
all the memory usage for compaction buffers?


No, as far as I understand, this metric refers to bytes rewritten on
disk. Basically, this metric relates to the write amplification of level
compaction. I changed the description.

5) This is just a nit comment to help readers better understand RocksDB:
maybe we can explain in the wiki doc which parts of RocksDB use memory
(block cache, OS cache, memtable, compaction buffer, read buffer), which
of them are on-heap and which of them are off-heap, which can be hard
bounded, which can only be soft bounded, and which cannot be bounded at
all, etc.


Good idea! Will look into it!


Guozhang


On Mon, Jul 20, 2020 at 11:00 AM Bruno Cadonna <br...@confluent.io>
wrote:

Hi,

During the implementation of this KIP and after some discussion about
RocksDB metrics, I decided to make some major modifications to this
KIP
and kick off discussion again.





https://cwiki.apache.org/confluence/display/KAFKA/KIP-607%3A+Add+Metrics+to+Kafka+Streams+to+Report+Properties+of+RocksDB

Best,
Bruno

On 15.05.20 17:11, Bill Bejeck wrote:
Thanks for the KIP, Bruno. Having sensible, easy-to-access RocksDB memory
reporting will be a welcome addition.

FWIW I also agree to have the metrics reported on a store level. I'm
glad
you changed the KIP to that effect.

-Bill



On Wed, May 13, 2020 at 6:24 PM Guozhang Wang <wangg...@gmail.com>
wrote:

Hi Bruno,

Sounds good to me.

I think I'm just a bit more curious to see its impact on performance:
as long as we have one INFO-level RocksDB metric, we'd have to turn on
the scheduled RocksDB metrics recorder, whereas previously we could
decide not to turn on the recorder at all if all metrics are set to DEBUG
and we configure the INFO level in production. But this is an
implementation detail anyway, and maybe the impact is negligible after
all. We can check and re-discuss this afterwards :)


Guozhang


On Wed, May 13, 2020 at 9:34 AM Sophie Blee-Goldman <
sop...@confluent.io>
wrote:

Thanks Bruno! I took a look at the revised KIP and it looks good to me.

Sophie

On Wed, May 13, 2020 at 6:59 AM Bruno Cadonna <br...@confluent.io> wrote:

Hi John,

Thank you for the feedback!

I agree and I will change the KIP as I stated in my previous e-mail to
Guozhang.

Best,
Bruno

On Tue, May 12, 2020 at 3:07 AM John Roesler <vvcep...@apache.org> wrote:

Thanks, all.

If you don’t mind, I’ll pitch in a few cents’ worth.

In my life I’ve generally found more granular metrics to be more useful,
as long as there’s a sane way to roll them up. It does seem nice to see it
on the per-store level. For roll-up purposes, the task and thread tags
should be sufficient.

I think the only reason we make some metrics Debug is that _recording_
them can be expensive. If there’s no added expense, I think we can just
register store-level metrics at Info level.

Thanks for the KIP, Bruno!
-John

On Mon, May 11, 2020, at 17:32, Guozhang Wang wrote:
Hello Sophie / Bruno,

I've also thought about the leveling question, and one motivation I had
for setting it at instance-level is that we want to expose it at INFO
level: today our recording levels are not very fine-grained --- which I
think is something worth improving in itself --- such that one has to
either turn on all DEBUG metrics recording or none of them. If we could
allow users to e.g. specify "turn on streams-metrics and
stream-state-metrics, but not others", then I think it should be just at
store-level. However, right now, if we set it at store-level, it would be
DEBUG and not exposed by default.

So it seems we have several options in addition to the proposed
one:

a) we set it at store-level as INFO; but then one can argue why this is
INFO while others (bytes-written, etc) are DEBUG.
b) we set it at store-level as DEBUG, believing that we do not usually
need to turn it on.
c) maybe, we can set it at task-level (? I'm not so sure myself about
this.. :P) as INFO.


Guozhang




On Mon, May 11, 2020 at 12:29 PM Sophie Blee-Goldman <
sop...@confluent.io>
wrote:

Hey Bruno,

Thanks for the KIP! I have one high-level concern, which is that we
should consider reporting these metrics on the per-store level rather
than instance-wide. I know I was the one who first proposed making it
instance-wide, so bear with me:

While I would still argue that the instance-wide memory usage is probably
the most *useful*, exposing them at the store-level does not prevent
users from monitoring the instance-wide memory. They should be able to
roll up all the store-level metrics on an instance to compute the total
off-heap memory. But rolling it up for the users does prevent them from
using this to debug rare cases where one store may be using significantly
more memory than expected.

It's also worth considering that some users may be using the bounded
memory config setter to put a cap on the off-heap memory of the entire
process, in which case the memory usage metric for any one store should
reflect the memory usage of the entire instance. In that case any effort
to roll up the memory usages ourselves would just be wasted.

Sorry for the reversal, but after a second thought I'm pretty strongly in
favor of reporting these at the store level.

Best,
Sophie

On Wed, May 6, 2020 at 8:41 AM Bruno Cadonna <br...@confluent.io> wrote:

Hi all,

I'd like to discuss KIP-607 that aims to add RocksDB memory
usage
metrics to Kafka Streams.










https://cwiki.apache.org/confluence/display/KAFKA/KIP-607%3A+Add+Metrics+to+Record+the+Memory+Used+by+RocksDB+to+Kafka+Streams

Best,
Bruno




--
-- Guozhang





--
-- Guozhang








--
-- Guozhang




