Hi John,
I translated it wrong; I meant 'might' instead of 'may'. If I find the time, I'll take a look at the code to see how the properties could be added as metrics. Thanks for your input.
Regards,
Patrik
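For reference, a minimal sketch of what reading those properties looks like through the RocksDB Java API. This assumes direct access to the org.rocksdb.RocksDB handle, which RocksDBStore does not currently expose, and the database path is illustrative:

    import org.rocksdb.Options;
    import org.rocksdb.RocksDB;
    import org.rocksdb.RocksDBException;

    public class RocksDBMemoryProperties {
        public static void main(String[] args) throws RocksDBException {
            RocksDB.loadLibrary();
            try (Options options = new Options().setCreateIfMissing(true);
                 RocksDB db = RocksDB.open(options, "/tmp/rocksdb-example")) {
                // Each property returns its current value as a String.
                for (String property : new String[] {
                        "rocksdb.block-cache-usage",
                        "rocksdb.estimate-table-readers-mem",
                        "rocksdb.cur-size-all-mem-tables"}) {
                    System.out.println(property + " = " + db.getProperty(property));
                }
            }
        }
    }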
> On 15.07.2019 at 18:03, John Roesler <j...@confluent.io> wrote:
>
> Hey Patrik,
>
> Since KIP-471 is already accepted, and since this idea is not a trivial extension of the KIP, I think we'd need to do a new KIP.
>
> Some points to consider: these additions could not be made to the KeyValueStore interface, since they'd be applicable only to RocksDB-backed stores, but there is no RocksDBKeyValueStore<K,V> which could be used to query the property. Rather, the RocksDB interface is wrapped inside multiple layers that comprise the KeyValueStore. Since these layers themselves are not public interfaces, it would be a challenge to think of how to get access to these methods from user-space code.
>
> On the other hand, could we just add new metrics to access those properties? Offhand, it's unclear to me why you said they "may not be exposed".
>
> Thanks,
> -John
>
>> On Mon, Jul 15, 2019 at 5:11 AM Patrik Kleindl <pklei...@gmail.com> wrote:
>>
>> Hi
>> Adding this here because I asked and may have found the answer: the memory consumption may not be exposed as RocksDB metrics, but it should be available as properties of the RocksDB instance itself. RocksDBStore could easily make this available by allowing access to db.getProperty.
>> Available data (examples):
>> rocksdb.block-cache-usage
>> rocksdb.estimate-table-readers-mem
>> rocksdb.cur-size-all-mem-tables
>>
>> Reference: https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB; the names of the properties are listed in https://github.com/facebook/rocksdb/blob/master/include/rocksdb/db.h
>>
>> Would it make sense to allow this kind of access? Should this be part of a follow-up KIP?
>>
>> best regards
>>
>> Patrik
>>
>>> On Wed, 19 Jun 2019 at 21:55, Bill Bejeck <bbej...@gmail.com> wrote:
>>>
>>> Hi Bruno,
>>>
>>> Just getting caught up on this KIP thread. Looks good to me, and I don't have any additional comments beyond what's already been presented.
>>>
>>> Thanks,
>>> Bill
>>>
>>>> On Wed, Jun 19, 2019 at 1:42 PM Bruno Cadonna <br...@confluent.io> wrote:
>>>>
>>>> John and Guozhang,
>>>>
>>>> thank you for your comments.
>>>>
>>>> @Guozhang could you please also vote on the voting thread so that we have all votes in one place.
>>>>
>>>> @John, the only situation I can think of where a non-uniform configuration of segments would make sense is to account for seasonality. But this would be a really advanced configuration IMO.
>>>>
>>>> Best,
>>>> Bruno
>>>>
>>>>> On Wed, Jun 19, 2019 at 7:18 PM John Roesler <j...@confluent.io> wrote:
>>>>>
>>>>> One last thought. I think what you propose for merging the metrics when a logical store is composed of multiple physical stores makes sense.
>>>>>
>>>>> The basic standard for these metrics is that they should be relevant to performance, and they should be controllable via configurations, specifically via RocksDBConfigSetter. The config setter takes the store name as input. For key/value stores, this name is visible in the tags as "rocksdb-state-id", so the workflow is that I notice (e.g.) the metric with rocksdb-state-id=STORE_X showing a low block cache hit ratio, and I add a condition in my config setter to increase the block cache size for stores named STORE_X.
>>>>>
>>>>> For the case where the logical store has multiple physical stores, we're really talking about the segmented stores: window stores and session stores. In these stores, every segment has a different name (logicalStoreName + "." + segmentId * segmentInterval), so the config setter would need to use a prefix match with respect to the metric tag (e.g.) rocksdb-window-state-id. But at the same time, this is totally doable.
>>>>>
>>>>> It's also perfectly reasonable: since segments get rotated out all the time, it's implausible that you'd ever want a non-uniform configuration over the segments in a store. For this reason, specifically, it makes more sense just to summarize the metrics than to present them individually. Might be worth documenting the relationship, though.
>>>>>
>>>>> Thanks again, I'll vote now,
>>>>> -John
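As a sketch of the workflow John describes, a config setter could prefix-match the store name and raise the block cache size. STORE_X and the 64 MB figure are illustrative, and setBlockCacheSize is deprecated in newer RocksDB versions in favor of setBlockCache:

    import java.util.Map;
    import org.apache.kafka.streams.state.RocksDBConfigSetter;
    import org.rocksdb.BlockBasedTableConfig;
    import org.rocksdb.Options;

    public class StoreXConfigSetter implements RocksDBConfigSetter {

        @Override
        public void setConfig(final String storeName,
                              final Options options,
                              final Map<String, Object> configs) {
            // Physical segments of a logical store are named
            // logicalStoreName + "." + suffix, so a prefix match
            // covers every segment of STORE_X.
            if (storeName.startsWith("STORE_X")) {
                final BlockBasedTableConfig tableConfig = new BlockBasedTableConfig();
                tableConfig.setBlockCacheSize(64 * 1024 * 1024L); // 64 MB, illustrative
                options.setTableFormatConfig(tableConfig);
            }
        }
    }

It would be registered via the rocksdb.config.setter config (StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG).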
>>>>> On Wed, Jun 19, 2019 at 12:03 PM John Roesler <j...@confluent.io> wrote:
>>>>>>
>>>>>> Just taking a look over the metrics again, I had one thought...
>>>>>>
>>>>>> Stuff that happens in a background thread (like compaction metrics) can't directly identify compactions as a bottleneck from Streams' perspective. I.e., a DB might do a lot of compactions, but if those compactions never delay a write or read, then they cannot be a bottleneck.
>>>>>>
>>>>>> Thus, the "stall" metric should be the starting point for bottleneck identification, and then the flush/compaction metrics can be used to secondarily identify what to do to relieve the bottleneck.
>>>>>>
>>>>>> This doesn't affect the metrics you proposed, but I'd suggest saying something to this effect in whatever documentation or descriptions we provide.
>>>>>>
>>>>>> Thanks,
>>>>>> -John
>>>>>>
>>>>>> On Wed, Jun 19, 2019 at 11:25 AM John Roesler <j...@confluent.io> wrote:
>>>>>>>
>>>>>>> Thanks for the updates.
>>>>>>>
>>>>>>> Personally, I'd be in favor of not going out on a limb with unsupported metrics APIs. We should take care to make sure that what we add in KIP-471 is stable and well supported, even if it's not the complete picture. We can always do follow-on work to tackle complex metrics as an isolated design exercise.
>>>>>>>
>>>>>>> Just my two cents.
>>>>>>> Thanks,
>>>>>>> -John
>>>>>>>
>>>>>>> On Wed, Jun 19, 2019 at 6:02 AM Bruno Cadonna <br...@confluent.io> wrote:
>>>>>>>>
>>>>>>>> Hi Guozhang,
>>>>>>>>
>>>>>>>> Regarding your comments about the wiki page:
>>>>>>>>
>>>>>>>> 1) Exactly, I rephrased the paragraph to make it more clear.
>>>>>>>>
>>>>>>>> 2) Yes, I used the wrong term. All hit-related metrics are ratios. I corrected the names of the affected metrics.
>>>>>>>>
>>>>>>>> Regarding your meta comments:
>>>>>>>>
>>>>>>>> 1) The plan is to expose the hit ratio; I used the wrong term. The formulas compute ratios. Regarding your question about a metric to know from where a read is served when it is not in the memtable, there are metrics in RocksDB that give you the number of get() queries that are served from L0, L1, and L2_AND_UP. I could not find any metric that gives you information about whether a query was served from disk vs. the OS cache. One metric that could be used to indirectly measure whether disk or OS cache is accessed seems to be READ_BLOCK_GET_MICROS, which gives you the time for an IO read of a block: if it is high, the block was read from disk, otherwise from the OS cache. A similar strategy to monitor the performance is described in [1]. DISCLAIMER: READ_BLOCK_GET_MICROS is not documented. I had to look into the C++ code to understand its meaning. I could have missed something.
>>>>>>>>
>>>>>>>> 2) There are some additional compaction statistics that contain sizes of files on disk and numbers about write amplification that you can get programmatically in RocksDB, but they are for debugging purposes [2]. To get this data and publish it into a metric, one has to parse a string. Since this data is for debugging purposes, I do not know how stable the output format is. One thing we could do is to dump the string with the compaction statistics into our log files at DEBUG level. But that is outside of the scope of this KIP.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Bruno
>>>>>>>>
>>>>>>>> [1] https://github.com/facebook/rocksdb/wiki/Perf-Context-and-IO-Stats-Context#block-cache-and-os-page-cache-efficiency
>>>>>>>> [2] https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide#rocksdb-statistics
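For concreteness, a sketch of the ratio computation against RocksDB's Java Statistics API; that these exact ticker types back the KIP's formulas is my assumption:

    import org.rocksdb.Statistics;
    import org.rocksdb.TickerType;

    public final class HitRatios {

        private HitRatios() {}

        // hit ratio = hits / (hits + misses); defined here as 1.0
        // when there were no lookups at all
        static double ratio(final long hits, final long misses) {
            final long lookups = hits + misses;
            return lookups == 0 ? 1.0 : (double) hits / lookups;
        }

        public static double memtableHitRatio(final Statistics stats) {
            return ratio(stats.getTickerCount(TickerType.MEMTABLE_HIT),
                         stats.getTickerCount(TickerType.MEMTABLE_MISS));
        }

        public static double blockCacheHitRatio(final Statistics stats) {
            return ratio(stats.getTickerCount(TickerType.BLOCK_CACHE_HIT),
                         stats.getTickerCount(TickerType.BLOCK_CACHE_MISS));
        }
    }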
>>>>>>>> On Tue, Jun 18, 2019 at 8:24 PM Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Hello Bruno,
>>>>>>>>>
>>>>>>>>> I've read through the aggregation section and I think it looks good to me. There are a few minor comments about the wiki page itself:
>>>>>>>>>
>>>>>>>>> 1) "A state store might consist of multiple state stores" -> you mean a `logical` state store can consist of multiple `physical` store instances?
>>>>>>>>>
>>>>>>>>> 2) The "Hit Rates" calculation seems to be referring to the `Hit Ratio` (which is a percentage) rather than the `Hit Rate`?
>>>>>>>>>
>>>>>>>>> And a couple of further meta comments:
>>>>>>>>>
>>>>>>>>> 1) For the memtable / block cache, instead of the hit rate do you think we should expose the hit ratio? I feel it is more useful for users when debugging the root cause of unexpectedly slow performance.
>>>>>>>>>
>>>>>>>>> And for block cache misses, is it easy to provide a metric as of "target read" of where a read is served (from which level, either in the OS cache or in SST files), similar to Fig. 11 in http://cidrdb.org/cidr2017/papers/p82-dong-cidr17.pdf?
>>>>>>>>>
>>>>>>>>> 2) As @Patrik mentioned, is there a good way we can expose the total amount of memory and disk usage for each state store as well? I think it would also be very helpful for users to understand their capacity needs and read / write amplification.
>>>>>>>>>
>>>>>>>>> Guozhang
>>>>>>>>>
>>>>>>>>> On Fri, Jun 14, 2019 at 6:55 AM Bruno Cadonna <br...@confluent.io> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I decided to go for the option in which metrics are exposed for each logical state store. I revisited the KIP correspondingly and added a section on how to aggregate metrics over multiple physical RocksDB instances within one logical state store. It would be great if you could take a look and give feedback. If nobody has complaints about the chosen option, I will proceed with voting on this KIP since this was the last open question.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Bruno
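A sketch of that aggregation for a size-like property, assuming one had handles to the physical org.rocksdb.RocksDB instances (Streams keeps these internal):

    import java.util.List;
    import org.rocksdb.RocksDB;
    import org.rocksdb.RocksDBException;

    public final class LogicalStoreMemory {

        private LogicalStoreMemory() {}

        // Sums a numeric property, e.g. "rocksdb.cur-size-all-mem-tables",
        // over all physical RocksDB instances (segments) of one logical store.
        public static long sumProperty(final List<RocksDB> segments,
                                       final String property) throws RocksDBException {
            long total = 0L;
            for (final RocksDB segment : segments) {
                total += Long.parseLong(segment.getProperty(property));
            }
            return total;
        }
    }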
>>>>>>>>>> On Fri, Jun 7, 2019 at 9:38 PM Patrik Kleindl <pklei...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Sophie
>>>>>>>>>>> This will be a good change; I have been thinking about proposing something similar, or even passing the properties per store.
>>>>>>>>>>> RocksDB should probably know how much memory was reserved but maybe does not expose it.
>>>>>>>>>>> We are limiting it already as you suggested, but this is a rather crude tool. Especially in a larger topology with mixed loads per topic, it would be helpful to get more insight into which store puts a lot of load on memory.
>>>>>>>>>>> Regarding the limiting capability, I think I remember reading that those limits only affect some parts of the memory and others can still exceed the limit. I'll try to look up the difference.
>>>>>>>>>>> Best regards
>>>>>>>>>>> Patrik
>>>>>>>>>>>
>>>>>>>>>>>> On 07.06.2019 at 21:03, Sophie Blee-Goldman <sop...@confluent.io> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Patrik,
>>>>>>>>>>>>
>>>>>>>>>>>> As of 2.3 you will be able to use the RocksDBConfigSetter to effectively bound the total memory used by RocksDB for a single app instance. You should already be able to limit the memory used per RocksDB store, though as you mention there can be a lot of them. I'm not sure you can monitor the memory usage if you are not limiting it, though.
>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Jun 7, 2019 at 2:06 AM Patrik Kleindl <pklei...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi
>>>>>>>>>>>>> Thanks Bruno for the KIP, this is a very good idea.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have one question: are there metrics available for the memory consumption of RocksDB? As the RocksDB instances are running outside the JVM, we have run into issues because they were using all the other memory. And with multiple Streams applications on the same machine, each with several KTables and 10+ partitions per topic, the number of stores can get out of hand pretty easily. Or did I miss something obvious about how those can be monitored better?
>>>>>>>>>>>>>
>>>>>>>>>>>>> best regards
>>>>>>>>>>>>>
>>>>>>>>>>>>> Patrik
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, 17 May 2019 at 23:54, Bruno Cadonna <br...@confluent.io> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> this KIP describes the extension of the Kafka Streams metrics to include RocksDB's internal statistics.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please have a look at it and let me know what you think. Since I am not a RocksDB expert, I am thankful for any additional pair of eyes that evaluates this KIP.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-471:+Expose+RocksDB+Metrics+in+Kafka+Streams
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>> Bruno
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> -- Guozhang