I'm not sure we can safely assume only the most recent segment is hot. Anything within the current window size is still being actively queried, and users can independently set windowSize and retentionPeriod as long as windowSize <= retentionPeriod. But the default segmentInterval is max(retentionPeriod / 2, 60,000 ms), so if a store has windowSize ≈ retentionPeriod, the current window actually spans the latest two or three segment stores. We could consider providing metrics only for however many stores fit within windowSize, but that will basically either be all/most of them, or leave out anything being queried by IQ.
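To make the overlap concrete, here is a back-of-the-envelope sketch (illustrative only; the class and method names are made up, and it simply encodes the default segmentInterval formula quoted above plus a naive count of the segments a window can touch):

```java
public class SegmentSpan {

    // Default stated above: segmentInterval = max(retentionPeriod / 2, 60,000 ms)
    static long defaultSegmentIntervalMs(long retentionPeriodMs) {
        return Math.max(retentionPeriodMs / 2, 60_000L);
    }

    // Naive count of segments the current window can touch: the window covers
    // windowSize / segmentInterval full intervals, plus the partially filled
    // latest segment.
    static long segmentsSpanningWindow(long windowSizeMs, long segmentIntervalMs) {
        return windowSizeMs / segmentIntervalMs + 1;
    }

    public static void main(String[] args) {
        long retentionMs = 24 * 60 * 60 * 1000L;                 // 1 day retention
        long windowMs = retentionMs;                             // windowSize == retentionPeriod
        long intervalMs = defaultSegmentIntervalMs(retentionMs); // 12 hours
        // The current window spans the latest 3 segments in this configuration
        System.out.println(segmentsSpanningWindow(windowMs, intervalMs));
    }
}
```

With windowSize ≈ retentionPeriod and the default interval of retentionPeriod / 2, the window always straddles two full intervals plus the in-progress one, which is exactly the "latest two or three stores" situation described above.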
Of course the IQ access patterns are probably fairly different, so there's some question of whether they should be included in the metrics, but I don't see how you could get RocksDB to distinguish which reads are coming from Streams and which from IQ. You could trim some of them by only reporting metrics for stores within the windowSize as mentioned above, but then you'd still be mixing in IQ that just happened to hit the latest stores. It's probably better to include all IQ or none, and I'm not sure how you could accomplish none.

On Thu, Jun 6, 2019 at 12:14 PM Bruno Cadonna <br...@confluent.io> wrote:
> Hi,
>
> I like the idea of just exposing the metrics of the latest segment. I think it gives the most realistic picture of the current operations on the segmented RocksDB without exposing implementation details. The con of this approach is that during the switch to a new segment the values of some metrics might behave a bit strangely because of the empty memtable and empty caches of the new segment. I will follow up on this idea and see what I need to change in the KIP.
>
> Best,
> Bruno
>
> On Thu, Jun 6, 2019 at 2:33 AM Guozhang Wang <wangg...@gmail.com> wrote:
> >
> > I think Bruno's 2) is that for a segmented store, the access rate on different segments will very likely be different. And in fact, most of the access should be on the "latest" segment unless there is 1) very late arrived data, which should be captured by the higher-level `lateness` metrics already, or 2) IQ reads on old windows. The problem is that, say, if 99% of reads go to the latest segment and 1% go to the rest of the segments, how should `memtable-hit-rate` be calculated then?
> >
> > Another wild thought just to throw out here: maybe we can just expose the latest segment's state store as the logical store's metrics? Admittedly it would not be the most accurate, but it is 1) future-proof if we want to consolidate to a 1-1 physical store -> logical store implementation, and 2) simple, not needing to bookkeep older segments which should be rarely accessed. My question is, though, whether upon segment rolling our metrics can be auto-switched to the new store.
> >
> > Guozhang
> >
> > On Tue, Jun 4, 2019 at 3:06 PM Sophie Blee-Goldman <sop...@confluent.io> wrote:
> > >
> > > Hey Bruno,
> > >
> > > I tend to agree with Guozhang on this matter, although you do bring up some good points that should be addressed. Regarding 1), I think it is probably fairly uncommon in practice for users to leverage the individual store names passed to RocksDBConfigSetter#setConfig in order to specify options on a per-store basis. When this actually is used, it does seem likely that users would be doing something like pattern matching the physical store name prefix in order to apply configs to all physical stores (segments) within a single logical RocksDBStore. As you mention, this is something of a hassle already as physical stores are created/deleted, while most likely all anyone cares about is the prefix corresponding to the logical store. It seems like rather than persist this hassle to the metric layer, we should consider refactoring RocksDBConfigSetter to apply to a logical store rather than a specific physical segment. Or maybe providing some kind of tooling to at least make this easier on users, but that's definitely outside the scope of this KIP.
> > >
> > > Regarding 2), can you clarify your point about accessing stores uniformly? While I agree there will definitely be variance in the access pattern of different segments, I think it's unlikely that it will vary in any kind of predictable or deterministic way, hence it is not that useful to know in hindsight the difference reflected by the metrics.
> > >
> > > Cheers,
> > > Sophie
> > >
> > > On Tue, Jun 4, 2019 at 2:09 PM Bruno Cadonna <br...@confluent.io> wrote:
> > > >
> > > > Hi Guozhang,
> > > >
> > > > After some thought, I tend to be in favour of the option with metrics for each physical RocksDB instance, for the following reasons:
> > > >
> > > > 1) A user already needs to be aware of segmented state stores when providing a custom RocksDBConfigSetter. In RocksDBConfigSetter one can specify settings for a store depending on the name of the store. Since segments (i.e. state stores) in a segmented state store have names that share a prefix but have suffixes that are created at runtime, increase with time, and are theoretically unbounded, a user needs to take account of the segments to provide the settings for all (i.e. matching the common prefix) or some (i.e. matching the common prefix and, for example, suffixes according to a specific pattern) of the segments of a specific segmented state store.
> > > > 2) Currently, settings for RocksDB can only be specified by a user per physical instance and not per logical instance. Deriving good settings for physical instances from metrics for a logical instance can be hard if the physical instances are not accessed uniformly. In segmented state stores, segments are not accessed uniformly.
> > > > 3) Simpler to implement and to get things done.
> > > >
> > > > Any thoughts on this from anybody?
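The prefix-matching pattern discussed above can be sketched as a small standalone helper (illustrative only; `logicalStoreName` is a hypothetical name, and the "<logicalName>.<numericSuffix>" segment-naming scheme is an assumption here, so check the actual naming of your Streams version before relying on it):

```java
public class StoreNames {

    // Hypothetical helper: recover the logical store name from a physical
    // segment name, assuming segments are named "<logicalName>.<suffix>"
    // with a numeric, runtime-generated suffix (an assumption, not a
    // documented contract).
    static String logicalStoreName(String physicalName) {
        int dot = physicalName.lastIndexOf('.');
        if (dot > 0 && physicalName.substring(dot + 1).chars().allMatch(Character::isDigit)) {
            return physicalName.substring(0, dot); // strip the segment suffix
        }
        return physicalName; // no segment suffix: already a logical name
    }

    public static void main(String[] args) {
        // A RocksDBConfigSetter could branch on this prefix instead of the full
        // physical name, applying one config to all segments of a logical store:
        System.out.println(logicalStoreName("my-window-store.1559600000000"));
        System.out.println(logicalStoreName("plain-kv-store"));
    }
}
```

This is essentially the hassle being described: every user who wants per-store configs for a segmented store ends up writing some variant of this string handling themselves.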
> > > > Best,
> > > > Bruno
> > > >
> > > > On Thu, May 30, 2019 at 8:33 PM Guozhang Wang <wangg...@gmail.com> wrote:
> > > > >
> > > > > Hi Bruno:
> > > > >
> > > > > Regarding 2), I think either way has some shortcomings: exposing the metrics per RocksDB instance for window / session stores exposes some implementation internals (that we use segmented stores) and forces users to be aware of them. E.g. what if we want to silently change the internal implementation by walking away from the segmented approach? On the other hand, coalescing multiple RocksDB instances' metrics into a single one per logical store also has some concerns, as I mentioned above. What I'm actually thinking is: can we customize the aggregation logic to still have one set of metrics per logical store, which may be composed of multiple RocksDB ones -- e.g. for `bytes-written-rate` we sum them across RocksDBs, while for `memtable-hit-rate` we do a weighted average?
> > > > >
> > > > > Regarding logging levels, I think having DEBUG is fine, but that also means that without manually turning it on, users would not get it. Maybe we should consider adding some dynamic setting mechanism in the future to allow users to turn it on / off during run-time.
> > > > >
> > > > > Guozhang
> > > > >
> > > > > On Tue, May 28, 2019 at 6:23 AM Bruno Cadonna <br...@confluent.io> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Thank you for your comments.
> > > > > >
> > > > > > @Bill:
> > > > > >
> > > > > > 1. It is like Guozhang wrote:
> > > > > > - rocksdb-state-id is for key-value stores
> > > > > > - rocksdb-session-state-id is for session stores
> > > > > > - rocksdb-window-state-id is for window stores
> > > > > > These tags are defined in the corresponding store builders, and I think it is a good idea to re-use them.
> > > > > >
> > > > > > 2. I could not find any exposed ticker or histogram to get the total and average number of compactions, although RocksDB dumps the number of compactions between levels in its log files. There is the NUM_SUBCOMPACTIONS_SCHEDULED histogram, which gives you statistics about the number of subcompactions actually scheduled during a compaction, but that is not what you are looking for. If they expose the number of compactions in the future, we can still add the metrics you propose. I guess the metric in this KIP that would indirectly be influenced by the number of L0 files would be write-stall-duration. If there are too many compactions, this duration should increase. However, this metric is also influenced by memtable flushes.
> > > > > >
> > > > > > @John:
> > > > > >
> > > > > > I think it is a good idea to prefix the flush-related metrics with memtable to avoid ambiguity. I changed the KIP accordingly.
> > > > > >
> > > > > > @Dongjin:
> > > > > >
> > > > > > For the tag and compaction-related comments, please see my answers to Bill.
> > > > > >
> > > > > > I cannot follow your second paragraph. Are you saying that a tuning guide for RocksDB within Streams based on the metrics in this KIP is out of scope? I also think that it doesn't need to be included in this KIP, but it is worth working on afterwards.
> > > > > >
> > > > > > @Guozhang:
> > > > > >
> > > > > > 1. Thank you for the explanation. I missed that. I modified the KIP accordingly.
> > > > > >
> > > > > > 2. No, my plan is to record and expose the set of metrics for each RocksDB store separately. Each set of metrics can then be distinguished by its store ID. Do I miss something here?
> > > > > >
> > > > > > 3. I agree with you and Sophie about user education and that we should work on it after this KIP.
> > > > > >
> > > > > > 4. I also agree on the user API. However, I would like to open a separate KIP for it because I still need a bit of thinking to get it right. I also think it is a good idea to do one step after the other to get at least the built-in RocksDB metrics into the next release. Do you think I chose too many metrics as built-in metrics for this KIP? What do others think?
> > > > > >
> > > > > > @Sophie:
> > > > > >
> > > > > > I tend to DEBUG level, but I do not have hard feelings about it. Do you mean to turn RocksDB metrics on/off in the Streams configuration?
> > > > > >
> > > > > > Best,
> > > > > > Bruno
> > > > > >
> > > > > > On Tue, May 21, 2019 at 8:02 PM Sophie Blee-Goldman <sop...@confluent.io> wrote:
> > > > > > >
> > > > > > > I definitely agree with Guozhang's "meta" comment: if it's possible to allow users to pick and choose individual RocksDB metrics, that would be ideal. One further question is whether these will be debug- or info-level metrics, or a separate level altogether? If there is a nontrivial overhead associated with attaching RocksDB metrics, it would probably be good to be able to independently turn Rocks metrics on/off.
> > > > > > >
> > > > > > > On Tue, May 21, 2019 at 9:00 AM Guozhang Wang <wangg...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Hello Bruno,
> > > > > > > >
> > > > > > > > Thanks for the KIP. I have a few minor comments and a meta one, which are relatively aligned with other folks:
> > > > > > > >
> > > > > > > > Minor:
> > > > > > > >
> > > > > > > > 1) Regarding the "rocksdb-state-id = [store ID]", to be consistent with other state store metrics (see https://cwiki.apache.org/confluence/display/KAFKA/KIP-444%3A+Augment+metrics+for+Kafka+Streams), this tag should be either "rocksdb-window-state-id", "rocksdb-session-state-id" or "rocksdb-state-id". For window / session stores backed by RocksDB, the tag should not be "rocksdb-state-id".
> > > > > > > >
> > > > > > > > 2) Also for window / session stores, my take is that you plan to have multiple RocksDBs behind the scenes report to the same set of metrics, is that right? My concern is that for such types of state stores, most of the access would be on the first segment RocksDB instance, and hence coalescing all of the instances into a single set of metrics may dilute it.
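The dilution concern and the weighted-average idea floated earlier in the thread can be shown with a small standalone sketch (illustrative only, not the KIP's implementation; the class and method names are made up):

```java
public class MetricAggregation {

    // Rate-like metrics (e.g. bytes-written-rate) of the individual
    // segments can simply be summed into the logical store's value.
    static double sumRates(double[] rates) {
        double total = 0.0;
        for (double r : rates) {
            total += r;
        }
        return total;
    }

    // Ratio-like metrics (e.g. memtable-hit-rate) should be weighted by
    // the number of accesses per segment; otherwise a rarely-read old
    // segment dilutes the result.
    static double weightedHitRate(double[] hitRates, long[] accesses) {
        double hits = 0.0;
        long total = 0;
        for (int i = 0; i < hitRates.length; i++) {
            hits += hitRates[i] * accesses[i];
            total += accesses[i];
        }
        return total == 0 ? 0.0 : hits / total;
    }

    public static void main(String[] args) {
        // 99% of reads hit the latest segment (hit rate 0.9),
        // 1% hit an old segment (hit rate 0.1):
        double[] hitRates = {0.9, 0.1};
        long[] accesses = {99L, 1L};
        System.out.println(weightedHitRate(hitRates, accesses)); // ≈ 0.892
        // An unweighted average over the two segments would report 0.5,
        // which is the dilution being described above.
    }
}
```

The open question in the thread is exactly which metrics get which aggregation: sums for counters and rates, access-weighted averages for ratios.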
> > > > > > > > 3) I agree with @sop...@confluent.io that we should have some documentation educating users on what to do when they see anomalies in the metrics; though I think this is a long endeavoring effort that may go beyond this KIP's scope, let's keep that as a separate umbrella JIRA to accumulate knowledge over time.
> > > > > > > >
> > > > > > > > Meta:
> > > > > > > >
> > > > > > > > 4) Instead of trying to enumerate all the ones that might be helpful, I'd recommend that we expose a user-friendly API in StreamsMetrics to allow users to add more metrics from RocksDB they'd like to have, while only keeping a small set of the most meaningful ones that are ubiquitously useful out of the box. WDYT?
> > > > > > > >
> > > > > > > > Guozhang
> > > > > > > >
> > > > > > > > On Tue, May 21, 2019 at 8:04 AM Dongjin Lee <dong...@apache.org> wrote:
> > > > > > > >>
> > > > > > > >> Hi Bruno,
> > > > > > > >>
> > > > > > > >> I just read the KIP. I think this feature is great. As far as I know, most Kafka users monitor only the host resources, JVM resources, and Kafka metrics, not RocksDB, because configuring the statistics feature is a little bit tiresome. Since RocksDB impacts the performance of Kafka Streams, I believe providing these metrics together with the other metrics in one place is much better.
> > > > > > > >>
> > > > > > > >> However, I am a little bit unsure about how much information should be provided to the users in the documentation; how the user can control RocksDB may be on the boundary of the scope. What do you think?
> > > > > > > >>
> > > > > > > >> +1. I entirely agree with Bill's comments; I also think `rocksdb-store-id` is better than `rocksdb-state-id`, and metrics on total compactions and the average number of compactions are also needed.
> > > > > > > >>
> > > > > > > >> Regards,
> > > > > > > >> Dongjin
> > > > > > > >>
> > > > > > > >> On Tue, May 21, 2019 at 2:48 AM John Roesler <j...@confluent.io> wrote:
> > > > > > > >> >
> > > > > > > >> > Hi Bruno,
> > > > > > > >> >
> > > > > > > >> > Looks really good overall. This is going to be an awesome addition.
> > > > > > > >> >
> > > > > > > >> > My only thought was that we have "bytes-flushed-(rate|total)" and "flush-time-(avg|min|max)" metrics, and the description states that these are specifically for memtable flush operations. What do you think about calling it "memtable-bytes-flushed... (etc.)"? On one hand, I could see this being redundant, since that's the only thing that gets "flushed" inside of Rocks. But on the other, we have an independent "flush" operation in Streams, which might be confusing. Plus, it might help people who are looking at the full "menu" of hundreds of metrics. They can't read and remember every description while trying to understand the full list of metrics, so going for maximum self-documentation in the name seems nice.
> > > > > > > >> >
> > > > > > > >> > But that's a minor thought. Modulo the others' comments, this looks good to me.
> > > > > > > >> >
> > > > > > > >> > Thanks,
> > > > > > > >> > -John
> > > > > > > >> >
> > > > > > > >> > On Mon, May 20, 2019 at 11:07 AM Bill Bejeck <bbej...@gmail.com> wrote:
> > > > > > > >> > >
> > > > > > > >> > > Hi Bruno,
> > > > > > > >> > >
> > > > > > > >> > > Thanks for the KIP, this will be a useful addition.
> > > > > > > >> > >
> > > > > > > >> > > Overall the KIP looks good to me, and I have two minor comments.
> > > > > > > >> > >
> > > > > > > >> > > 1. For the tags, I'm wondering if rocksdb-state-id should be rocksdb-store-id instead?
> > > > > > > >> > >
> > > > > > > >> > > 2. With the compaction metrics, would it be possible to add total compactions and an average number of compactions? I've taken a look at the available RocksDB metrics, and I'm not sure. But users can control how many L0 files it takes to trigger compaction, so if it is possible, it may be useful.
> > > > > > > >> > >
> > > > > > > >> > > Thanks,
> > > > > > > >> > > Bill
> > > > > > > >> > >
> > > > > > > >> > > On Mon, May 20, 2019 at 9:15 AM Bruno Cadonna <br...@confluent.io> wrote:
> > > > > > > >> > > >
> > > > > > > >> > > > Hi Sophie,
> > > > > > > >> > > >
> > > > > > > >> > > > Thank you for your comments.
> > > > > > > >> > > >
> > > > > > > >> > > > It's a good idea to supplement the metrics with a configuration option to change the metrics. I also had some thoughts about it. However, I think I need some experimentation to get this right.
> > > > > > > >> > > >
> > > > > > > >> > > > I added the block cache hit rates for index and filter blocks to the KIP. As far as I understood, they should stay at zero if users do not configure RocksDB to include index and filter blocks in the block cache. Did you also understand this similarly? I guess also in this case some experimentation would be good to be sure.
> > > > > > > >> > > >
> > > > > > > >> > > > Best,
> > > > > > > >> > > > Bruno
> > > > > > > >> > > >
> > > > > > > >> > > > On Sat, May 18, 2019 at 2:29 AM Sophie Blee-Goldman <sop...@confluent.io> wrote:
> > > > > > > >> > > > >
> > > > > > > >> > > > > Actually, I wonder if it might be useful to users to be able to break up the cache hit stats by type? Some people may choose to store index and filter blocks alongside data blocks, and it would probably be very helpful for them to know who is making more effective use of the cache, in order to tune how much of it is allocated to each. I'm not sure how common this really is, but I think it would be invaluable to those who do. RocksDB performance can be quite opaque...
> > > > > > > >> > > > >
> > > > > > > >> > > > > Cheers,
> > > > > > > >> > > > > Sophie
> > > > > > > >> > > > >
> > > > > > > >> > > > > On Fri, May 17, 2019 at 5:01 PM Sophie Blee-Goldman <sop...@confluent.io> wrote:
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > Hey Bruno!
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > This all looks pretty good to me, but one suggestion I have is to supplement each of the metrics with some info on how the user can control them. In other words, which options could/should they set in RocksDBConfigSetter should they discover a particular bottleneck?
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > I don't think this necessarily needs to go into the KIP, but I do think it should be included in the docs somewhere (happy to help build up the list of associated options when the time comes).
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > Thanks!
> > > > > > > >> > > > > > Sophie
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > On Fri, May 17, 2019 at 2:54 PM Bruno Cadonna <br...@confluent.io> wrote:
> > > > > > > >> > > > > >>
> > > > > > > >> > > > > >> Hi all,
> > > > > > > >> > > > > >>
> > > > > > > >> > > > > >> this KIP describes the extension of the Kafka Streams' metrics to include RocksDB's internal statistics.
> > > > > > > >> > > > > >>
> > > > > > > >> > > > > >> Please have a look at it and let me know what you think. Since I am not a RocksDB expert, I am thankful for any additional pair of eyes that evaluates this KIP.
> > > > > > > >> > > > > >>
> > > > > > > >> > > > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-471:+Expose+RocksDB+Metrics+in+Kafka+Streams
> > > > > > > >> > > > > >>
> > > > > > > >> > > > > >> Best regards,
> > > > > > > >> > > > > >> Bruno
> > > > > > > >>
> > > > > > > >> --
> > > > > > > >> *Dongjin Lee*
> > > > > > > >>
> > > > > > > >> *A hitchhiker in the mathematical world.*
> > > > > > > >> *github: github.com/dongjinleekr <https://github.com/dongjinleekr>
> > > > > > > >> linkedin: kr.linkedin.com/in/dongjinleekr <https://kr.linkedin.com/in/dongjinleekr>
> > > > > > > >> speakerdeck: speakerdeck.com/dongjin <https://speakerdeck.com/dongjin>*
> > > > > > > >
> > > > > > > > --
> > > > > > > > -- Guozhang
> > > > >
> > > > > --
> > > > > -- Guozhang
> >
> > --
> > -- Guozhang
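As one concrete instance of the "which RocksDBConfigSetter options drive which metrics" documentation discussed in the thread: a config setter along these lines (a sketch, assuming the rocksdbjni BlockBasedTableConfig API; the class name is made up, and this is not part of the KIP) moves index and filter blocks into the block cache, which is the prerequisite for the index/filter block-cache hit-rate metrics to report non-zero values:

```java
import java.util.Map;
import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.Options;

// Sketch only: caches index and filter blocks in the shared block cache so
// that the index/filter block-cache hit-rate metrics become meaningful.
public class CacheIndexFilterConfigSetter implements RocksDBConfigSetter {

    @Override
    public void setConfig(final String storeName, final Options options,
                          final Map<String, Object> configs) {
        final BlockBasedTableConfig tableConfig = new BlockBasedTableConfig();
        // Put index and filter blocks into the block cache instead of
        // keeping them in unbounded heap memory outside the cache:
        tableConfig.setCacheIndexAndFilterBlocks(true);
        // Optionally pin L0 index/filter blocks so they are not evicted
        // under cache pressure:
        tableConfig.setPinL0FilterAndIndexBlocksInCache(true);
        options.setTableFormatConfig(tableConfig);
    }
}
```

The setter would be registered via the `rocksdb.config.setter` Streams config; with it in place, the dilution trade-off Sophie raises (data blocks competing with index/filter blocks for cache space) becomes visible in the per-type hit-rate metrics.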