Hello Bruno, Thanks for the updated KIP. I made a pass and here are some comments:
1) What's the motivation of keeping it as INFO while KIP-471 metrics are defined in DEBUG? 2) Some namings are a bit inconsistent with others and with KIP-471, for example: 2.a) KIP-471 uses "memtable" while in this KIP we use "mem-table", also the "memtable" is prefixed and then the metric name. I'd suggest we keep them consistent. e.g. "num-immutable-mem-table" => "immutable-memtable-count", "cur-size-active-mem-table" => "active-memable-bytes" 2.b) "immutable" are abbreviated as "imm" in some names but not in others, I'd suggest we do not use abbreviations across all names, e.g. "num-entries-imm-mem-tables" => "immutable-memtable-num-entries". 2.c) "-size" "-num" semantics is usually a bit unclear, and I'd suggest we just more concrete terms, e.g. "total-sst-files-size" => "total-sst-files-bytes", "num-live-versions" => "live-versions-count", "background-errors" => "background-errors-count". 3) Some metrics are a bit confusing, e.g. 3.a) What's the difference between "cur-size-all-mem-tables" and "size-all-mem-tables"? 3.b) And the explanation of "estimate-table-readers-mem" does not read very clear to me either, does it refer to "estimate-sst-file-read-buffer-bytes"? 3.c) How does "estimate-oldest-key-time" help with memory usage debugging? 4) For my own education, does "estimate-pending-compaction-bytes" capture all the memory usage for compaction buffers? 5) This is just of a nit comment to help readers better understand rocksDB: maybe we can explain in the wiki doc which part of rocksDB uses memory (block cache, OS cache, memtable, compaction buffer, read buffer), and which of them are on-heap and wich of them are off-heap, which can be hard bounded and which can only be soft bounded and which cannot be bounded at all, etc. Guozhang On Mon, Jul 20, 2020 at 11:00 AM Bruno Cadonna <br...@confluent.io> wrote: > Hi, > > During the implementation of this KIP and after some discussion about > RocksDB metrics, I decided to make some major modifications to this KIP > and kick off discussion again. > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-607%3A+Add+Metrics+to+Kafka+Streams+to+Report+Properties+of+RocksDB > > Best, > Bruno > > On 15.05.20 17:11, Bill Bejeck wrote: > > Thanks for the KIP, Bruno. Having sensible, easy to access RocksDB memory > > reporting will be a welcomed addition. > > > > FWIW I also agree to have the metrics reported on a store level. I'm glad > > you changed the KIP to that effect. > > > > -Bill > > > > > > > > On Wed, May 13, 2020 at 6:24 PM Guozhang Wang <wangg...@gmail.com> > wrote: > > > >> Hi Bruno, > >> > >> Sounds good to me. > >> > >> I think I'm just a bit more curious to see its impact on performance: as > >> long as we have one INFO level rocksDB metrics, then we'd have to turn > on > >> the scheduled rocksdb metrics recorder whereas previously, we can > decide to > >> not turn on the recorder at all if all are set as DEBUG and we > configure at > >> INFO level in production. But this is an implementation detail anyways > and > >> maybe the impact is negligible after all. We can check and re-discuss > this > >> afterwards :) > >> > >> > >> Guozhang > >> > >> > >> On Wed, May 13, 2020 at 9:34 AM Sophie Blee-Goldman < > sop...@confluent.io> > >> wrote: > >> > >>> Thanks Bruno! I took a look at the revised KIP and it looks good to me. > >>> > >>> Sophie > >>> > >>> On Wed, May 13, 2020 at 6:59 AM Bruno Cadonna <br...@confluent.io> > >> wrote: > >>> > >>>> Hi John, > >>>> > >>>> Thank you for the feedback! > >>>> > >>>> I agree and I will change the KIP as I stated in my previous e-mail to > >>>> Guozhang. > >>>> > >>>> Best, > >>>> Bruno > >>>> > >>>> On Tue, May 12, 2020 at 3:07 AM John Roesler <vvcep...@apache.org> > >>> wrote: > >>>>> > >>>>> Thanks, all. > >>>>> > >>>>> If you don’t mind, I’ll pitch in a few cents’ worth. > >>>>> > >>>>> In my life I’ve generally found more granular metrics to be more > >>> useful, > >>>> as long as there’s a sane way to roll them up. It does seem nice to > see > >>> it > >>>> on the per-store level. For roll-up purposes, the task and thread tags > >>>> should be sufficient. > >>>>> > >>>>> I think the only reason we make some metrics Debug is that > >> _recording_ > >>>> them can be expensive. If there’s no added expense, I think we can > just > >>>> register store-level metrics at Info level. > >>>>> > >>>>> Thanks for the KIP, Bruno! > >>>>> -John > >>>>> > >>>>> On Mon, May 11, 2020, at 17:32, Guozhang Wang wrote: > >>>>>> Hello Sophie / Bruno, > >>>>>> > >>>>>> I've also thought about the leveling question, and one motivation I > >>>> had for > >>>>>> setting it in instance-level is that we want to expose it in INFO > >>>> level: > >>>>>> today our report leveling is not very finer grained --- which I > >> think > >>>> is > >>>>>> sth. worth itself --- such that one have to either turn on all > >> DEBUG > >>>>>> metrics recording or none of them. If we can allow users to e.g. > >>>> specify > >>>>>> "turn on streams-metrics and stream-state-metrics, but not others" > >>> and > >>>> then > >>>>>> I think it should be just at store-level. However, right now if we > >>>> want to > >>>>>> set it as store-level then it would be DEBUG and not exposed by > >>>> default. > >>>>>> > >>>>>> So it seems we have several options in addition to the proposed > >> one: > >>>>>> > >>>>>> a) we set it at store-level as INFO; but then one can argue why > >> this > >>> is > >>>>>> INFO while others (bytes-written, etc) are DEBUG. > >>>>>> b) we set it at store-level as DEBUG, believing that we do not > >>> usually > >>>> need > >>>>>> to turn it on. > >>>>>> c) maybe, we can set it at task-level (? I'm not so sure myself > >> about > >>>>>> this.. :P) as INFO. > >>>>>> > >>>>>> > >>>>>> Guozhang > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> On Mon, May 11, 2020 at 12:29 PM Sophie Blee-Goldman < > >>>> sop...@confluent.io> > >>>>>> wrote: > >>>>>> > >>>>>>> Hey Bruno, > >>>>>>> > >>>>>>> Thanks for the KIP! I have one high-level concern, which is that > >> we > >>>> should > >>>>>>> consider > >>>>>>> reporting these metrics on the per-store level rather than > >>>> instance-wide. I > >>>>>>> know I was > >>>>>>> the one who first proposed making it instance-wide, so bear with > >>> me: > >>>>>>> > >>>>>>> While I would still argue that the instance-wide memory usage is > >>>> probably > >>>>>>> the most *useful*, > >>>>>>> exposing them at the store-level does not prevent users from > >>>> monitoring the > >>>>>>> instance-wide > >>>>>>> memory. They should be able to roll up all the store-level > >> metrics > >>>> on an > >>>>>>> instance to > >>>>>>> compute the total off-heap memory. But rolling it up for the > >> users > >>>> does > >>>>>>> prevent them from > >>>>>>> using this to debug rare cases where one store may be using > >>>> significantly > >>>>>>> more memory than > >>>>>>> expected. > >>>>>>> > >>>>>>> It's also worth considering that some users may be using the > >>> bounded > >>>> memory > >>>>>>> config setter > >>>>>>> to put a cap on the off-heap memory of the entire process, in > >> which > >>>> case > >>>>>>> the memory usage > >>>>>>> metric for any one store should reflect the memory usage of the > >>>> entire > >>>>>>> instance. In that case > >>>>>>> any effort to roll up the memory usages ourselves would just be > >>>> wasted. > >>>>>>> > >>>>>>> Sorry for the reversal, but after a second thought I'm pretty > >>>> strongly in > >>>>>>> favor of reporting these > >>>>>>> at the store level. > >>>>>>> > >>>>>>> Best, > >>>>>>> Sophie > >>>>>>> > >>>>>>> On Wed, May 6, 2020 at 8:41 AM Bruno Cadonna <br...@confluent.io > >>> > >>>> wrote: > >>>>>>> > >>>>>>>> Hi all, > >>>>>>>> > >>>>>>>> I'd like to discuss KIP-607 that aims to add RocksDB memory > >> usage > >>>>>>>> metrics to Kafka Streams. > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>> > >>>> > >>> > >> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-607%3A+Add+Metrics+to+Record+the+Memory+Used+by+RocksDB+to+Kafka+Streams > >>>>>>>> > >>>>>>>> Best, > >>>>>>>> Bruno > >>>>>>>> > >>>>>>> > >>>>>> > >>>>>> > >>>>>> -- > >>>>>> -- Guozhang > >>>>>> > >>>> > >>> > >> > >> > >> -- > >> -- Guozhang > >> > > > -- -- Guozhang