> Old code; I think prometheus was not yet popular at that time.

I think there is likely more to the explanation here, since we could
have switched at any time in the past few years when prometheus was
already popular.

Before we add caching to our metrics generation, I think we should
consider migrating to the prometheus client. I can't tell from the
prometheus client documentation whether the client has this caching
feature. If it does, then that is an easy win. If it does not, I wonder
whether that implies that prometheus endpoints are not meant to be
queried too frequently.
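
For concreteness, here is a rough sketch of what generating the
/metrics payload with the prometheus Java client could look like (the
metric name below is made up, and I have not checked how our existing
stats would map onto the client's collector types):

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

import io.prometheus.client.CollectorRegistry;
import io.prometheus.client.Counter;
import io.prometheus.client.exporter.common.TextFormat;

public class PrometheusClientSketch {

    // Hypothetical metric; our real broker stats would need to be
    // registered as collectors on a CollectorRegistry instead.
    private static final Counter MESSAGES_IN = Counter.build()
            .name("pulsar_in_messages_total")
            .help("Messages received by the broker")
            .labelNames("topic")
            .register();

    // Serializes everything in the default registry into the Prometheus
    // text exposition format, i.e. what /metrics would return.
    public static String scrape() throws IOException {
        Writer writer = new StringWriter();
        TextFormat.write004(writer,
                CollectorRegistry.defaultRegistry.metricFamilySamples());
        return writer.toString();
    }

    public static void main(String[] args) throws IOException {
        MESSAGES_IN.labels("persistent://public/default/demo").inc();
        System.out.println(scrape());
    }
}
```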

> In the cloud, cloud service providers may monitor the cluster, and users
> also monitor it.

Are you able to provide more detail about which cloud service
providers? Is this just a prometheus server scraping metrics?
Regarding users, I would recommend they view prometheus metrics via
prometheus/grafana precisely because it will decrease load on the broker.
I don't mean to be too pedantic, but this whole feature relies on the
premise that brokers are handling frequent calls to the /metrics
endpoint, so I would like to understand the motivation.

Thanks,
Michael



On Thu, Feb 24, 2022 at 10:48 PM Jiuming Tao
<jm...@streamnative.io.invalid> wrote:
>
> > I have a historical question. Why do we write and maintain our own
> > code to generate the metrics response instead of using the prometheus
> > client library?
>
> Old code; I think prometheus was not yet popular at that time.
>
>
> >> I have learned that the /metrics endpoint will be requested by more than
> >> one metrics collection system.
> >
> > In practice, when does this happen?
> In the cloud, cloud service providers may monitor the cluster, and users
> also monitor it.
>
> >> PrometheusMetricsGenerator#generate will be invoked once per period (such
> >> as 1 minute); the result will be cached and returned directly for every
> >> metrics collection request within that period.
> >
> > Since there are tradeoffs to the cache duration, we should make the
> > period configurable.
>
> Yes, of course
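>
> To make that concrete, the cache I have in mind is roughly the sketch
> below (class names and the config wiring are placeholders, and I have
> simplified the cached payload to a String):
>
> ```java
> import java.util.concurrent.atomic.AtomicReference;
> import java.util.function.Supplier;
>
> /** Caches the generated metrics text for a configurable period. */
> public class CachedMetricsProvider {
>
>     private static class Entry {
>         final long generatedAt;
>         final String payload;
>
>         Entry(long generatedAt, String payload) {
>             this.generatedAt = generatedAt;
>             this.payload = payload;
>         }
>     }
>
>     private final Supplier<String> generator;  // e.g. wraps PrometheusMetricsGenerator#generate
>     private final long periodMillis;            // read from a new broker config setting (name TBD)
>     private final AtomicReference<Entry> cache = new AtomicReference<>();
>
>     public CachedMetricsProvider(Supplier<String> generator, long periodMillis) {
>         this.generator = generator;
>         this.periodMillis = periodMillis;
>     }
>
>     public String getMetrics() {
>         long now = System.currentTimeMillis();
>         Entry current = cache.get();
>         if (current != null && now - current.generatedAt < periodMillis) {
>             return current.payload;          // served from the cache within the period
>         }
>         String fresh = generator.get();      // regenerate at most once per period
>         cache.set(new Entry(now, fresh));    // concurrent callers may regenerate; fine for a sketch
>         return fresh;
>     }
> }
> ```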
>
> > On Feb 25, 2022, at 12:41 PM, Michael Marshall <mmarsh...@apache.org> wrote:
> >
> > I have a historical question. Why do we write and maintain our own
> > code to generate the metrics response instead of using the prometheus
> > client library?
> >
> >> I have learned that the /metrics endpoint will be requested by more than
> >> one metrics collection system.
> >
> > In practice, when does this happen?
> >
> >> PrometheusMetricsGenerator#generate will be invoked once per period (such
> >> as 1 minute); the result will be cached and returned directly for every
> >> metrics collection request within that period.
> >
> > Since there are tradeoffs to the cache duration, we should make the
> > period configurable.
> >
> > Thanks,
> > Michael
> >
> > On Wed, Feb 23, 2022 at 11:06 AM Jiuming Tao
> > <jm...@streamnative.io.invalid> wrote:
> >>
> >> Hi all,
> >>>
> >>> 2. When hundreds of MB of metrics data are collected, it causes high
> >>> heap memory usage, high CPU usage and GC pressure. The
> >>> `PrometheusMetricsGenerator#generate` method uses
> >>> `ByteBufAllocator.DEFAULT.heapBuffer()` to allocate memory for writing
> >>> the metrics data. The default size of
> >>> `ByteBufAllocator.DEFAULT.heapBuffer()` is 256 bytes; when the buffer
> >>> resizes, the new capacity is 512 bytes (the next power of 2) and a
> >>> `mem_copy` operation is performed.
> >>> If I want to write 100 MB of data to the buffer, the final buffer size
> >>> is 128 MB, and the total memory allocated is close to 256 MB (256 bytes
> >>> + 512 bytes + 1 KB + ... + 64 MB + 128 MB). When the buffer size is
> >>> greater than the netty buffer chunkSize (16 MB), it is allocated as an
> >>> UnpooledHeapByteBuf on the heap. After the metrics data is written into
> >>> the buffer and handed to jetty to return to the client, jetty copies it
> >>> into its own buffer, allocating heap memory yet again!
> >>> Under these conditions, to save memory, avoid high CPU usage (too many
> >>> memory allocations and `mem_copy` operations) and reduce GC pressure, I
> >>> want to change `ByteBufAllocator.DEFAULT.heapBuffer()` to
> >>> `ByteBufAllocator.DEFAULT.compositeDirectBuffer()`, which avoids
> >>> `mem_copy` operations and huge memory allocations (CompositeDirectByteBuf
> >>> is a bit slower to read and write, but it's worth it). After writing the
> >>> data, I will call `HttpOutput#write(ByteBuffer)` to write it to the
> >>> client; that method doesn't cause a `mem_copy` (I have to wrap the
> >>> ByteBuf as a ByteBuffer, and wrapping a ByteBuf is zero-copy).
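> >>>
> >>> In rough terms, the change would look something like the sketch below
> >>> (class and method names are for illustration only; error handling and
> >>> the exact wiring into the servlet are omitted):
> >>>
> >>> ```java
> >>> import java.io.IOException;
> >>> import java.nio.ByteBuffer;
> >>> import java.nio.charset.StandardCharsets;
> >>>
> >>> import io.netty.buffer.ByteBufAllocator;
> >>> import io.netty.buffer.CompositeByteBuf;
> >>> import org.eclipse.jetty.server.HttpOutput;
> >>>
> >>> public class MetricsBufferSketch {
> >>>
> >>>     // A composite direct buffer grows by adding components instead of
> >>>     // reallocating a bigger buffer and copying (what heapBuffer() does).
> >>>     static CompositeByteBuf newMetricsBuffer() {
> >>>         return ByteBufAllocator.DEFAULT.compositeDirectBuffer(Integer.MAX_VALUE);
> >>>     }
> >>>
> >>>     static void append(CompositeByteBuf buf, String chunk) {
> >>>         byte[] bytes = chunk.getBytes(StandardCharsets.UTF_8);
> >>>         // Each chunk becomes a new component; no resize-and-copy of old data.
> >>>         buf.addComponent(true,
> >>>                 ByteBufAllocator.DEFAULT.directBuffer(bytes.length).writeBytes(bytes));
> >>>     }
> >>>
> >>>     static void send(CompositeByteBuf buf, HttpOutput out) throws IOException {
> >>>         try {
> >>>             // nioBuffers() exposes the components as ByteBuffers without
> >>>             // copying, so they can be handed to HttpOutput#write(ByteBuffer).
> >>>             for (ByteBuffer nio : buf.nioBuffers()) {
> >>>                 out.write(nio);
> >>>             }
> >>>             out.flush();
> >>>         } finally {
> >>>             buf.release();
> >>>         }
> >>>     }
> >>> }
> >>> ```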
> >>
> >> My local JDK is 15; I just noticed that on JDK 8, ByteBuffer cannot be
> >> extended and implemented. So, if allowed, I will write the metrics data
> >> to temp files and send it to the client with jetty's send_file. That
> >> should turn out to perform even better than `CompositeByteBuf`, and use
> >> less CPU because the thread blocks on I/O. (The /metrics endpoint will
> >> be a bit slower, but I believe it's worth it.)
> >> If that is not allowed, it doesn't matter; the `CompositeByteBuf`
> >> approach still performs better than
> >> `ByteBufAllocator.DEFAULT.heapBuffer()` (see the first image in the
> >> original mail).
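> >>
> >> Roughly, the temp-file path would be something like the sketch below.
> >> I still need to confirm which jetty API corresponds to send_file, so
> >> `HttpOutput#sendContent` here is just my assumption:
> >>
> >> ```java
> >> import java.io.IOException;
> >> import java.io.OutputStream;
> >> import java.nio.channels.FileChannel;
> >> import java.nio.file.Files;
> >> import java.nio.file.Path;
> >> import java.nio.file.StandardOpenOption;
> >>
> >> import org.eclipse.jetty.server.HttpOutput;
> >>
> >> public class MetricsTempFileSketch {
> >>
> >>     /** Generator streams metrics straight to disk instead of a big heap buffer. */
> >>     interface MetricsWriter {
> >>         void writeTo(OutputStream out) throws IOException;
> >>     }
> >>
> >>     static void sendViaTempFile(MetricsWriter generator, HttpOutput out) throws IOException {
> >>         Path tmp = Files.createTempFile("pulsar-metrics-", ".txt");
> >>         try {
> >>             try (OutputStream fileOut = Files.newOutputStream(tmp)) {
> >>                 generator.writeTo(fileOut);      // metrics go to the temp file
> >>             }
> >>             try (FileChannel channel = FileChannel.open(tmp, StandardOpenOption.READ)) {
> >>                 out.sendContent(channel);        // let jetty stream the file to the client
> >>             }
> >>         } finally {
> >>             Files.deleteIfExists(tmp);
> >>         }
> >>     }
> >> }
> >> ```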
> >>
> >> Thanks,
> >> Tao Jiuming
>
