Re: [DISCUSS] KIP-1062: Introduce Pagination for some requests used by Admin API

Omnia Ibrahim Thu, 01 Aug 2024 06:54:50 -0700

Great, I’ll ask again for a vote then on this KIP as it seems like all the 
concerns have been addressed.



> On 30 Jul 2024, at 16:00, David Arthur <mum...@gmail.com> wrote:
> 
> Omnia, thanks for the updates!
> 
>> Am happy to add section for throttling in this KIP if it is high concern
> or open a followup KIP for this once we already have the pagination in
> place. Which one do you suggest?
> 
> I'm okay leaving throttling for a future KIP. It might be useful to see the
> feature in action for a while before deciding if its necessary or the best
> way to approach it.
> 
> On Mon, Jul 22, 2024 at 9:23 AM Omnia Ibrahim <o.g.h.ibra...@gmail.com>
> wrote:
> 
>> 
>> Hi David, thanks for the feedback and sorry for taking long to respond as
>> I was off for a week.
>>> DA1: In "Public Interfaces" you say "max.request.pagination.size.limit"
>>> controls the max items to return by default. It's not clear to me if this
>>> is just a default, or if it is a hard limit. In KIP-966, this config
>> serves
>>> as a hard limit to prevent misconfigured or malicious clients from
>>> requesting too many resources. Can you clarify this bit?
>> 
>> `max.request.partition.size.limit` will be used in same way as KIP-966 I
>> just meant `max.request.partition.size.limit` will equal
>> `max.request.pagination.size.limit` by default unless it is specified
>> otherwise. I clarified this in the KIP now
>> 
>>> DA2: Is "ItemsLeftToFetch" accounting for authorization? If not, it could
>>> be considered a minor info leak.
>> 
>> This is a good point. Any of the requests still will count to what ACLs
>> and resources the authorised user is used by the client, the pagination
>> will not effect this.
>> In cases where the client is using user with wild ACLs I am assuming this
>> is okay and they have the right to see this info.
>> However am rethinking this now as it might not be that useful and we can
>> just relay on if the there is a next cursor or not to simplify the approach
>> similar to KIP-966. I have updated the KIP to reflect this.
>> 
>>> DA3: By splitting up results into pages, we are introducing the
>> possibility
>>> of inconsistency into these RPCs. For example, today MetadataRequest
>>> returns results from the same MetadataImage, so the response is
>> consistent.
>>> With the paging approach, it is possible (likely even) that different
>>> requests will be served from different MetadataImage-s, leading to
>>> inconsistencies. This can be even worse if paged requests go to different
>>> brokers that may be lagging behind in metadata propagation. BTW this
>> issue
>>> exists for KIP-966 as well. We don't necessarily need to solve this right
>>> away, but I think it's worth mentioning in the KIP.
>> 
>> I added a limitation section to the KIP to mention this. I also mentioned
>> it in the top section of public interfaces.
>> 
>>> DA4: Have we considered some generic throttling for paged requests? I
>>> expect it might be an issue if clients want to get everything and just
>> page
>>> through all of the results as quickly as possible.
>> I didn’t consider throttling for pagination requests as
>> Right now the only throttling AdminClient knows is throttling
>> TopicCreate/Delete which is different than pagination and might need it is
>> own conversation and KIP.
>> For example in the case of throttling and retries > timeouts, should
>> consider send back what we fetched so far and allow the operator to set the
>> cursor next time. If this is the case then we need to include cursor to all
>> the Option classes to these requests. Also Admin API for
>> DescribeTopicPartitionRequest in KIP-966 don’t provide Cursor as part of
>> DescribeTopicsOptions.
>> Also extending `controllerMutation` or should we separate the paging
>> throttling to its own quota
>> The only requests I think might actively scraped are `OffsetFetchRequest`,
>> `ListGroupsRequest`, `DescribeGroupsRequest` and
>> `ConsumerGroupDescribeRequest` to actively provide lag metrics/dashboards
>> to consumers. So there might be too many pages.
>> The rest of the requests mostly used during maintenance of the cluster or
>> incidents (specially the producer/txn requests) and operator of the cluster
>> need them to take a decision. The pagination just provides them with a way
>> to escape the timeout problem with large clusters. So am not sure adding
>> throttling during such time would be wise.
>> Am happy to add section for throttling in this KIP if it is high concern
>> or open a followup KIP for this once we already have the pagination in
>> place. Which one do you suggest?
>> 
>> Thanks
>> Omnia
>> 
>>> On 12 Jul 2024, at 14:56, David Arthur <mum...@gmail.com> wrote:
>>> 
>>> Hey Omnia, thanks for the KIP! I think this will be a really nice
>>> improvement for operators.
>>> 
>>> DA1: In "Public Interfaces" you say "max.request.pagination.size.limit"
>>> controls the max items to return by default. It's not clear to me if this
>>> is just a default, or if it is a hard limit. In KIP-966, this config
>> serves
>>> as a hard limit to prevent misconfigured or malicious clients from
>>> requesting too many resources. Can you clarify this bit?
>>> 
>>> DA2: Is "ItemsLeftToFetch" accounting for authorization? If not, it could
>>> be considered a minor info leak.
>>> 
>>> DA3: By splitting up results into pages, we are introducing the
>> possibility
>>> of inconsistency into these RPCs. For example, today MetadataRequest
>>> returns results from the same MetadataImage, so the response is
>> consistent.
>>> With the paging approach, it is possible (likely even) that different
>>> requests will be served from different MetadataImage-s, leading to
>>> inconsistencies. This can be even worse if paged requests go to different
>>> brokers that may be lagging behind in metadata propagation. BTW this
>> issue
>>> exists for KIP-966 as well. We don't necessarily need to solve this right
>>> away, but I think it's worth mentioning in the KIP.
>>> 
>>> DA4: Have we considered some generic throttling for paged requests? I
>>> expect it might be an issue if clients want to get everything and just
>> page
>>> through all of the results as quickly as possible.
>>> 
>>> Cheers,
>>> David
>>> 
>>> 
>>> On Tue, Jul 9, 2024 at 8:20 AM Omnia Ibrahim <o.g.h.ibra...@gmail.com>
>>> wrote:
>>> 
>>>> Hey,
>>>> I will open a voting thread by end of this week if there is no more
>>>> feedbacks.
>>>> 
>>>> Thanks
>>>> Omnia
>>>> 
>>>>> On 2 Jul 2024, at 17:25, Omnia Ibrahim <o.g.h.ibra...@gmail.com>
>> wrote:
>>>>> 
>>>>> Hi David, Thanks for having a look into this!
>>>>> 
>>>>>> DJ1: Why do we want to migrate only a subset of the APIs vs migrating
>>>> all
>>>>>> of them? For instance, there are DescribeGroups,
>> ConsumerGroupDescribe,
>>>>>> etc. Do we have reasons not to migrate them too? I think that it would
>>>> be
>>>>>> great to have a KIP that establishes the pattern for all the admin
>> APIs.
>>>>> 
>>>>> No particular reason I only was trying to focus on the requests that
>>>> cause the most pain ( as far as I know with Kafka). I don’t mind expand
>> it
>>>> to other requests in AdminClient as well.
>>>>> The other thing I was wondering about if
>>>> max.request.partition.size.limit still make sense or should it be
>> renamed
>>>> (or have another one called max.request.pagingation.size.limit) if we
>> going
>>>> to expand to cover all Admin requests.  I am leaning toward
>>>> max.request.pagingation.size.limit and max.request.partition.size.limit
>> can
>>>> be set for now to max.request.partition.size.limit by default.
>>>>> 
>>>>> I have updated the KIP now to cover more requests please have a second
>>>> look and let me know if I missed any?
>>>>> 
>>>>>> DJ2: I am not a fan of all the new parameters passed to the tools
>>>>>> (e.g. --partition-size-limit-per-response). I wonder if we could
>> rather
>>>>>> have a config in the admin client to set the default page size for all
>>>>>> requests.
>>>>> 
>>>>> If it going to be pattern to all requests I agree it should be an admin
>>>> config then.
>>>>> Updated the KIP with this now.
>>>>> 
>>>>>> DJ3: I assume that the Admin client will transparently handle the
>>>> change.
>>>>>> It would be great to call it out in the compatibility section.
>>>>> 
>>>>> Updated, Old clients will still have the old behaviour of no
>> pagingation
>>>> while all new versions will support this.
>>>>> 
>>>>> Thanks
>>>>> Omnia
>>>>> 
>>>>>> On 2 Jul 2024, at 10:43, David Jacot <dja...@confluent.io.INVALID>
>>>> wrote:
>>>>>> 
>>>>>> Hi Omnia,
>>>>>> 
>>>>>> Thanks for the KIP. I agree that we should migrate admin APIs to the
>> new
>>>>>> pattern.
>>>>>> 
>>>>>> DJ1: Why do we want to migrate only a subset of the APIs vs migrating
>>>> all
>>>>>> of them? For instance, there are DescribeGroups,
>> ConsumerGroupDescribe,
>>>>>> etc. Do we have reasons not to migrate them too? I think that it would
>>>> be
>>>>>> great to have a KIP that establishes the pattern for all the admin
>> APIs.
>>>>>> 
>>>>>> DJ2: I am not a fan of all the new parameters passed to the tools
>>>>>> (e.g. --partition-size-limit-per-response). I wonder if we could
>> rather
>>>>>> have a config in the admin client to set the default page size for all
>>>>>> requests.
>>>>>> 
>>>>>> DJ3: I assume that the Admin client will transparently handle the
>>>> change.
>>>>>> It would be great to call it out in the compatibility section.
>>>>>> 
>>>>>> Thanks,
>>>>>> David
>>>>>> 
>>>>>> On Tue, Jul 2, 2024 at 11:17 AM Andrew Schofield <
>>>> andrew_schofi...@live.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi,
>>>>>>> Thanks for the response. Makes sense to me. Just one additional
>>>> comment:
>>>>>>> 
>>>>>>> AS5: The cursor for ListGroupsResponse is called
>> `TransactionalCursor`
>>>>>>> which
>>>>>>> seems like a copy-paste mistake.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Andrew
>>>>>>> 
>>>>>>>> On 30 Jun 2024, at 22:28, Omnia Ibrahim <o.g.h.ibra...@gmail.com>
>>>> wrote:
>>>>>>>> 
>>>>>>>> Hi Andrew thanks for having a look into the KIP
>>>>>>>> 
>>>>>>>>> AS1: Besides topics, the most numerous resources in Kafka clusters
>> in
>>>>>>> my experience
>>>>>>>>> are consumer groups. Would it be possible to extend the KIP to
>> cover
>>>>>>> ListGroups while
>>>>>>>>> you’re in here? I’ve heard of clusters with truly vast numbers of
>>>>>>> groups. This is also
>>>>>>>>> potentially a sign of a misbehaving or poorly written clients.
>>>> Getting
>>>>>>> a page of groups
>>>>>>>>> with a massive ItemsLeftToFetch would be nice.
>>>>>>>> Yes, I also had few experiences with large cluster where to list
>>>>>>> consumer groups can take up to 5min. I update the KIP to include this
>>>> as
>>>>>>> well.
>>>>>>>> 
>>>>>>>>> AS2: A tiny nit: The versions for the added fields are incorrect in
>>>>>>> some cases.
>>>>>>>> I believe I fixed all of them now
>>>>>>>> 
>>>>>>>>> AS3: I don’t quite understand the cursor for
>>>>>>> OffsetFetchRequest/Response.
>>>>>>>>> It looks like the cursor is (topic, partition), but not group ID.
>>>> Does
>>>>>>> the cursor
>>>>>>>>> apply to all groups in the request, or is group ID missing?
>>>>>>>> 
>>>>>>>> I was thinking that the last one in the response will be the one
>> that
>>>>>>> has the cursor while the rest will have null. But if we are moving
>>>>>>> NextCursour to the top level of the response then the cursor will
>> need
>>>>>>> groupID.
>>>>>>>>> AS4: For the remaining request/response pairs, the cursor makes
>> sense
>>>>>>> to me,
>>>>>>>>> but I do wonder whether `NextCursor` should be at the top level of
>>>> the
>>>>>>> responses
>>>>>>>>> instead, like DescribeTopicPartitionsResponse.
>>>>>>>> 
>>>>>>>> Updates the KIP to reflect this now.
>>>>>>>> 
>>>>>>>> Let me know if you have any more feedback on this.
>>>>>>>> 
>>>>>>>> Best
>>>>>>>> Omnia
>>>>>>>> 
>>>>>>>>> On 27 Jun 2024, at 17:53, Andrew Schofield <
>>>> andrew_schofi...@live.com>
>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> Hi Omnia,
>>>>>>>>> Thanks for the KIP. This is a really nice improvement for
>>>> administering
>>>>>>> large clusters.
>>>>>>>>> 
>>>>>>>>> AS1: Besides topics, the most numerous resources in Kafka clusters
>> in
>>>>>>> my experience
>>>>>>>>> are consumer groups. Would it be possible to extend the KIP to
>> cover
>>>>>>> ListGroups while
>>>>>>>>> you’re in here? I’ve heard of clusters with truly vast numbers of
>>>>>>> groups. This is also
>>>>>>>>> potentially a sign of a misbehaving or poorly written clients.
>>>> Getting
>>>>>>> a page of groups
>>>>>>>>> with a massive ItemsLeftToFetch would be nice.
>>>>>>>>> 
>>>>>>>>> AS2: A tiny nit: The versions for the added fields are incorrect in
>>>>>>> some cases.
>>>>>>>>> 
>>>>>>>>> AS3: I don’t quite understand the cursor for
>>>>>>> OffsetFetchRequest/Response.
>>>>>>>>> It looks like the cursor is (topic, partition), but not group ID.
>>>> Does
>>>>>>> the cursor
>>>>>>>>> apply to all groups in the request, or is group ID missing?
>>>>>>>>> 
>>>>>>>>> AS4: For the remaining request/response pairs, the cursor makes
>> sense
>>>>>>> to me,
>>>>>>>>> but I do wonder whether `NextCursor` should be at the top level of
>>>> the
>>>>>>> responses
>>>>>>>>> instead, like DescribeTopicPartitionsResponse.
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Andrew
>>>>>>>>> 
>>>>>>>>>> On 27 Jun 2024, at 14:05, Omnia Ibrahim <o.g.h.ibra...@gmail.com>
>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi everyone, I would like to start a discussion thread for
>> KIP-1062
>>>>>>>>>> 
>>>>>>> 
>>>> 
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1062%3A+Introduce+Pagination+for+some+requests+used+by+Admin+API
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Thanks
>>>>>>>>>> Omnia
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> --
>>> David Arthur
>> 
>> 
> 
> -- 
> David Arthur

Re: [DISCUSS] KIP-1062: Introduce Pagination for some requests used by Admin API

Reply via email to