> I do not have in mind a scenario where it could be useful to specify a
> LIMIT in bytes. The LIMIT clause is usually used when you know how many
> rows you wish to display or use. Unless somebody has a useful scenario in
> mind I do not think that there is a need for that feature.

If you have rows that vary significantly in size, your latencies can end up being pretty unpredictable with a LIMIT BY <row_count>. Being able to specify a limit in bytes at the driver / API level would let app devs get more deterministic results out of their interactions with the DB if they're looking to respond back to a client within a certain time frame and/or determine next steps in the app (continue paging, stop, etc.) based on how long it took to get results back.
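Until something like this exists server-side, the closest approximation is client-side. A minimal sketch, assuming the DataStax Java driver 4.x: it layers a rough byte budget on top of the driver's row-based transparent paging by summing each row's raw cell buffers and breaking out of iteration once the budget is spent (approximateRowSize and fetchUpTo are hypothetical helpers, not driver API):

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.cql.ResultSet;
    import com.datastax.oss.driver.api.core.cql.Row;
    import com.datastax.oss.driver.api.core.cql.SimpleStatement;
    import java.nio.ByteBuffer;
    import java.util.ArrayList;
    import java.util.List;

    public class ByteBudgetedRead {

        // Rough serialized size of a row: the sum of its raw cell buffers.
        static int approximateRowSize(Row row) {
            int size = 0;
            for (int i = 0; i < row.getColumnDefinitions().size(); i++) {
                ByteBuffer raw = row.getBytesUnsafe(i);
                if (raw != null) {
                    size += raw.remaining();
                }
            }
            return size;
        }

        // Collect rows until roughly maxBytes have been accumulated. The
        // driver fetches further pages transparently as we iterate, so
        // breaking out of the loop stops any additional page requests.
        static List<Row> fetchUpTo(CqlSession session, String cql, int maxBytes) {
            SimpleStatement stmt = SimpleStatement.newInstance(cql).setPageSize(100);
            ResultSet rs = session.execute(stmt);
            List<Row> rows = new ArrayList<>();
            int remaining = maxBytes;
            for (Row row : rs) {
                remaining -= approximateRowSize(row);
                rows.add(row);
                if (remaining <= 0) {
                    break;
                }
            }
            return rows;
        }
    }

Note that this only bounds what the client accumulates; each individual page is still sized in rows, which is exactly the gap a server-side byte limit would close.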
I'm seeing similar tradeoffs working on gracefully paging over tombstones; there's a strong desire to have more confidence in the statement "If I ask the server for a page of data, I'll very likely get it back within time X". There's an argument that it's a data modeling problem and that apps should model differently to get more consistent row sizes and/or tombstone counts; I'm sympathetic to that, but the more we can loosen those constraints on users, the better their experience, in my opinion.

On Mon, Jun 12, 2023, at 5:39 AM, Jacek Lewandowski wrote:
> Yes, LIMIT BY <bytes> provided by the user in CQL does not make much
> sense to me either
>
>
> On Mon, 12 Jun 2023 at 11:20, Benedict <bened...@apache.org> wrote:
>>
>> I agree that this is more suitable as a paging option, and not as a CQL
>> LIMIT option.
>>
>> If it were to be a CQL LIMIT option though, then it should be accurate
>> regarding the result set IMO; there shouldn't be any further results
>> that could have been returned within the LIMIT.
>>
>>
>>> On 12 Jun 2023, at 10:16, Benjamin Lerer <ble...@apache.org> wrote:
>>>
>>> Thanks Jacek for raising that discussion.
>>>
>>> I do not have in mind a scenario where it could be useful to specify a
>>> LIMIT in bytes. The LIMIT clause is usually used when you know how many
>>> rows you wish to display or use. Unless somebody has a useful scenario
>>> in mind, I do not think that there is a need for that feature.
>>>
>>> Paging in bytes makes sense to me, as the paging mechanism is
>>> transparent for the user in most drivers. It is simply a way to
>>> optimize your memory usage from end to end.
>>>
>>> I do not like the approach of using both of them simultaneously,
>>> because if you request a page with a certain number of rows and do not
>>> get it, it is really confusing and can be a problem for some use cases.
>>> We have users keeping their session open along with the paging state in
>>> order to display pages of data.
>>>
>>> On Mon, 12 Jun 2023 at 09:08, Jacek Lewandowski
>>> <lewandowski.ja...@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> I was working on limiting query results by their size expressed in
>>>> bytes, and some questions arose that I'd like to bring to the mailing
>>>> list.
>>>>
>>>> For queries without aggregation, data limits are applied to the raw
>>>> data returned from replicas. While this works fine for row-count
>>>> limits, since the number of rows is unlikely to change after
>>>> post-processing, it is not as accurate for size-based limits, as cell
>>>> sizes may differ after post-processing (for example, due to applying
>>>> some transformation function, a projection, or whatever).
>>>>
>>>> We can truncate the results after post-processing to stay within the
>>>> user-provided limit in bytes, but if the result is smaller than the
>>>> limit, we will not fetch more. In that case, the meaning of "limit" as
>>>> an actual limit holds, though it would be misleading as a page size,
>>>> because we would not fetch the maximum amount of data that does not
>>>> exceed the page size.
>>>>
>>>> Such a problem is much more visible for "group by" queries with
>>>> aggregation. The paging and limiting mechanism is applied to the rows
>>>> rather than to the groups, as it has no information about how much
>>>> memory a single group uses. For now, I've approximated a group's size
>>>> as the size of the largest participating row.
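To make that heuristic concrete, here is a standalone sketch (not the actual Cassandra internals; estimateGroupSize and its per-row size inputs are hypothetical stand-ins for whatever per-row size measure the engine has at hand):

    final class GroupSizeEstimator {

        // Approximate a group's footprint as the size of its largest
        // participating row, per the heuristic described above; the real
        // aggregated group size is unknown at paging time, so the largest
        // input row serves as the proxy.
        static long estimateGroupSize(long[] rowSizesInBytes) {
            long largest = 0;
            for (long size : rowSizesInBytes) {
                largest = Math.max(largest, size);
            }
            return largest;
        }
    }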
>>>> The problem concerns the allowed interpretation of a size limit
>>>> expressed in bytes: whether we want to use this mechanism to let users
>>>> precisely control the size of the result set, or whether we instead
>>>> want to use it to limit the amount of memory used internally for the
>>>> data and prevent problems (assuming that the size and row-count limits
>>>> can be used simultaneously, so that we stop when we reach either of
>>>> the specified limits).
>>>>
>>>> https://issues.apache.org/jira/browse/CASSANDRA-11745
>>>>
>>>> thanks,
>>>> - - -- --- ----- -------- -------------
>>>> Jacek Lewandowski
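For what it's worth, the "stop when we reach either limit" interpretation could look roughly like the following. A minimal sketch with hypothetical names (DualLimits, tryConsume), not the actual CASSANDRA-11745 implementation:

    final class DualLimits {

        private int remainingRows;
        private long remainingBytes;

        DualLimits(int rowLimit, long byteLimit) {
            this.remainingRows = rowLimit;
            this.remainingBytes = byteLimit;
        }

        // Account for one row of the given size; returns false once either
        // the row-count limit or the byte limit is exhausted, signalling
        // that result production should stop.
        boolean tryConsume(long rowSizeInBytes) {
            if (remainingRows <= 0 || remainingBytes <= 0) {
                return false;
            }
            remainingRows -= 1;
            remainingBytes -= rowSizeInBytes;
            return true;
        }
    }

Under that reading, the byte limit acts as a resource guard rather than a precise result-set size, which sidesteps the post-processing inaccuracy described earlier in the thread.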