> I do not have in mind a scenario where it could be useful to specify a
> LIMIT in bytes. The LIMIT clause is usually used when you know how many
> rows you wish to display or use. Unless somebody has a useful scenario in
> mind I do not think that there is a need for that feature.

If you have rows that vary significantly in size, your latencies can end up being pretty unpredictable with a LIMIT BY <row_count>. Being able to specify a limit in bytes at the driver / API level would let app devs get more deterministic results out of their interactions with the DB if they're looking to respond back to a client within a certain time frame and/or determine next steps in the app (continue paging, stop, etc.) based on how long it took to get results back.
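Until something like this exists server-side, the closest approximation is client-side. A minimal sketch, assuming the DataStax Java driver 4.x: it layers a rough byte budget on top of the driver's row-based transparent paging by summing each row's raw cell buffers and breaking out of iteration once the budget is spent (approximateRowSize and fetchUpTo are hypothetical helpers, not driver API):

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.cql.ResultSet;
    import com.datastax.oss.driver.api.core.cql.Row;
    import com.datastax.oss.driver.api.core.cql.SimpleStatement;
    import java.nio.ByteBuffer;
    import java.util.ArrayList;
    import java.util.List;

    public class ByteBudgetedRead {

        // Rough serialized size of a row: the sum of its raw cell buffers.
        static int approximateRowSize(Row row) {
            int size = 0;
            for (int i = 0; i < row.getColumnDefinitions().size(); i++) {
                ByteBuffer raw = row.getBytesUnsafe(i);
                if (raw != null) {
                    size += raw.remaining();
                }
            }
            return size;
        }

        // Collect rows until roughly maxBytes have been accumulated. The
        // driver fetches further pages transparently as we iterate, so
        // breaking out of the loop stops any additional page requests.
        static List<Row> fetchUpTo(CqlSession session, String cql, int maxBytes) {
            SimpleStatement stmt = SimpleStatement.newInstance(cql).setPageSize(100);
            ResultSet rs = session.execute(stmt);
            List<Row> rows = new ArrayList<>();
            int remaining = maxBytes;
            for (Row row : rs) {
                remaining -= approximateRowSize(row);
                rows.add(row);
                if (remaining <= 0) {
                    break;
                }
            }
            return rows;
        }
    }

Note that this only bounds what the client accumulates; each individual page is still sized in rows, which is exactly the gap a server-side byte limit would close.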
I'm seeing similar tradeoffs working on gracefully paging over tombstones; there's a strong desire to have more confidence in the statement "If I ask the server for a page of data, I'll very likely get it back within time X". There's an argument that it's a data modeling problem and that apps should model differently to get more consistent row sizes and/or tombstone counts; I'm sympathetic to that, but the more we can loosen those constraints on users, the better their experience, in my opinion.

On Mon, Jun 12, 2023, at 5:39 AM, Jacek Lewandowski wrote:
> Yes, LIMIT BY <bytes> provided by the user in CQL does not make much
> sense to me either
>
>
> On Mon, 12 Jun 2023 at 11:20, Benedict <bened...@apache.org> wrote:
>>
>> I agree that this is more suitable as a paging option, and not as a CQL
>> LIMIT option.
>>
>> If it were to be a CQL LIMIT option though, then it should be accurate
>> regarding the result set IMO; there shouldn't be any further results
>> that could have been returned within the LIMIT.
>>
>>
>>> On 12 Jun 2023, at 10:16, Benjamin Lerer <ble...@apache.org> wrote:
>>>
>>> Thanks Jacek for raising that discussion.
>>>
>>> I do not have in mind a scenario where it could be useful to specify a
>>> LIMIT in bytes. The LIMIT clause is usually used when you know how many
>>> rows you wish to display or use. Unless somebody has a useful scenario
>>> in mind, I do not think that there is a need for that feature.
>>>
>>> Paging in bytes makes sense to me, as the paging mechanism is
>>> transparent for the user in most drivers. It is simply a way to
>>> optimize your memory usage from end to end.
>>>
>>> I do not like the approach of using both of them simultaneously,
>>> because if you request a page with a certain number of rows and do not
>>> get it, it is really confusing and can be a problem for some use cases.
>>> We have users keeping their session open along with the paging state in
>>> order to display pages of data.
>>>
>>> On Mon, 12 Jun 2023 at 09:08, Jacek Lewandowski
>>> <lewandowski.ja...@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> I was working on limiting query results by their size expressed in
>>>> bytes, and some questions arose that I'd like to bring to the mailing
>>>> list.
>>>>
>>>> For queries without aggregation, data limits are applied to the raw
>>>> data returned from replicas. While this works fine for row-count
>>>> limits, since the number of rows is unlikely to change after
>>>> post-processing, it is not as accurate for size-based limits, as cell
>>>> sizes may differ after post-processing (for example, due to applying
>>>> some transformation function, a projection, or whatever).
>>>>
>>>> We can truncate the results after post-processing to stay within the
>>>> user-provided limit in bytes, but if the result is smaller than the
>>>> limit, we will not fetch more. In that case, the meaning of "limit" as
>>>> an actual limit holds, though it would be misleading as a page size,
>>>> because we would not fetch the maximum amount of data that does not
>>>> exceed the page size.
>>>>
>>>> Such a problem is much more visible for "group by" queries with
>>>> aggregation. The paging and limiting mechanism is applied to the rows
>>>> rather than to the groups, as it has no information about how much
>>>> memory a single group uses. For now, I've approximated a group's size
>>>> as the size of the largest participating row.
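To make that heuristic concrete, here is a standalone sketch (not the actual Cassandra internals; estimateGroupSize and its per-row size inputs are hypothetical stand-ins for whatever per-row size measure the engine has at hand):

    final class GroupSizeEstimator {

        // Approximate a group's footprint as the size of its largest
        // participating row, per the heuristic described above; the real
        // aggregated group size is unknown at paging time, so the largest
        // input row serves as the proxy.
        static long estimateGroupSize(long[] rowSizesInBytes) {
            long largest = 0;
            for (long size : rowSizesInBytes) {
                largest = Math.max(largest, size);
            }
            return largest;
        }
    }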
>>>> The problem concerns the allowed interpretation of a size limit
>>>> expressed in bytes: whether we want to use this mechanism to let users
>>>> precisely control the size of the result set, or whether we instead
>>>> want to use it to limit the amount of memory used internally for the
>>>> data and prevent problems (assuming that the size and row-count limits
>>>> can be used simultaneously, so that we stop when we reach either of
>>>> the specified limits).
>>>>
>>>> https://issues.apache.org/jira/browse/CASSANDRA-11745
>>>>
>>>> thanks,
>>>> - - -- --- ----- -------- -------------
>>>> Jacek Lewandowski
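For what it's worth, the "stop when we reach either limit" interpretation could look roughly like the following. A minimal sketch with hypothetical names (DualLimits, tryConsume), not the actual CASSANDRA-11745 implementation:

    final class DualLimits {

        private int remainingRows;
        private long remainingBytes;

        DualLimits(int rowLimit, long byteLimit) {
            this.remainingRows = rowLimit;
            this.remainingBytes = byteLimit;
        }

        // Account for one row of the given size; returns false once either
        // the row-count limit or the byte limit is exhausted, signalling
        // that result production should stop.
        boolean tryConsume(long rowSizeInBytes) {
            if (remainingRows <= 0 || remainingBytes <= 0) {
                return false;
            }
            remainingRows -= 1;
            remainingBytes -= rowSizeInBytes;
            return true;
        }
    }

Under that reading, the byte limit acts as a resource guard rather than a precise result-set size, which sidesteps the post-processing inaccuracy described earlier in the thread.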