Yep, real users rarely went beyond page 25, so we set a limit on paging; for crawlers we configured appropriate XML sitemap indexes so they can still reach all the content.
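For reference, a minimal sketch of the kind of XML sitemap index file I mean (the URLs and file names are purely illustrative), so crawlers can reach every document directly instead of paging deeply through search results:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sitemap index: points crawlers at per-segment sitemaps
     that list every item URL directly. -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemaps/items-0001.xml</loc>
    <lastmod>2021-06-01</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemaps/items-0002.xml</loc>
    <lastmod>2021-06-01</lastmod>
  </sitemap>
</sitemapindex>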
On Mon, Jun 28, 2021 at 11:07 AM Ashwin Ramesh <ash...@canva.com.invalid> wrote:

> Currently we also have a 100M+ index across 8+ shards. We limit paging
> after 100K results too. There is no reason why a real user wants to go
> from page 1 to page 987. The only use case is web crawlers. How does
> everyone deal with this use case?
>
> On Mon, Jun 28, 2021 at 3:42 PM Ere Maijala <ere.maij...@helsinki.fi> wrote:
>
> I believe the answer is, as usual, "it depends". For instance, we used to
> have no issues with deep paging when we had a single-shard index. But as
> the index grew, we added more shards and ran into trouble when the start
> and rows params had no limits. The problem was that it took some time to
> identify what was causing the intermittent GC storms. Had there been a
> warning in the logs, we could have found it sooner. We currently limit
> paging to (start+rows) 100000 with no issues (5 shards, ~85 million
> records).
>
> Some defence against deep paging in Solr itself would be of great
> benefit. Maybe it could be tied by default to the result set handling in
> a sharded index so that when a case is identified where memory usage is
> going to blow up, it would cause at least a warning, but I'd like to be
> able to set up a hard limit that could never be overridden [1]. There's
> another Lucene-based search product that has an even stricter default
> limit of 10000. I think it's a good idea as it forces a conscious
> decision if the limit is raised.
>
> --Ere
>
> [1] Even if the client has a limit, without a Solr-level limit there's
> the possibility of a bug, or a developer accidentally running a query
> that brings the system down.
>
> Rahul Goswami wrote on 27.6.2021 at 6.08:
>
> This begs a question... For anyone who has been burnt by the deep
> pagination issue in the past, what is a reasonable value of the "start"
> param beyond which there is a noticeable performance degradation?
>
> Rahul
>
> On Fri, Jun 25, 2021 at 11:28 PM Walter Underwood <wun...@wunderwood.org> wrote:
>
> Cursors require keeping session state outside of Solr. With a million
> queries per hour and the middle tier spread across lots of containers,
> that isn't practical. Stateless searches are the default in Solr for a
> good reason.
>
> Using start and rows works great. The only issue is that Solr is
> defenseless against deep paging.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
> On Jun 25, 2021, at 8:09 PM, Dwane Hall <dwaneh...@hotmail.com> wrote:
>
> Ok, we lock down the rows and start params and then use cursors (which
> you don't want to use) for paging in increments of the page size. It
> works nicely for us but it sounds like it's not a workable solution for
> you.
>
> Thanks,
>
> Dwane
>
> From: Walter Underwood <wun...@wunderwood.org>
> Sent: Saturday, 26 June 2021 12:53 PM
> To: users@solr.apache.org
> Subject: Re: Defense against deep paging?
>
> The start parameter needs to be read from the request. That is how the
> client gets to the second page of results, by setting start=10 or
> start=20. The problem is when a bot sneaks through the checks and Solr
> gets start=3990000. A few of those will use all of heap and take down
> the server process.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
> On Jun 25, 2021, at 6:40 PM, Dwane Hall <dwaneh...@hotmail.com> wrote:
>
> Hey Walter,
>
> Can you set the value for start (0) and rows (your default sensible
> response row size) as an invariant in the request handler you're using,
> so it can't be overridden from a client request? That's how I've defended
> against it from Solr's perspective in the past. This can be hard coded in
> your request handler in the XML of your solr-config or using the
> parameters API. I've found it a simple but effective approach and there's
> an example here in the docs:
> https://solr.apache.org/guide/8_8/requesthandlers-and-searchcomponents-in-solrconfig.html#request-handlers
>
> Thanks,
>
> Dwane
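As an inline note for anyone finding this thread in the archives: a minimal sketch of the invariants approach Dwane describes above (the handler name and values are illustrative, not taken from his setup). Since invariants cannot be overridden per request, pinning start to 0 rules out start-based deep paging entirely; clients then page with cursorMark, and a softer cap (rather than a fixed value) still has to be enforced in a front end or proxy.

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
  </lst>
  <!-- Invariants silently replace whatever start/rows the client sends,
       so a request like start=3990000 never reaches the searcher. -->
  <lst name="invariants">
    <int name="start">0</int>
    <int name="rows">20</int>
  </lst>
</requestHandler>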
> From: Walter Underwood <wun...@wunderwood.org>
> Sent: Saturday, 26 June 2021 6:39 AM
> To: users@solr.apache.org
> Subject: Re: Defense against deep paging?
>
> Thanks, that is exactly the info I wanted! I've commented there, even
> though it is closed as Won't Do.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
> On Jun 25, 2021, at 12:46 PM, Mike Drob <md...@mdrob.com> wrote:
>
> This was discussed somewhat in
> https://issues.apache.org/jira/browse/SOLR-15252 with no implementation
> provided.
>
> On Fri, Jun 25, 2021 at 11:52 AM Walter Underwood <wun...@wunderwood.org> wrote:
>
> I already said that we have a limit in the client code. I'm asking about
> a limit in Solr.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
> On Jun 25, 2021, at 11:50 AM, Håvard Wahl Kongsgård
> <haavard.kongsga...@gmail.com> wrote:
>
> Just create a proxy client between the user and Solr. Set if page >= 500 … else.
>
> Simple stuff.
>
> On Fri, 25 Jun 2021 at 19:20, Walter Underwood <wun...@wunderwood.org> wrote:
>
> Has anyone implemented protection against deep paging inside Solr? I'm
> thinking about something like a max_rows parameter, where if start+rows
> was greater than that, it would limit the max result to that number. Or
> maybe just return a 400, that would be OK too.
>
> I've had three or four outages caused by deep paging over the past dozen
> years with Solr. We implement a limit in the client code, then someone
> forgets to add it to the redesigned client code. A limit in the request
> handler would be so much easier.
>
> And yes, I know about cursor marks. We don't want to enable deep paging,
> we want to stop it.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
> --
> Håvard Wahl Kongsgård
> Data Scientist
>
> --
> Ere Maijala
> Kansalliskirjasto / The National Library of Finland

--
Vincenzo D'Amore