Currently we also have a 100M+ index across 8+ shards. We limit paging after 100K results too. There is no reason a real user wants to go from page 1 to page 987; the only use case is web crawlers. How does everyone deal with this use case?
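A minimal sketch of the kind of application-side cap described above, rejecting any request whose window would pass the limit before it reaches Solr (the 100000 value matches the numbers discussed in this thread; the function name is made up for illustration):

```python
# Illustrative application-side paging guard; not a Solr API.
MAX_WINDOW = 100_000  # cap on start + rows, per the thread above


def check_paging(start: int, rows: int) -> None:
    """Raise if the requested window (start + rows) exceeds MAX_WINDOW."""
    if start < 0 or rows < 0:
        raise ValueError("start and rows must be non-negative")
    if start + rows > MAX_WINDOW:
        raise ValueError(
            f"requested window {start + rows} exceeds the {MAX_WINDOW} limit"
        )


check_paging(0, 10)  # a normal first page passes the check
```

The same check can be clamped instead of raised, but a hard error makes a runaway crawler visible in the application logs.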
On Mon, Jun 28, 2021 at 3:42 PM Ere Maijala <ere.maij...@helsinki.fi> wrote:
> I believe the answer is, as usual, "it depends". For instance, we used to
> have no issues with deep paging when we had a single-shard index. But as
> the index grew, we added more shards and ran into trouble when the start
> and rows params had no limits. The problem was that it took some time to
> identify what was causing the intermittent GC storms. Had there been a
> warning in the logs, we could have found it sooner. We currently limit
> paging to (start+rows) 100000 with no issues (5 shards, ~85 million
> records).
>
> Some defence against deep paging in Solr itself would be of great
> benefit. Maybe it could be tied by default to the result set handling in
> a sharded index, so that when a case is identified where memory usage is
> going to blow up, it would cause at least a warning. But I'd like to be
> able to set up a hard limit that could never be overridden [1]. There's
> another Lucene-based search product that has an even stricter default
> limit of 10000. I think it's a good idea, as it forces a conscious
> decision if the limit is raised.
>
> --Ere
>
> [1] Even if the client has a limit, without a Solr-level limit there's
> the possibility of a bug, or a developer accidentally running a query
> that brings the system down.
>
> Rahul Goswami wrote on 27.6.2021 at 6.08:
> > This begs a question... For anyone who has been burnt by the deep
> > pagination issue in the past, what is a reasonable value of the "start"
> > param beyond which there is a noticeable performance degradation?
> >
> > Rahul
> >
> > On Fri, Jun 25, 2021 at 11:28 PM Walter Underwood <wun...@wunderwood.org>
> > wrote:
> >
> >> Cursors require keeping session state outside of Solr. With a million
> >> queries per hour and the middle tier spread across lots of containers,
> >> that isn't practical. Stateless searches are the default in Solr for a
> >> good reason.
> >>
> >> Using start and rows works great. The only issue is that Solr is
> >> defenseless against deep paging.
> >>
> >> wunder
> >> Walter Underwood
> >> wun...@wunderwood.org
> >> http://observer.wunderwood.org/ (my blog)
> >>
> >>> On Jun 25, 2021, at 8:09 PM, Dwane Hall <dwaneh...@hotmail.com> wrote:
> >>>
> >>> OK, we lock down the rows and start params and then use cursors (which
> >>> you don't want to use) for paging in increments of the page size. It
> >>> works nicely for us, but it sounds like it's not a workable solution
> >>> for you.
> >>>
> >>> Thanks,
> >>>
> >>> Dwane
> >>>
> >>> From: Walter Underwood <wun...@wunderwood.org>
> >>> Sent: Saturday, 26 June 2021 12:53 PM
> >>> To: users@solr.apache.org
> >>> Subject: Re: Defense against deep paging?
> >>>
> >>> The start parameter needs to be read from the request. That is how the
> >>> client gets to the second page of results, by setting start=10 or
> >>> start=20. The problem is when a bot sneaks through the checks and Solr
> >>> gets start=3990000. A few of those will use all of heap and take down
> >>> the server process.
> >>>
> >>> wunder
> >>> Walter Underwood
> >>> wun...@wunderwood.org
> >>> http://observer.wunderwood.org/ (my blog)
> >>>
> >>>> On Jun 25, 2021, at 6:40 PM, Dwane Hall <dwaneh...@hotmail.com> wrote:
> >>>>
> >>>> Hey Walter,
> >>>>
> >>>> Can you set the value for start (0) and rows (your default sensible
> >>>> response row size) as an invariant in the request handler you're
> >>>> using, so it can't be overridden from a client request? That's how
> >>>> I've defended against it from Solr's perspective in the past. This can
> >>>> be hard coded in your request handler in the XML of your solrconfig or
> >>>> using the parameters API.
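The invariants approach Dwane describes looks roughly like this in solrconfig.xml (the handler name and rows value are just examples). Because invariants cannot be overridden by request parameters, paging beyond the fixed window then has to go through cursorMark, as he notes:

```xml
<!-- Example only: pins start and caps rows so clients cannot override them -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="invariants">
    <int name="start">0</int>
    <int name="rows">20</int>
  </lst>
</requestHandler>
```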
> >>>> I've found it a simple but effective approach, and there's an example
> >>>> here in the docs:
> >>>> https://solr.apache.org/guide/8_8/requesthandlers-and-searchcomponents-in-solrconfig.html#request-handlers
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Dwane
> >>>>
> >>>> From: Walter Underwood <wun...@wunderwood.org>
> >>>> Sent: Saturday, 26 June 2021 6:39 AM
> >>>> To: users@solr.apache.org
> >>>> Subject: Re: Defense against deep paging?
> >>>>
> >>>> Thanks, that is exactly the info I wanted! I’ve commented there, even
> >>>> though it is closed as Won’t Do.
> >>>>
> >>>> wunder
> >>>> Walter Underwood
> >>>> wun...@wunderwood.org
> >>>> http://observer.wunderwood.org/ (my blog)
> >>>>
> >>>>> On Jun 25, 2021, at 12:46 PM, Mike Drob <md...@mdrob.com> wrote:
> >>>>>
> >>>>> This was discussed somewhat in
> >>>>> https://issues.apache.org/jira/browse/SOLR-15252 with no
> >>>>> implementation provided.
> >>>>>
> >>>>> On Fri, Jun 25, 2021 at 11:52 AM Walter Underwood
> >>>>> <wun...@wunderwood.org> wrote:
> >>>>>>
> >>>>>> I already said that we have a limit in the client code. I’m asking
> >>>>>> about a limit in Solr.
> >>>>>>
> >>>>>> wunder
> >>>>>> Walter Underwood
> >>>>>> wun...@wunderwood.org
> >>>>>> http://observer.wunderwood.org/ (my blog)
> >>>>>>
> >>>>>>> On Jun 25, 2021, at 11:50 AM, Håvard Wahl Kongsgård
> >>>>>>> <haavard.kongsga...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Just create a proxy client between the user and Solr. Set
> >>>>>>> if page >= 500 ... else ...
> >>>>>>>
> >>>>>>> Simple stuff
> >>>>>>>
> >>>>>>> On Fri, Jun 25, 2021 at 19:20, Walter Underwood
> >>>>>>> <wun...@wunderwood.org> wrote:
> >>>>>>>
> >>>>>>>> Has anyone implemented protection against deep paging inside Solr?
> >>>>>>>> I’m thinking about something like a max_rows parameter, where if
> >>>>>>>> start+rows was greater than that, it would limit the max result to
> >>>>>>>> that number. Or maybe just return a 400, that would be OK too.
> >>>>>>>>
> >>>>>>>> I’ve had three or four outages caused by deep paging over the past
> >>>>>>>> dozen years with Solr. We implement a limit in the client code,
> >>>>>>>> then someone forgets to add it to the redesigned client code. A
> >>>>>>>> limit in the request handler would be so much easier.
> >>>>>>>>
> >>>>>>>> And yes, I know about cursor marks. We don’t want to enable deep
> >>>>>>>> paging, we want to stop it.
> >>>>>>>>
> >>>>>>>> wunder
> >>>>>>>> Walter Underwood
> >>>>>>>> wun...@wunderwood.org
> >>>>>>>> http://observer.wunderwood.org/ (my blog)
> >>>>>>>
> >>>>>>> --
> >>>>>>> Håvard Wahl Kongsgård
> >>>>>>> Data Scientist
>
> --
> Ere Maijala
> Kansalliskirjasto / The National Library of Finland
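Håvard's proxy idea, combined with Walter's preference for returning a 400, could be sketched as a small parameter check in front of Solr (a toy version, not a real proxy; the cutoff and all names here are invented for illustration):

```python
# Toy proxy-side check: reject deep pages with HTTP 400 before the
# query ever reaches Solr. Cutoff and names are illustrative only.
MAX_START_PLUS_ROWS = 10_000  # cf. the stricter 10000 default mentioned above


def vet_params(params: dict) -> tuple[int, str]:
    """Return an (http_status, message) pair for the paging params."""
    try:
        start = int(params.get("start", 0))
        rows = int(params.get("rows", 10))
    except (TypeError, ValueError):
        return 400, "start and rows must be integers"
    if start < 0 or rows < 0:
        return 400, "start and rows must be non-negative"
    if start + rows > MAX_START_PLUS_ROWS:
        return 400, f"paging past {MAX_START_PLUS_ROWS} results is not allowed"
    return 200, "ok"
```

A request like start=3990000 would be turned away with a 400 at the proxy instead of consuming heap on the Solr side.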