Yep, real users rarely went beyond page 25, so we set a limit on paging; for crawlers we configured appropriate XML sitemap indexes so they can still reach all the content.
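For reference, a minimal sketch of the kind of XML sitemap index file I mean (the URLs and file names are purely illustrative), so crawlers can reach every document directly instead of paging deeply through search results:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sitemap index: points crawlers at per-segment sitemaps
     that list every item URL directly. -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemaps/items-0001.xml</loc>
    <lastmod>2021-06-01</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemaps/items-0002.xml</loc>
    <lastmod>2021-06-01</lastmod>
  </sitemap>
</sitemapindex>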
On Mon, Jun 28, 2021 at 11:07 AM Ashwin Ramesh <ash...@canva.com.invalid> wrote:

> Currently we also have a 100M+ index across 8+ shards. We limit paging
> after 100K results too. There is no reason why a real user wants to go
> from page 1 to page 987. The only use case is web crawlers. How does
> everyone deal with this use case?
>
> On Mon, Jun 28, 2021 at 3:42 PM Ere Maijala <ere.maij...@helsinki.fi> wrote:
>
> I believe the answer is, as usual, "it depends". For instance, we used to
> have no issues with deep paging when we had a single-shard index. But as
> the index grew, we added more shards and ran into trouble when the start
> and rows params had no limits. The problem was that it took some time to
> identify what was causing the intermittent GC storms. Had there been a
> warning in the logs, we could have found it sooner. We currently limit
> paging to (start+rows) 100000 with no issues (5 shards, ~85 million
> records).
>
> Some defence against deep paging in Solr itself would be of great
> benefit. Maybe it could be tied by default to the result set handling in
> a sharded index so that when a case is identified where memory usage is
> going to blow up, it would cause at least a warning, but I'd like to be
> able to set up a hard limit that could never be overridden [1]. There's
> another Lucene-based search product that has an even stricter default
> limit of 10000. I think it's a good idea as it forces a conscious
> decision if the limit is raised.
>
> --Ere
>
> [1] Even if the client has a limit, without a Solr-level limit there's
> the possibility of a bug, or a developer accidentally running a query
> that brings the system down.
>
> Rahul Goswami wrote on 27.6.2021 at 6.08:
>
> This begs a question... For anyone who has been burnt by the deep
> pagination issue in the past, what is a reasonable value of the "start"
> param beyond which there is a noticeable performance degradation?
>
> Rahul
>
> On Fri, Jun 25, 2021 at 11:28 PM Walter Underwood <wun...@wunderwood.org> wrote:
>
> Cursors require keeping session state outside of Solr. With a million
> queries per hour and the middle tier spread across lots of containers,
> that isn't practical. Stateless searches are the default in Solr for a
> good reason.
>
> Using start and rows works great. The only issue is that Solr is
> defenseless against deep paging.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
> On Jun 25, 2021, at 8:09 PM, Dwane Hall <dwaneh...@hotmail.com> wrote:
>
> Ok, we lock down the rows and start params and then use cursors (which
> you don't want to use) for paging in increments of the page size. It
> works nicely for us but it sounds like it's not a workable solution for
> you.
>
> Thanks,
>
> Dwane
>
> From: Walter Underwood <wun...@wunderwood.org>
> Sent: Saturday, 26 June 2021 12:53 PM
> To: users@solr.apache.org
> Subject: Re: Defense against deep paging?
>
> The start parameter needs to be read from the request. That is how the
> client gets to the second page of results, by setting start=10 or
> start=20. The problem is when a bot sneaks through the checks and Solr
> gets start=3990000. A few of those will use all of heap and take down
> the server process.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
> On Jun 25, 2021, at 6:40 PM, Dwane Hall <dwaneh...@hotmail.com> wrote:
>
> Hey Walter,
>
> Can you set the value for start (0) and rows (your default sensible
> response row size) as an invariant in the request handler you're using,
> so it can't be overridden from a client request? That's how I've defended
> against it from Solr's perspective in the past. This can be hard coded in
> your request handler in the XML of your solr-config or using the
> parameters API. I've found it a simple but effective approach and there's
> an example here in the docs:
> https://solr.apache.org/guide/8_8/requesthandlers-and-searchcomponents-in-solrconfig.html#request-handlers
>
> Thanks,
>
> Dwane
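As an inline note for anyone finding this thread in the archives: a minimal sketch of the invariants approach Dwane describes above (the handler name and values are illustrative, not taken from his setup). Since invariants cannot be overridden per request, pinning start to 0 rules out start-based deep paging entirely; clients then page with cursorMark, and a softer cap (rather than a fixed value) still has to be enforced in a front end or proxy.

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
  </lst>
  <!-- Invariants silently replace whatever start/rows the client sends,
       so a request like start=3990000 never reaches the searcher. -->
  <lst name="invariants">
    <int name="start">0</int>
    <int name="rows">20</int>
  </lst>
</requestHandler>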
> From: Walter Underwood <wun...@wunderwood.org>
> Sent: Saturday, 26 June 2021 6:39 AM
> To: users@solr.apache.org
> Subject: Re: Defense against deep paging?
>
> Thanks, that is exactly the info I wanted! I've commented there, even
> though it is closed as Won't Do.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
> On Jun 25, 2021, at 12:46 PM, Mike Drob <md...@mdrob.com> wrote:
>
> This was discussed somewhat in
> https://issues.apache.org/jira/browse/SOLR-15252 with no implementation
> provided.
>
> On Fri, Jun 25, 2021 at 11:52 AM Walter Underwood <wun...@wunderwood.org> wrote:
>
> I already said that we have a limit in the client code. I'm asking about
> a limit in Solr.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
> On Jun 25, 2021, at 11:50 AM, Håvard Wahl Kongsgård
> <haavard.kongsga...@gmail.com> wrote:
>
> Just create a proxy client between the user and Solr. Set if page >= 500 … else.
>
> Simple stuff.
>
> On Fri, 25 Jun 2021 at 19:20, Walter Underwood <wun...@wunderwood.org> wrote:
>
> Has anyone implemented protection against deep paging inside Solr? I'm
> thinking about something like a max_rows parameter, where if start+rows
> was greater than that, it would limit the max result to that number. Or
> maybe just return a 400, that would be OK too.
>
> I've had three or four outages caused by deep paging over the past dozen
> years with Solr. We implement a limit in the client code, then someone
> forgets to add it to the redesigned client code. A limit in the request
> handler would be so much easier.
>
> And yes, I know about cursor marks. We don't want to enable deep paging,
> we want to stop it.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
> --
> Håvard Wahl Kongsgård
> Data Scientist
>
> --
> Ere Maijala
> Kansalliskirjasto / The National Library of Finland

--
Vincenzo D'Amore