Thanks Tom for sharing your experiences with all of us; they're really valuable in many aspects.

In the end the problem comes from the idea that every LDP-R holds its own context/graph. The LDP data access relies on Sesame, and probably moving it down to native SQL access (like we did for SPARQL) would allow us to deal more efficiently with such scenarios. As a less dramatic solution for now, in the past I've experimented with some Postgres planner parameters (enable_seqscan, enable_mergejoin or enable_hashjoin) with some good results.
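In case it's useful, those are session-level settings, so they can be tried on a single psql connection before touching postgresql.conf. A rough sketch only (the SELECT is Tom's problem query from the mail below, with placeholder literals standing in for real node ids):

    -- Steer the planner away from the strategies that are hurting,
    -- for the current session only.
    SET enable_seqscan = off;
    SET enable_hashjoin = off;
    -- (enable_mergejoin can be toggled the same way if needed)

    -- Re-run the suspect query and compare the chosen plan and timings.
    EXPLAIN ANALYZE
    SELECT id FROM triples
     WHERE subject = 1 AND predicate = 2 AND object = 3
       AND context = 4 AND deleted = false;

    -- Put the session back to the server defaults.
    RESET enable_seqscan;
    RESET enable_hashjoin;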
Hope that helps.

Cheers,

On Fri, Apr 8, 2016 at 12:00 AM, Tom Johnson <t...@dp.la> wrote:

> Reviving a somewhat dead thread with an update:
>
> Previously, we had run into problems with the PostgreSQL query planner and its use of generic query plans for prepared queries from Marmotta. We've been able to mitigate this in the past by:
>
> - Upgrading to PSQL 9.4, which is much less aggressive about using generic query plans; and
> - Adding an index that directly matches the order of the problem queries. For reasons unknown to us, PSQL seemed to prefer an `spoc` index to an otherwise identical `cspo` index.
>
> As our quad store has grown, we're now seeing this solution fail, and we are back to fighting the query planner.
>
> The crux of our problem is that we use Marmotta primarily as an LDP server; it now contains millions of individual LDP-NRs, and about twice as many LDP-RSs. The result is that we have approximately 10 million named graphs, each containing a small number of triples (varying from 3 to about 75), and a single LDP context with tens of millions of triples. When we do a PUT request, the LDP context is queried with a prepared query like:
>
> prepare problem_query as SELECT id FROM triples WHERE subject = $1 AND predicate = $2 AND object = $3 AND context = $4 AND deleted = false;
>
> The query planner has, at various points, chosen different indexes for this query. The behavior we are seeing now is that it uses a `cpo` index, then filters the subjects. Presumably, it believes that scanning the smaller index will save time, and in most cases between zero and three subjects need to be filtered out.
>
> When the LDP context is queried, however, the result is generally that at least 10 million subjects need to be filtered (this varies depending on which of the predicates from the LDP graph are queried). This makes the query extremely expensive: running time is counted in minutes.
>
> We've been able to persuade the query planner to use an `spo` index, which runs instantly, but only when we are not past the JDBC prepareThreshold. Our current thinking is to disable prepared statements altogether, but we expect to see a cost for this in other queries.
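For what it's worth, with the PostgreSQL JDBC driver that threshold can be set per connection through the connection URL, so it doesn't have to be changed in code. A sketch, with placeholder host and database names:

    jdbc:postgresql://localhost:5432/marmotta?prepareThreshold=0

With prepareThreshold=0 the driver should never promote a PreparedStatement to a named server-side statement, so every execution is planned against the actual bind values; the price is re-planning on each execution, which is the cost Tom anticipates.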
> Have you all seen this before? It looks to us like it should be reproducible for any large-scale LDP Marmotta service running on PostgreSQL, and we thought it would be important to share our experience so this can be patched on the Marmotta side.
>
> Thanks for your time, and for the work you all do.
>
> Best,
>
> Tom Johnson
> Metadata & Platform Architect
> Digital Public Library of America
>
> On Fri, Nov 6, 2015 at 3:17 AM, Sergio Fernández <wik...@apache.org> wrote:
>
>> Hi Tom,
>>
>> On Thu, Nov 5, 2015 at 6:12 PM, Tom Johnson <t...@dp.la> wrote:
>>>
>>> We can confirm that upgrading to PSQL 9.2+ (we are using 9.4) has significant performance benefits on LDP requests. GET requests that had been taking between 10 and 50 seconds are now observed at around 1/10 of a second, with much of that time apparently spent on the Marmotta side rather than in PSQL.
>>
>> PostgreSQL 9.4 has definitely brought enough improvements that we strongly recommend it for Marmotta.
>>
>>> After the upgrade, we saw vastly improved performance for a time, but found that under a certain load similar query-planning problems resurfaced. Perhaps oddly, we were able to stem this by creating a new index on the triples table, with a different field order from the existing four-column index. The new index (on `spoc` rather than `cspo`) is used more readily in the generic plan, and we've not seen slow queries (>200ms) logged on LDP requests since introducing it.
>>
>> That is really good news!
>>
>> There is probably still room to improve the current implementation. But also take into account that the LDP module currently uses Sesame-based persistence with KiWi. Following the experience of providing a native SQL SPARQL evaluation strategy, we could try something similar to improve LDP performance even further. But I'd prefer to keep that plan aside if possible.
>>
>> Cheers,
>>
>> --
>> Sergio Fernández
>> Partner Technology Manager
>> Redlink GmbH
>> m: +43 6602747925
>> e: sergio.fernan...@redlink.co
>> w: http://redlink.co

--
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 6602747925
e: sergio.fernan...@redlink.co
w: http://redlink.co