Thanks, Tom, for sharing your experiences with all of us; they're really
valuable in many respects.

In the end the problem comes from the design decision that every LDP-R holds
its own context/graph. The LDP data access relies on Sesame, and moving it
down to native SQL access (as we did for SPARQL) would probably allow us to
deal with such scenarios more efficiently.

As a less drastic solution for now: in the past I've experimented with some
Postgres planner parameters (enable_seqscan, enable_mergejoin or
enable_hashjoin) with good results.
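
For example, something along these lines (just a minimal sketch; which of
these settings actually helps, and whether to apply them per session or per
query, depends on the workload, so treat it as a starting point for
experimentation):

    -- discourage specific plan choices for the current session only;
    -- these settings are planner hints, not hard guarantees
    SET enable_seqscan = off;
    SET enable_mergejoin = off;
    SET enable_hashjoin = off;

    -- ... run and EXPLAIN the problematic query here ...

    -- restore the defaults afterwards
    RESET enable_seqscan;
    RESET enable_mergejoin;
    RESET enable_hashjoin;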

Hope that helps.

Cheers,


On Fri, Apr 8, 2016 at 12:00 AM, Tom Johnson <t...@dp.la> wrote:

> Reviving a somewhat dead thread with an update:
>
> Previously, we had run into problems with the PostgreSQL query planner and
> its use of generic query plans for prepared queries from Marmotta. We've
> been able to mitigate this in the past by:
>
>   - Upgrading to PSQL 9.4, which is much less aggressive about using
> generic query plans; and
>   - Adding an index that directly matches the order of the problem
> queries. For reasons unknown to us, PSQL seemed to prefer an `spoc` index
> to an otherwise identical `cspo` index.
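>
> For illustration, a sketch of the kind of index we mean (the index name is
> made up and the exact definition may differ; the column names are taken
> from the prepared query shown further down):
>
>     -- four-column index in subject, predicate, object, context order;
>     -- a partial index with "WHERE deleted = false" could also be considered
>     CREATE INDEX idx_triples_spoc
>         ON triples (subject, predicate, object, context);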
>
> As our quad store has grown, we're now seeing this solution fail, and are
> back to fighting against the query planner.
>
> The crux of our problem is that we use Marmotta primarily as an LDP
> server; this server now contains millions of individual LDP-NRs, and
> about twice as many LDP-RSs. The result is that we have approximately 10
> million named graphs, each containing a small number of triples (varying
> from 3 to about 75), and a single LDP context with tens of millions of
> triples. When we do a PUT request, the LDP context is queried with a
> prepared query like:
>
>    PREPARE problem_query AS SELECT id FROM triples WHERE subject = $1 AND
>    predicate = $2 AND object = $3 AND context = $4 AND deleted = false;
>
> The query planner has, at various points, chosen different indexes to use
> to carry out this query. The behavior we are seeing now is that it uses a
> `cpo` index, then filters the subjects. Presumably, it believes that
> scanning the smaller index will save it time, and in most cases, between
> zero and three subjects will need to be filtered.
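>
> (For anyone trying to reproduce this, the chosen plan can be inspected with
> something like the statement below once the query has been prepared; the
> literal values are placeholders, not real node IDs, and on 9.4 the generic
> plan typically only kicks in after a handful of executions.)
>
>     -- show which index the planner picks for the prepared statement
>     EXPLAIN EXECUTE problem_query(1, 2, 3, 4);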
>
> When the LDP context is used, however, the result is generally that at
> least 10 million subjects need to be filtered (this varies depending on
> which of the predicates from the LDP graph are queried). This process makes
> the query extremely expensive. Running time is counted in minutes.
>
> We've been able to persuade the query planner to use an `spo` index, which
> runs instantly, but only when we are not past the JDBC prepareThreshold.
> Our current thinking is to disable prepared statements altogether, but we
> expect to see a cost for this in other queries.
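>
> (Concretely, the knob we mean is pgJDBC's prepareThreshold connection
> parameter; as far as we understand, setting it to 0 keeps the driver from
> ever switching to server-side prepared statements. The host and database
> name below are just placeholders:)
>
>     jdbc:postgresql://db.example.org:5432/marmotta?prepareThreshold=0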
>
> Have you all seen this before? It looks to us like it should be repeatable
> for any large-scale LDP Marmotta service running on PostgreSQL, and we
> thought it would be important to share our experience so this can be
> patched on the Marmotta side.
>
> Thanks for your time, and for the work you all do.
>
> Best,
>
> Tom Johnson
> Metadata & Platform Architect
> Digital Public Library of America
>
> On Fri, Nov 6, 2015 at 3:17 AM, Sergio Fernández <wik...@apache.org>
> wrote:
>
>> Hi Tom,
>>
>> On Thu, Nov 5, 2015 at 6:12 PM, Tom Johnson <t...@dp.la> wrote:
>>>
>>> We can confirm that upgrading to PSQL 9.2+ (we are using 9.4) has
>>> significant performance benefits on LDP requests. GET requests that had
>>> been taking between 10 and 50 seconds are now observed at around 1/10 of a
>>> second, with much of that time apparently spent on the Marmotta side rather
>>> than in PSQL.
>>>
>>
>> PostgreSQL 9.4 has definitely brought significant improvements, so we
>> strongly recommend it for Marmotta.
>>
>> After the upgrade, we saw vastly improved performance for a time, but
>>> found that under a certain load similar query planning problems resurfaced.
>>> Perhaps oddly, we were able to stem this by creating a new index on the
>>> triples table, with a different field order from the existing four-column
>>> index. The new index (on `spoc` rather than `cspo`) is used more readily in
>>> the generic plan, and we've not seen slow queries (> 200ms) logged on LDP
>>> requests since introducing it.
>>>
>>
>> That is really good news!
>>
>> There is probably still room to improve the current implementation. But
>> also take into account that the LDP module currently uses Sesame-based
>> persistence with KiWi. Following the experience of providing a native SQL
>> SPARQL evaluation strategy, we could try something similar to further
>> improve LDP performance. But I'd prefer to keep that plan aside if
>> possible.
>>
>> Cheers,
>>
>> --
>> Sergio Fernández
>> Partner Technology Manager
>> Redlink GmbH
>> m: +43 6602747925
>> e: sergio.fernan...@redlink.co
>> w: http://redlink.co
>>
>
>


-- 
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 6602747925
e: sergio.fernan...@redlink.co
w: http://redlink.co
