Hello, Marmotta Users,

At the Digital Public Library of America, we have a large Marmotta
triplestore, with which we interact entirely over LDP.

We're looking for some advice about scaling Marmotta's LDP interface past
our current size. In the short term, we are hoping that we can find ways to
tune PostgreSQL to mitigate some problems we have seen; in the long term,
we are open to advice about alternate backends.

A high-level overview of how we interact with our LDP Resources is
documented in [1].  While we have had to do some LDP-specific tuning
(especially introducing a partial index on `triples.context`) for all
processes, we have seen particular trouble in cases where we GET,
transform, then PUT an LDP RDFSource (see: *Enrichment* in the overview
link).
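
For readers unfamiliar with the pattern, the GET, transform, PUT cycle
mentioned above can be sketched roughly as follows. This is only an
illustration in Python, not the application's actual code (the real requests
are made by the Rails engine linked further down). The function names, the
Turtle content type, and the conditional PUT with `If-Match` are all
assumptions made for the example.

```python
import urllib.request


def get_rdf_source(url, opener=urllib.request.urlopen):
    """GET an LDP RDFSource; return (body bytes, ETag header or None)."""
    # Assumption: the server can serialize the resource as Turtle.
    req = urllib.request.Request(url, headers={"Accept": "text/turtle"})
    with opener(req) as resp:
        return resp.read(), resp.headers.get("ETag")


def put_rdf_source(url, body, etag, opener=urllib.request.urlopen):
    """PUT a replacement representation; return the HTTP status code."""
    headers = {"Content-Type": "text/turtle"}
    if etag:
        # Assumption: make the PUT conditional on the ETag we read earlier.
        headers["If-Match"] = etag
    req = urllib.request.Request(url, data=body, headers=headers, method="PUT")
    with opener(req) as resp:
        return resp.status


def enrich(url, transform, opener=urllib.request.urlopen):
    """One GET, transform, PUT cycle for a single resource."""
    body, etag = get_rdf_source(url, opener)
    return put_rdf_source(url, transform(body), etag, opener)
```

Here `transform` is a hypothetical callable that takes the fetched bytes and
returns the new representation; each mapping or enrichment activity repeats
this cycle once per resource.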

That overview is part of a greater wiki that we've put together to document
our installation and performance-tuning activities [2].

Our biggest problem at the moment is slow updates and inserts [3], which we
observe when we GET and PUT those RDFSources with two concurrent mapping or
enrichment activities. If we run just one of these activities, GETting,
transforming, and PUTting serially, performance seems to be network and CPU
bound and is acceptable. But as soon as we start a second mapping or
enrichment, throughput grinds practically to a halt, as described in [3].
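
One way to put numbers on the one-activity versus two-activity comparison is
a small timing harness like the sketch below. It is an illustration only:
`process_one` is a hypothetical callable that performs one GET, transform,
PUT cycle for a URL, and using threads in a single process is an assumption
that merely approximates two separately launched activities.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_throughput(urls, process_one, workers):
    """Run process_one(url) over urls using `workers` threads.

    Returns (number of resources processed, resources per second).
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Consuming pool.map waits for every call and re-raises any failure.
        done = sum(1 for _ in pool.map(process_one, urls))
    # Guard against a zero interval on very small inputs.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return done, done / elapsed
```

Running this over the same sample of resources with `workers=1` and then
`workers=2` gives two rates to compare; a rate that drops sharply with the
second worker, instead of rising, would match the behavior described above.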

To give you a sense of the scale at which we're operating, we have about
two million LDP-RSs, each typically containing about 50 triples and a
handful of blank nodes (around 5 to 15). Our `triples` table now has about
294M rows; the table itself takes up 32GB, and its two largest indices take
up 13GB each. Our entire Marmotta database takes up about 140GB. We've had
some success improving index performance with low cardinality in
`triples.context` [4] and tuning the Amazon EC2 instances that we run on
[5][6]. The I/O wait problem with concurrent LDP operations, however, is
our new blocker.

Some supplemental information:

* An overview of the project for which Marmotta is being used:
  https://digitalpubliclibraryofamerica.atlassian.net/wiki/display/TECH/Heidrun

* The application (a Rails engine) that makes all of these LDP requests:
  https://github.com/dpla/KriKri

* Our configuration-management project, with details on how some of our
  stack is configured: https://github.com/dpla/automation

We'd be grateful for any feedback that you might have that would assist us
with handling large volumes of data over LDP. Thanks for your help!

- Mark Breedlove and Tom Johnson,
  Digital Public Library of America (http://dp.la/)
  t...@dp.la


[1] https://digitalpubliclibraryofamerica.atlassian.net/wiki/display/TECH/LDP+Interactions+Overview
[2] https://digitalpubliclibraryofamerica.atlassian.net/wiki/display/TECH/Marmotta
[3] https://digitalpubliclibraryofamerica.atlassian.net/wiki/display/TECH/Addressing+slow+updates+and+inserts
[4] https://digitalpubliclibraryofamerica.atlassian.net/wiki/display/TECH/Index+performance+with+high+context+counts
[5] https://digitalpubliclibraryofamerica.atlassian.net/wiki/display/TECH/Amazon+EC2+adjustments
[6] https://digitalpubliclibraryofamerica.atlassian.net/wiki/display/TECH/Using+irqbalance+and+SMP+IRQ+affinity
