Re: Scaling Marmotta's LDP interface

Mark A. Matienzo Sat, 19 Sep 2015 12:03:04 -0700

Hi Raffaele,

I'm writing on behalf of Mark and Tom, who are currently in transit to a
conference. To be clear, we're not immediately looking to use a different
backend, although we may consider that at a later date. As Mark and Tom
noted above, we're having some serious performance issues in our production
environment, which runs Postgres, and this is where we're seeking advice on
improving that performance.


Our immediate concern is the notably poor performance of the LDP
implementation when we have more than one job interacting with it. The
Marmotta documentation strongly recommends PostgreSQL in several places
[A], [B], [C]. In particular, the wiki page on performance tuning [C] seems
to suggest that an "adequately-tuned" Postgres server will give us
reasonable performance.

Thanks again, and please feel free to follow up with us if you have
additional questions.

[A] http://marmotta.apache.org/configuration.html
[B] http://marmotta.apache.org/kiwi/triplestore.html
[C] http://wiki.apache.org/marmotta/PerformanceTuning


Best,

Mark A. Matienzo, Director of Technology
Digital Public Library of America | http://dp.la
m...@dp.la

On Sat, Sep 19, 2015 at 12:27 PM, Raffaele Palmieri <
raffaele.palmi...@gmail.com> wrote:

> Hello Mark and Tom, in the past some big data backends were implemented
> but not distributed in an official release, such as Big Data Triple Store, now
> Blazegraph [1], Titan [2], or Accumulo Graph [3]. Personally I tested this
> last integration option, and it worked fine, although I didn't run any
> benchmarks.
> To build Marmotta with a different backend, you need to build it with the
> corresponding profile in the Maven configuration and set the specific properties.
> Hoping that I've given you useful information,
> regards,
> Raffaele.
>
> [1] https://www.blazegraph.com/
> [2] http://thinkaurelius.github.io/titan/
> [3] https://github.com/JHUAPL/AccumuloGraph
>
>
>
>
> 2015-09-18 21:37 GMT+02:00 Mark Breedlove <m...@dp.la>:
>
>>
>> Hello, Marmotta Users,
>>
>> At the Digital Public Library of America, we have a large Marmotta
>> triplestore, with which we interact entirely over LDP.
>>
>> We're looking for some advice about scaling Marmotta's LDP interface past
>> our current size. In the short term, we are hoping that we can find ways to
>> tune PostgreSQL to mitigate some problems we have seen; in the long term,
>> we are open to advice about alternate backends.
>>
>> A high-level overview of how we interact with our LDP Resources is
>> documented in [1].  While we have had to do some LDP-specific tuning
>> (especially introducing a partial index on `triples.context`) for all
>> processes, we have seen particular trouble in cases where we GET,
>> transform, then PUT an LDP RDFSource (see: *Enrichment* in the overview
>> link).
>>
>> That overview is part of a greater wiki that we've put together to
>> document our installation and performance-tuning activities [2].
>>
>> Our biggest problem at the moment is addressing slow updates and inserts
>> [3], observed when we GET and PUT those RDFSources with two concurrent
>> mapping or enrichment activities. If we run one of these activities,
>> GETting, transforming, and PUTting in serial, performance seems to be network
>> and CPU bound, and is not very bad. But as soon as we run a second mapping
>> or enrichment, work performed grinds practically to a halt, as described in
>> [3].
>>
>> To give you a sense of the scale at which we're operating, we have about
>> two million LDP-RSs, typically including about 50 triples and a handful of
>> blank nodes (around 5 to 15). Our `triples` table has about 294M rows now
>> and takes up 32GB for the table, and 13GB each for its two largest indices.
>> Our entire Marmotta database takes up about 140GB. We've had some successes
>> with improving index performance with low cardinality in `triples.context`
>> [4] and tuning the Amazon EC2 instances that we run on [5][6]. The I/O wait
>> problem with concurrent LDP operations, however, is the new blocker.
>>
>> Some supplemental information:
>>
>> * An overview of the project for which Marmotta is being used:
>>   https://digitalpubliclibraryofamerica.atlassian.net/wiki/display/TECH/Heidrun
>>
>> * The application (a Rails engine) that makes all of these LDP requests:
>>   https://github.com/dpla/KriKri
>>
>> * Our configuration-management project, with details on how some of our
>>   stack is configured: https://github.com/dpla/automation
>>
>> We'd be grateful for any feedback that you might have that would assist
>> us with handling large volumes of data over LDP. Thanks for your help!
>>
>> - Mark Breedlove and Tom Johnson,
>>   Digital Public Library of America (http://dp.la/)
>>   t...@dp.la
>>
>>
>> [1]
>> https://digitalpubliclibraryofamerica.atlassian.net/wiki/display/TECH/LDP+Interactions+Overview
>> [2]
>> https://digitalpubliclibraryofamerica.atlassian.net/wiki/display/TECH/Marmotta
>> [3]
>> https://digitalpubliclibraryofamerica.atlassian.net/wiki/display/TECH/Addressing+slow+updates+and+inserts
>> [4]
>> https://digitalpubliclibraryofamerica.atlassian.net/wiki/display/TECH/Index+performance+with+high+context+counts
>> [5]
>> https://digitalpubliclibraryofamerica.atlassian.net/wiki/display/TECH/Amazon+EC2+adjustments
>> [6]
>> https://digitalpubliclibraryofamerica.atlassian.net/wiki/display/TECH/Using+irqbalance+and+SMP+IRQ+affinity
>>
>
>
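
For readers following the thread, the GET, transform, PUT cycle described in
the original message can be sketched roughly as below. This is only an
illustration under stated assumptions, not the code KriKri actually runs: the
function names, the `transform` callable, and the `opener` parameter are
hypothetical, the resource is assumed to be served as Turtle, and the
conditional PUT assumes the server returns an ETag on GET.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def get_transform_put(url, transform, opener=urllib.request.urlopen):
    """GET an LDP-RS as Turtle, transform it, and PUT it back.

    `transform` is a hypothetical callable (bytes -> bytes) standing in
    for a mapping or enrichment step. Returns the HTTP status of the PUT.
    """
    get_req = urllib.request.Request(url, headers={"Accept": "text/turtle"})
    with opener(get_req) as resp:
        body = resp.read()
        etag = resp.headers.get("ETag")

    headers = {"Content-Type": "text/turtle"}
    if etag:
        # Conditional PUT: lets the server reject the update if the
        # resource changed between our GET and our PUT.
        headers["If-Match"] = etag
    put_req = urllib.request.Request(
        url, data=transform(body), headers=headers, method="PUT"
    )
    with opener(put_req) as resp:
        return resp.status


def run_activity(urls, transform, workers=1, opener=urllib.request.urlopen):
    """Run the cycle over `urls`; workers=1 is the serial case."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(
            pool.map(lambda u: get_transform_put(u, transform, opener), urls)
        )
```

Timing `run_activity` with `workers=1` and then `workers=2` against a test
instance would be one way to compare the serial and concurrent cases that the
original message describes.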
