Hello Mark and Tom,

in the past some big data backends were implemented but never shipped in an official release, such as the Bigdata triple store (now Blazegraph) [1], Titan [2] and AccumuloGraph [3]. Personally I tested this last integration option and it worked fine, although I didn't run any benchmarks.

To build Marmotta with a different backend, you need to build it with the corresponding Maven profile and set the backend-specific properties (see the rough sketch below).

Hoping that I've given you useful information,
regards,
Raffaele.
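P.S. a minimal, untested sketch of what I mean; the profile name and property here are only placeholders, not the real identifiers, so please check the backend module's pom.xml and documentation for the actual ones:

    # build Marmotta activating the backend-specific Maven profile,
    # overriding that backend's configuration properties on the command line
    mvn clean install -P <backend-profile> -D<backend.property>=<value>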
[1] https://www.blazegraph.com/
[2] http://thinkaurelius.github.io/titan/
[3] https://github.com/JHUAPL/AccumuloGraph

2015-09-18 21:37 GMT+02:00 Mark Breedlove <m...@dp.la>:
>
> Hello, Marmotta Users,
>
> At the Digital Public Library of America, we have a large Marmotta
> triplestore, with which we interact entirely over LDP.
>
> We're looking for some advice about scaling Marmotta's LDP interface past
> our current size. In the short term, we are hoping that we can find ways to
> tune PostgreSQL to mitigate some problems we have seen; in the long term,
> we are open to advice about alternate backends.
>
> A high-level overview of how we interact with our LDP Resources is
> documented in [1]. While we have had to do some LDP-specific tuning
> (especially introducing a partial index on `triples.context`) for all
> processes, we have seen particular trouble in cases where we GET,
> transform, then PUT an LDP RDFSource (see: *Enrichment* in the overview
> link).
>
> That overview is part of a greater wiki that we've put together to
> document our installation and performance-tuning activities [2].
>
> Our biggest problem at the moment is addressing slow updates and inserts
> [3], observed when we GET and PUT those RDFSources with two concurrent
> mapping or enrichment activities. If we run one of these activities,
> GETing, transforming, and PUTing in serial, performance seems to be network
> and CPU bound, and is not very bad. But as soon as we run a second mapping
> or enrichment, work performed grinds practically to a halt, as described in
> [3].
>
> To give you a sense of the scale at which we're operating, we have about
> two million LDP-RSs, typically including about 50 triples and a handful of
> blank nodes (around 5 to 15). Our `triples` table has about 294M rows now
> and takes up 32GB for the table, and 13GB each for its two largest indices.
> Our entire Marmotta database takes up about 140GB. We've had some successes
> with improving index performance with low cardinality in `triples.context`
> [4] and tuning the Amazon EC2 instances that we run on [5][6]. The I/O wait
> problem with concurrent LDP operations, however, is the new blocker.
>
> Some supplemental information:
>
> * An overview of the project for which Marmotta is being used:
> https://digitalpubliclibraryofamerica.atlassian.net/wiki/display/TECH/Heidrun
>
> * The application (a Rails engine) that makes all of these LDP requests:
> https://github.com/dpla/KriKri
>
> * Our configuration-management project, with details on how some of our
> stack is configured: https://github.com/dpla/automation
>
> We'd be grateful for any feedback that you might have that would assist us
> with handling large volumes of data over LDP. Thanks for your help!
>
> - Mark Breedlove and Tom Johnson,
> Digital Public Library of America (http://dp.la/)
> t...@dp.la
>
> [1] https://digitalpubliclibraryofamerica.atlassian.net/wiki/display/TECH/LDP+Interactions+Overview
> [2] https://digitalpubliclibraryofamerica.atlassian.net/wiki/display/TECH/Marmotta
> [3] https://digitalpubliclibraryofamerica.atlassian.net/wiki/display/TECH/Addressing+slow+updates+and+inserts
> [4] https://digitalpubliclibraryofamerica.atlassian.net/wiki/display/TECH/Index+performance+with+high+context+counts
> [5] https://digitalpubliclibraryofamerica.atlassian.net/wiki/display/TECH/Amazon+EC2+adjustments
> [6] https://digitalpubliclibraryofamerica.atlassian.net/wiki/display/TECH/Using+irqbalance+and+SMP+IRQ+affinity