Hi Mario,

On 04/03/15 15:25, Mario Valle wrote:
I just started evaluating Marmotta. I installed it on Linux 3.3.0
selecting KiWi as the storage backend.

As a first realistic test I tried to load the entire ChEMBL 20.0 dataset
(ftp.ebi.ac.uk/pub/databases/chembl/ChEMBL-RDF/latest). This dataset is
in turtle format.

But loading through the HTTP interface (only one file at a time):

curl -sfS -X POST -H "Content-Type: text/turtle; charset=utf-8" -d @file.ttl http://localhost:8080/marmotta/import/upload

fails for the big files included in the dataset (for example
chembl_20.0_activity.ttl is 10.8 GB)

The error in marmotta.trace.db is:

03-03 09:50:45 jdbc[11]: exception
org.h2.jdbc.JdbcSQLException: Timeout trying to lock table ; SQL
statement: INSERT INTO nodes (id,ntype,svalue,createdAt) VALUES
(?,'uri',?,?) [50200-178]

That stack trace reveals that you're using H2 as the database for KiWi. H2 is only meant for quick installation and demo purposes, not for loading 10 GB datasets.

Please switch to PostgreSQL for a more realistic evaluation. Here is some additional information you might find relevant:

http://marmotta.apache.org/configuration#db
http://wiki.apache.org/marmotta/PerformanceTuning#Using_PostgreSQL
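
In case it helps, here is a minimal sketch of preparing an empty PostgreSQL database for such an evaluation. The database name, role name and password are placeholders I made up, and the sketch only covers the PostgreSQL side; pointing Marmotta at the new database is described in the pages linked above.

```shell
#!/bin/sh
# pg_setup_sql DB USER PASSWORD
# Prints the SQL that creates a login role and a database owned by it.
# All three values are hypothetical placeholders, not Marmotta defaults.
pg_setup_sql() {
    cat <<EOF
CREATE ROLE "$2" WITH LOGIN PASSWORD '$3';
CREATE DATABASE "$1" OWNER "$2";
EOF
}

# Usage (assumes a local PostgreSQL superuser called "postgres"):
#   pg_setup_sql marmotta marmotta secret | sudo -u postgres psql
```

Printing the statements rather than running them directly makes it easy to review them before piping them into psql.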

The very labor-intensive workaround I found is:

1) convert file.ttl to file.nt (n-triples)
     using riot
2) split the resulting file into chunks:
     split -a 4 -l 20000 file.nt
3) convert each chunk to RDF/XML
     using rdf2rdf http://www.l3s.de/~minack/rdf2rdf/
4) pass each rdf file to Marmotta using
     curl -sfS -X POST -H "Content-Type: application/rdf+xml; charset=utf-8" -d @$i http://localhost:8080/marmotta/import/upload
5) wait a huge amount of time...

That's weird... for whatever reason the RDF/XML parser introduces less parallelism than the Turtle one, so the H2 lock is not causing the same issue.
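
If you want to stay with chunked uploads for now, the split and upload steps can be scripted. The following is only a rough sketch under some assumptions: since N-Triples is a syntactic subset of Turtle, it posts the chunks as text/turtle and skips the RDF/XML conversion, and I have not tested whether small Turtle chunks avoid the H2 lock timeout. The function names and the chunk_ prefix are made up for the example. It uses --data-binary instead of -d because curl's -d strips newlines from the file, which breaks line-based N-Triples.

```shell
#!/bin/sh
# Sketch: split an N-Triples file into chunks and POST each chunk to Marmotta.
# ENDPOINT is the upload URL from the thread; CHUNK_LINES matches the
# "split -l 20000" used there. Function names are hypothetical.
ENDPOINT="${ENDPOINT:-http://localhost:8080/marmotta/import/upload}"
CHUNK_LINES="${CHUNK_LINES:-20000}"

# split_nt FILE DIR: write DIR/chunk_aaaa, DIR/chunk_aaab, ... with at most
# CHUNK_LINES lines each. N-Triples has one triple per line, so cutting at
# line boundaries keeps every chunk a valid document.
split_nt() {
    mkdir -p "$2"
    split -a 4 -l "$CHUNK_LINES" "$1" "$2/chunk_"
}

# upload_chunk FILE: POST one chunk. Assumption: chunks are sent as Turtle.
# --data-binary keeps the newlines that -d would strip.
upload_chunk() {
    curl -sfS -X POST -H "Content-Type: text/turtle; charset=utf-8" \
        --data-binary "@$1" "$ENDPOINT"
}

# upload_all DIR: upload the chunks in name order, stopping at the first failure.
upload_all() {
    for chunk in "$1"/chunk_*; do
        upload_chunk "$chunk" || { echo "failed: $chunk" >&2; return 1; }
        echo "uploaded: $chunk"
    done
}

# Usage:
#   split_nt file.nt chunks && upload_all chunks
```

Stopping at the first failed chunk makes it possible to resume from that file instead of re-importing everything.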

Any better idea for bulk loading?
BTW I don't understand where the KiWiLoader mentioned in the wiki is.

Exactly, KiWiLoader is the most performant path we currently have for bulk imports. You can find some documentation at:

http://wiki.apache.org/marmotta/ImportData#Import_data_directly_to_the_KiWi_triple_store

I guess the runnable jar is not part of the binary distribution, but you can easily build it from the source release or get the binary artifact from Maven Central:

http://search.maven.org/#artifactdetails%7Corg.apache.marmotta%7Ckiwi-loader%7C3.3.0%7Cjar

Depending on the machine (primarily the hard disk) you can get an average of around 12,000 triples imported per second.
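
To get a feeling for what that rate means for a given dataset, here is a small back-of-the-envelope sketch. The helper names are made up, the 12,000 triples per second default is just the average mentioned above, and the triple count is approximated as the number of non-empty, non-comment lines of an N-Triples file.

```shell
#!/bin/sh
# count_triples FILE: approximate number of triples in an N-Triples file
# (one triple per line; blank lines and "#" comment lines are skipped).
count_triples() {
    grep -c -v -E '^[[:space:]]*(#|$)' "$1"
}

# estimate_hours TRIPLES [RATE]: estimated import time in hours at RATE
# triples per second (default 12000, the rough average quoted above).
estimate_hours() {
    awk -v n="$1" -v r="${2:-12000}" 'BEGIN { printf "%.1f\n", n / r / 3600 }'
}

# Usage:
#   estimate_hours "$(count_triples file.nt)"
#   estimate_hours 100000000    # 100 million triples, prints 2.3
```

The real rate depends heavily on the disk and the database tuning, so treat the result as an order of magnitude only.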

Hope that helps.

Cheers,

--
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 660 2747 925
e: sergio.fernan...@redlink.co
w: http://redlink.co
