Hi Joshua,

there is not exactly a howto, but the main steps would be:

1) set up the backend server that is going to be used by Titan (e.g. HBase)
2) use the MarmottaLoader (single jar) for the respective backend (e.g. marmotta-loader-hbase) to load your triple data into it in bulk mode
3) copy the launchers/marmotta-webapp folder to a separate project directory and update the pom.xml so that instead of marmotta-backend-kiwi it uses marmotta-backend-titan; remove all other dependencies on KiWi modules (versioning, reasoning)
4) run mvn clean tomcat7:run in that directory
5) go to the configuration and set properties according to the Titan documentation; if a property in Titan is called FOO, the configuration name in Marmotta is titan.FOO
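For step 3, the pom.xml change might look roughly like this; the two backend artifactIds are the ones named above, but the list of KiWi-only modules to remove is from memory, so double-check it against your checkout:

```xml
<!-- in your copy of launchers/marmotta-webapp/pom.xml -->

<!-- remove the KiWi backend and any KiWi-only modules
     (versioning, reasoning), e.g.: -->
<!--
<dependency>
    <groupId>org.apache.marmotta</groupId>
    <artifactId>marmotta-backend-kiwi</artifactId>
</dependency>
-->

<!-- ... and use the Titan backend instead -->
<dependency>
    <groupId>org.apache.marmotta</groupId>
    <artifactId>marmotta-backend-titan</artifactId>
    <version>${marmotta.version}</version>
</dependency>
```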
In step 5 you would e.g. switch from the BerkeleyDB Titan backend used by default to HBase. Follow the documentation on the Titan web page from there.

Greetings,

Sebastian

2014-11-06 22:44 GMT+01:00 Joshua Dunham <joshua_dun...@vrtx.com>:

> Hi Sebastian, Marmotta Users,
>
> Reviving this thread a little bit, is there a howto for trying out TitanDB? I'm interested in titan/cassandra doing some of the heavy lifting for the backend work. Titan especially has some **very** interesting features (TinkerPop apps and native hadoop2 support).
>
> I could be a guinea pig in this regard. :)
>
> -J
>
> Joshua Dunham
> Exploratory App Development | Vertex
> E: joshua_dun...@vrtx.com
> P: 617-229-5157
> W: http://www.vrtx.com
> L: http://linkedin.com/in/joshuadunham
> S: joshua.dunham
>
> From: Sebastian Schaffert [sebastian.schaff...@gmail.com]
> Sent: Tuesday, October 07, 2014 06:45
> To: users@marmotta.apache.org
> Subject: Re: Zookeeper, Giraph, and WebProtege
>
> Hi Joshua,
>
> 2014-10-06 22:42 GMT+02:00 Joshua Dunham <joshua_dun...@vrtx.com>:
>
> > Hi Sebastian,
> >
> > Thank you for chiming in. Some clarifying questions below.
>
> Trying to answer :-)
>
> >>> 2. Which is the preferable backend? I'm not familiar with the pros/cons of the options but I think looking around at some docs that Hazelcast is a 'safe' good bet?
>
> >> We currently support Guava and Ehcache for local caches, Hazelcast, and Infinispan for clusters. AFAIK currently Hazelcast is the most stable and tested one, and it's currently used in production.
>
> Guava for single-instance setups, otherwise Hazelcast. The other backends are more experimental. Infinispan is powerful in large setups, because it also supports dedicated cluster servers (HotRod Server), but this has not been tested extensively and is significantly more complex and has more overhead. EHCache has a bit more intelligent memory management (i.e.
> it expires cached objects based on the memory they occupy, while all other backends simply take object counts, so when you have many large objects you might run into out-of-memory situations), but otherwise introduces more overhead than Guava.
>
> > So to get started I would make two instances of tomcat and two instances of marmotta (ideally separate hardware). I would configure the cluster settings on both as such,
> >
> > clustering.address = 226.6.7.8
> > clusting.backend = hazlecast
> > clustering.enabled = on
> > clustering.mode = replicated
> > clustering.name = marmotta
> > clustering.port = 46655
>
> There is a typo there ("hazelcast"). Also you should change clustering.mode to "distributed", because Hazelcast does not support "replicated". Not a problem if you forget it, but you will get a warning in the log. ;-)
>
> > while making sure that .name and .port are not in use. I would then connect them to the same database backend from the default H2. (I need to use mySQL, we predominantly use Oracle, not pSQL).
>
> You would then connect to the same database server (not sure if you referred to this when saying "backend"). More elegant is to set the database.url in the Zookeeper tree and let Marmotta retrieve it from there.
>
> I am not sure about MySQL (we are not really working much with it), but you are welcome to try (and maybe tell us your experiences).
>
> > - Or would I connect mySQL db first and then do the cluster config?
>
> The order doesn't really matter as long as no users are really accessing the system in parallel.
>
> > - If I installed and configured one instance to use clustering first, configured it to use zookeeper, configured the backend DB; then installed a new instance of marmotta and hooked that to zookeeper, is magic going to happen and it will sort out all the settings for each?
>
> That's the plan (this is how we use it).
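For reference, the quoted cluster section with both corrections applied (backend spelling fixed and mode switched to distributed) would read:

```
clustering.enabled = on
clustering.backend = hazelcast
clustering.mode = distributed
clustering.name = marmotta
clustering.address = 226.6.7.8
clustering.port = 46655
```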
> ;-)
>
> Unfortunately, all configuration you store via the user interface will only be stored in the local configuration file, because Marmotta does not know whether your configuration setting should affect all Marmotta instances, those in a cluster, or only the single instance you are configuring.
>
> For this to work you need to make sure you understand the way configuration is stored in Zookeeper. Look at http://marmotta.apache.org/platform/cloud.html; the Zookeeper tree contains three levels of Marmotta configurations:
> - global level contains configuration applied to all Marmotta instances using this Zookeeper server
> - cluster level contains configuration applied to all Marmotta instances in a named cluster (Servlet context init parameter zookeeper.cluster), e.g. for the database URL
> - instance level contains configuration applied to a single Marmotta instance (e.g. for turning on logging or such things)
>
> If you are interested I can send you a sample dump of a Zookeeper tree we are using.
>
> > If the 'master' is data.example.com and the 'replicant' is data-02.example.com, I still make my rdf with an IRI of the master? Will the replicant have a different IRI and thus I would need to reference a different resource at that endpoint? Or do I 'cheat' and put the same IRI in the system-config.properties
>
> You would put a load balancer (e.g. standard Apache HTTPD) in front of your two tomcats and configure the Marmottas to use the same IRI for resources (which has to match the IRI of your load balancer). You can do this by manually setting the configuration variables kiwi.context (sets the prefix for constructing URIs) and kiwi.host (sets the prefix for accessing the admin UI).
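As a sketch of that load-balancer setup, assuming Apache HTTPD with mod_proxy_balancer in front of Tomcats on the two hosts from the question (context path and ports are illustrative):

```
# httpd.conf fragment (illustrative)
<Proxy balancer://marmotta-cluster>
    BalancerMember http://data.example.com:8080/marmotta
    BalancerMember http://data-02.example.com:8080/marmotta
</Proxy>
ProxyPass        /marmotta balancer://marmotta-cluster
ProxyPassReverse /marmotta balancer://marmotta-cluster
```

with both Marmotta instances building IRIs against the balancer's public name:

```
kiwi.host    = http://data.example.com/marmotta/
kiwi.context = http://data.example.com/marmotta/
```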
> > Since I already have one instance up and containing data, would it work to bring the second instance online, configure the cluster settings of both and then the database, and have it synchronize the data?
>
> Since they are both accessing the same database, there is no need to synchronize manually. Caches are then automatically synchronized by Hazelcast (as soon as it has finished discovering its peers).
>
> Clustering in Marmotta generally won't help you with big datasets. But it will help you with high concurrent loads. The clustering functionality currently implemented essentially provides two features:
> - a cluster-wide cache so that database lookups for frequently used nodes and triples can be reduced; this won't help you if you are always requesting different data or run SPARQL queries; it will help you if you are repeatedly accessing the same nodes and triples
> - a cluster-wide synchronization and locking mechanism to make sure the cluster members all share the same data and no inconsistencies are created; this will actually SLOW DOWN your single-process operations and is useful only in highly concurrent setups
>
> If you want to improve performance for single-dataset single-user situations, don't use the clustering mechanism. Use and tune the PostgreSQL database backend instead. Make sure you read http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
>
> > I'm not super concerned with speed at the moment. My goal for phase 1 is to develop an ontology for the applications that will hook into the system. Phase 1.5 is making LDPath apps that will return select entries via json to use in a webapp. The webapp collects data from these curated lists and submits the changes back to marmotta.
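The tuning page above mostly comes down to a handful of postgresql.conf settings. The values below are only illustrative starting points for a dedicated machine with around 8 GB of RAM in the PostgreSQL 9.x era, not recommendations:

```
# postgresql.conf (illustrative starting points)
shared_buffers = 2GB             # roughly 25% of RAM
effective_cache_size = 6GB       # what the OS is expected to cache
work_mem = 64MB                  # per sort/hash operation; beware concurrency
maintenance_work_mem = 512MB     # speeds up index builds during bulk loads
checkpoint_segments = 32         # fewer, larger checkpoints during heavy writes
```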
> > My issue is that even the uniprot dataset is 85GB compressed rdf-xml so it will be difficult to work with the amount of data I would need to properly connect LDPath apps to clients. I was hoping I could have at least different contexts if not an even spread of data among the instances.
>
> 85 GB should not be terribly much, especially for read access. Did you try loading it with the marmotta loader? It will take some time, but should work. Make sure to tune your database first, though (not sure about the optimal settings for MySQL).
>
> >>> I also found the Apache Giraph project which claims to offer native node/edge processing for graph databases. Has anyone used this? I would be *very* interested to play around if it could connect to Marmotta.
>
> >> We have an experimental backend that uses Titan DB. It'd be great if someone could evolve Marmotta in that direction!
>
> Giraph serves a different purpose: it is a highly scalable graph processing framework, not a database. As such, it allows you to parallelize typical graph operations (like shortest path computations) and run them on a Hadoop cluster. This is totally different from the kind of operations needed by Marmotta (e.g. to support SPARQL querying). If you would like to have a clustered database backend, you could try the Titan backend with HBase or Cassandra, but I am not completely convinced it will be faster than PostgreSQL.
>
> > Yes, I have been looking into Giraph to ideally serve as a native processing engine to Marmotta. As I start to fill the database with connections I make (on top of established datasets like uniprot) I was looking into apps that could start to find trends in my data. It would be ideal to have an app connect in natively rather than trying to load some new permutation of data into hadoop and running it there.
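A bulk load with the standalone loader jar might look like the command below; the jar name/version and the flag names here are assumptions from memory of the loader's usage output, so verify them with -h against your build before relying on them:

```
# Bulk-load the gzipped RDF/XML dump into the KiWi backend
# (jar name/version and flags are assumptions -- verify with -h)
java -Xmx4g -jar marmotta-loader-kiwi-3.3.0.jar \
    -z -f application/rdf+xml -i uniprot.rdf.gz
```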
> > My use case is that I will have **many** triples with a known predicate (middle column) and a plain literal value for value (third column). I would like it to work back to all known connections and start to find commonalities. It's easy in principle to think about and not super difficult to program. Think principal component analysis for each related node of a source node.
>
> You could also work directly on the Marmotta database in SQL. The triple format is easy enough to understand. ;-)
>
> >>> Lastly, what are people using to manage their ontologies? I found Protege a while back and installed WebProtege to manage ontologies. Is it possible that it connects to marmotta to keep the ontology synchronized? Are there any cool things WebProtege (or any ontology manager) can do with Marmotta?
>
> >> I am using emacs for managing ontologies ;-)
>
> > So, using an 'offline' ontology manager means you would make your changes to terms, save out a versioned copy, diff it against the last export and run it against your marmotta to establish the new ontology in DB?
>
> Depends, probably nowadays I would use SPARQL Update queries to apply changes to the ontology. I am not aware of any nice tools for this, though.
>
> Greetings,
>
> Sebastian
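Following up on the "work directly on the Marmotta database in SQL" remark: in the KiWi schema, triples and their nodes live in two tables, so the known-predicate/literal-object use case can be sketched roughly as below. Table and column names are from the KiWi DDL as I remember it, and the predicate URI is hypothetical, so verify both against your Marmotta version:

```sql
-- Count the distinct literal values used with one known predicate
-- (table/column names per the KiWi schema; verify for your version)
SELECT o.svalue AS literal_value, COUNT(*) AS uses
FROM triples t
JOIN nodes p ON t.predicate = p.id
JOIN nodes o ON t.object    = o.id
WHERE p.svalue = 'http://example.org/ns#myPredicate'  -- hypothetical predicate
  AND o.ntype  = 'string'
  AND t.deleted = false
GROUP BY o.svalue
ORDER BY uses DESC;
```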