Hi Sebastian,

Thank you for chiming in. Some clarifying questions below.
> From: Sebastian Schaffert [sebastian.schaff...@gmail.com]
> Sent: Monday, October 06, 2014 07:24
> To: users@marmotta.apache.org
> Subject: Re: Zookeeper, Giraph, and WebProtege
>
> Hi Joshua,
>
> as the person responsible for the clustering and caching, let me add a
> bit to Sergio's explanation:
>
>> 2014-10-03 9:12 GMT+02:00 Sergio Fernández <wik...@apache.org>:
>> Hi Joshua,
>>
>> On 01/10/14 16:00, Joshua Dunham wrote:
>>> It looks like there are quite a few options to configure the cluster.
>>>
>> Yes, you have the details at: http://marmotta.apache.org/platform/cloud
>>
>>> Can someone answer,
>>> 1. First let me clarify: do the clustering options in Marmotta > Core >
>>> Settings > clustering.{address,backend,enabled,mode} need to be
>>> configured when using Zookeeper?
>>>
> Zookeeper complements the regular configuration for cloud-based
> installations, where several nodes can read the same configuration.
>
> In Marmotta, there are two independent functionalities related to running
> a cluster of installations:
> - the ZooKeeper integration provides *central configuration management*
>   for Marmotta instances; of course, this makes most sense if you are
>   running a cluster, but in theory it can also be used to run many
>   individual, independent instances
> - the clustered caching (the configuration options starting with
>   clustering., as you correctly observed) is responsible for making sure
>   Marmotta runs properly in a load-balancing setup, by keeping caches in
>   sync and providing appropriate cluster-wide locking to ensure
>   synchronization between instances
>
>>> 2. Which is the preferable backend? I'm not familiar with the pros/cons
>>> of the options, but from looking around at some docs I think Hazelcast
>>> is a 'safe' bet?
>
>> We currently support Guava and Ehcache for local caches, and Hazelcast
>> and Infinispan for clusters. AFAIK Hazelcast is currently the most
>> stable and tested one, and it's used in production.
>
> Guava for single-instance setups, otherwise Hazelcast. The other backends
> are more experimental. Infinispan is powerful in large setups, because it
> also supports dedicated cluster servers (HotRod Server), but this has not
> been tested extensively and is significantly more complex, with more
> overhead. EHCache has somewhat more intelligent memory management (it
> expires cached objects based on the memory they occupy, while all the
> other backends simply count objects, so with many large objects you might
> run into out-of-memory situations), but otherwise it introduces more
> overhead than Guava.

So to get started, I would set up two instances of Tomcat and two
instances of Marmotta (ideally on separate hardware). I would configure
the cluster settings on both as follows, making sure that .name and .port
are not already in use:

    clustering.address = 226.6.7.8
    clustering.backend = hazelcast
    clustering.enabled = on
    clustering.mode = replicated
    clustering.name = marmotta
    clustering.port = 46655

I would then connect both to the same database backend instead of the
default H2. (Of the supported databases I would need to use MySQL; we
predominantly use Oracle, not PostgreSQL.)

- Or should I connect the MySQL database first and then do the cluster
  configuration?
- If I installed and configured one instance to use clustering first,
  configured it to use Zookeeper, and configured the backend DB, then
  installed a new instance of Marmotta and hooked that to Zookeeper, will
  magic happen and sort out all the settings for each?
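To make sure I understand the moving parts, here is roughly what I picture
ending up in system-config.properties on *both* nodes. This is only a
sketch: the clustering.* values are the ones above (assuming the boolean
is stored as true/false), the database.* keys are my reading of the
configuration docs, and the host and credentials are placeholders:

    # identical on both nodes: the shared cache cluster
    clustering.enabled = true
    clustering.backend = hazelcast
    clustering.mode = replicated
    clustering.name = marmotta
    clustering.address = 226.6.7.8
    clustering.port = 46655

    # identical on both nodes: one shared database instead of embedded H2
    # (hypothetical host and credentials)
    database.type = mysql
    database.url = jdbc:mysql://db.example.com:3306/marmotta
    database.user = marmotta
    database.password = secret

Am I right that, once ZooKeeper manages the configuration, it is exactly
this kind of shared setting that a freshly installed node would pick up
automatically?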
If the 'master' is data.example.com and the 'replicant' is
data-02.example.com, do I still mint my RDF with IRIs pointing at the
master? Will the replicant have a different IRI, so that I would need to
reference a different resource at that endpoint? Or do I 'cheat' and put
the same IRI in the system-config.properties of both?

Since I already have one instance up and containing data, would it work to
bring the second instance online, configure the cluster settings of both
and then the database, and have it synchronize the data?

>>> 3. There are three options for mode. Based on the descriptions I would
>>> say that distributed is what I want, but there is a third option,
>>> 'Replicated', which is not described. What exactly does this do?
>>
>> Yes, it accepts those three values:
>>
>> * In LOCAL cache mode, the cache is not shared among the servers in a
>> cluster. Each machine keeps a local cache. This allows quick startups
>> and eliminates network traffic in the cluster, but subsequent requests
>> to different cluster members cannot benefit from the cached data.
>>
> It is even worse: the synchronization features among cluster members will
> not be available then either. In short: don't use LOCAL when you are
> running a cluster.
>
>> * In DISTRIBUTED cache mode, the cluster forms one big hash table that
>> is used as a cache. This allows efficient use of the large amount of
>> memory available across the cluster.
>>
>> * In REPLICATED cache mode, all nodes of the cluster hold a complete
>> cache that is automatically replicated. This makes operations that
>> traverse the whole graph, such as SPARQL querying, more efficient.
>>
>> I think the decision about the mode depends more on the concrete needs
>> and the backend used.
>
>>> My datasets are too large to run on one instance, I think, and I would
>>> like to become familiar with the clustering options Marmotta offers. If
>>> I wanted to have N instances running, each holding a portion of the
>>> total dataset, is this possible? Ideally there is some sort of master
>>> that I query, and it will collect the triples regardless of which
>>> server the data is on. I've seen the walkthrough at the Marmotta site
>>> but wanted to see if that will get me where I'd like to be. :)
>
>> That's exactly the idea. Just provide sufficient resources for the
>> database.
>
> Clustering in Marmotta generally won't help you with big datasets. But it
> will help you with high concurrent loads. The clustering functionality
> currently implemented essentially provides two features:
> - a cluster-wide cache, so that database lookups for frequently used
>   nodes and triples can be reduced; this won't help you if you are always
>   requesting different data or running SPARQL queries; it will help you
>   if you are repeatedly accessing the same nodes and triples
> - a cluster-wide synchronization and locking mechanism to make sure the
>   cluster members all share the same data and no inconsistencies are
>   created; this will actually SLOW DOWN your single-process operations
>   and is useful only in highly concurrent setups
>
> If you want to improve performance for single-dataset, single-user
> situations, don't use the clustering mechanism. Use and tune the
> PostgreSQL database backend instead. Make sure you read
> http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

I'm not super concerned with speed at the moment. My goal for phase 1 is
to develop an ontology for the applications that will hook into the
system. Phase 1.5 is making LDPath apps that will return selected entries
as JSON for use in a webapp (I sketch an example of what I mean further
down); the webapp collects data from these curated lists and submits the
changes back to Marmotta.

My issue is that even the uniprot dataset is 85GB of compressed RDF/XML,
so it will be difficult to work with the amount of data I would need to
properly connect LDPath apps to clients. I was hoping I could have at
least different contexts, if not an even spread of the data, among the
instances.

>>> I also found the Apache Giraph project, which claims to offer native
>>> node/edge processing for graph databases. Has anyone used this? I would
>>> be *very* interested to play around with it if it could connect to
>>> Marmotta.
>
>> We have an experimental backend that uses Titan DB. It'd be great if
>> someone could evolve Marmotta in that direction!
>
> Giraph serves a different purpose: it is a highly scalable graph
> processing framework, not a database. As such, it allows you to
> parallelize typical graph operations (like shortest-path computations)
> and run them on a Hadoop cluster. This is totally different from the kind
> of operations needed by Marmotta (e.g. to support SPARQL querying). If
> you would like to have a clustered database backend, you could try the
> Titan backend with HBase or Cassandra, but I am not completely convinced
> it will be faster than PostgreSQL.

Yes, I have been looking into Giraph, ideally to serve as a native
processing engine for Marmotta. As I start to fill the database with
connections I make (on top of established datasets like uniprot), I have
been looking into apps that could start to find trends in my data. It
would be ideal to have an app connect natively rather than having to load
some new permutation of the data into Hadoop and run it there.

My use case is that I will have *many* triples with a known predicate
(middle column) and a plain literal as the value (third column). I would
like to work back through all known connections and start to find
commonalities. It's easy in principle to think about and not super
difficult to program. Think principal component analysis for each node
related to a source node.
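To make the commonality hunt concrete, here is the flavor of SPARQL I
imagine running against the endpoint as a first pass. This is only a
sketch: ex:measuredProperty is a hypothetical stand-in for one of my known
predicates, not a real vocabulary term.

    PREFIX ex: <http://data.example.com/vocab/>

    # Which literal values are shared by more than one resource,
    # and by how many? (ex:measuredProperty is a placeholder predicate.)
    SELECT ?value (COUNT(DISTINCT ?s) AS ?resources)
    WHERE {
      ?s ex:measuredProperty ?value .
      FILTER(isLiteral(?value))
    }
    GROUP BY ?value
    HAVING (COUNT(DISTINCT ?s) > 1)
    ORDER BY DESC(?resources)

Resources that share many such values would then be the candidates to feed
into the PCA-style step.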
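And going back to Phase 1.5 for a moment, this is roughly the kind of
LDPath program I picture behind one of the JSON endpoints. Again just a
sketch: the up: paths are my guesses at the UniProt core vocabulary, and
the field names are invented.

    @prefix up : <http://purl.uniprot.org/core/> ;
    @prefix rdfs : <http://www.w3.org/2000/01/rdf-schema#> ;

    /* one field per JSON key the webapp needs; both paths are placeholders */
    label = rdfs:label :: xsd:string ;
    organism = up:organism / up:scientificName :: xsd:string ;

Each program like this would back one curated list that the webapp reads
and then writes changes back through Marmotta.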
>>> Lastly, what are people using to manage their ontologies? I found
>>> Protege a while back and installed WebProtege to manage ontologies. Is
>>> it possible to connect it to Marmotta to keep the ontology
>>> synchronized? Are there any cool things WebProtege (or any ontology
>>> manager) can do with Marmotta?
>>>
> I am using emacs for managing ontologies ;-)

So, using an 'offline' ontology manager means you would make your changes
to terms, save out a versioned copy, diff it against the last export, and
run the diff against your Marmotta to establish the new ontology in the
DB?

> I hope I could clarify a bit more,
>
> Sebastian

Thanks! It was useful!

-J