Hi Joshua,
On 01/10/14 16:00, Joshua Dunham wrote:
It looks like there are quite a few options to configure the cluster.
Yes, you have the details at: http://marmotta.apache.org/platform/cloud
Can someone answer,
1. First let me clarify, the clustering options in Marmotta > Core > Settings >
clustering.{address,backend,enabled,mode} need to be configured when using Zookeeper?
ZooKeeper complements the regular configuration for cloud-based
installations, allowing several nodes to read the same configuration.
Marmotta expects the global configuration at /marmotta/config/* in
ZooKeeper, although per-cluster configurations can be specified
at /marmotta/clusters/:name/config/*. More details in the link provided
above.
For managing configuration in ZooKeeper you may find this tool
written by Thomas useful: https://bitbucket.org/srfgkmt/zoomanager
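As a rough sketch, seeding that layout from the ZooKeeper command-line
shell could look like this (the server address, the cluster name "demo",
and the concrete property values are only placeholders for illustration):

```shell
# open a session against your ZooKeeper ensemble
bin/zkCli.sh -server zk1.example.org:2181

# global configuration read by every Marmotta node
create /marmotta ""
create /marmotta/config ""
create /marmotta/config/clustering.enabled "true"

# overrides for a particular cluster named "demo"
create /marmotta/clusters ""
create /marmotta/clusters/demo ""
create /marmotta/clusters/demo/config ""
create /marmotta/clusters/demo/config/clustering.backend "HAZELCAST"
```

Thomas' zoomanager tool mentioned above saves you from creating these
nodes by hand.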
2. Which is the preferable backend? I'm not familiar with the pros/cons of the
options, but looking around at some docs I think Hazelcast is a 'safe' good
bet?
We currently support Guava and Ehcache for local caches, and Hazelcast
and Infinispan for clusters. AFAIK Hazelcast is currently the most
stable and tested one, and it is already used in production.
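For reference, selecting a backend comes down to two keys in Marmotta's
configuration; a minimal sketch (the exact spelling of the value may
differ between Marmotta versions, so double-check in the admin UI):

```
clustering.enabled = true
clustering.backend = HAZELCAST
```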
3. There are three options for mode. Based on the description I would say that
distributed is what I want but there is a third option ‘Replicated’ which is
not described. What exactly does this do?
Yes, it accepts those three values:
* In LOCAL cache mode, the cache is not shared among the servers in a
cluster. Each machine keeps a local cache. This allows quick startups
and eliminates network traffic in the cluster, but subsequent requests
to different cluster members cannot benefit from the cached data.
* In DISTRIBUTED cache mode, the cluster forms one big hash table used
as a cache. This makes efficient use of the large amount of memory
available across the cluster.
* In REPLICATED cache mode, every node of the cluster holds a complete
copy of the cache, which is automatically replicated. This makes
operations that traverse the whole graph, such as SPARQL querying,
more efficient.
I think the decision about the mode depends more on your concrete needs
and the backend used.
4. What is the best way to set the address? I think this would depend on the
backend mostly and also the network the server is in but I’m not sure what the
rules are.
That setting is the port used by the cluster; basically it's a mechanism
to avoid address clashes. Just be sure it is available when you configure
a new cluster. A value <= 0 will use the default port.
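Putting the four keys together, a minimal clustered configuration might
look like the following sketch (the multicast address is a placeholder;
pick one that does not clash with other clusters on your network):

```
clustering.enabled = true
clustering.backend = HAZELCAST
clustering.mode    = DISTRIBUTED
# must be free on your network; a value <= 0 uses the default (see above)
clustering.address = 228.6.7.8
```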
My datasets are too large to run on one instance I think and I would like to
become familiar with the clustering options Marmotta offers. If I wanted to
have N instances running, each holding a portion of the total dataset, is
this possible? Ideally there is some sort of master that I query and it will
collect the triples regardless of the server the data is on. I’ve seen the
walkthrough at the Marmotta site but wanted to see if that will get me where
I’d like to be. :)
That's exactly the idea. Just provide sufficient resources for the database.
I also found the Apache Giraph project which claims to offer native node/edge
processing for graph databases. Has anyone used this? I would be *very*
interested to play around if it could connect to Marmotta.
We have an experimental backend that uses Titan DB. It would be great if
someone could evolve Marmotta in that direction!
Lastly, what are people using to manage their ontologies? I found Protege a
while back and installed WebProtege to manage ontologies. Is it possible that
it connects to Marmotta to keep the ontology synchronized? Are there any cool
things WebProtege (or any ontology manager) can do with Marmotta?
Sorry, I'm not familiar with WebProtege. It just needs to implement a
writing method compatible with Marmotta (file, REST, SPARQL or LDP), and
then you can use it.
If you just need SKOS, this other tool may be relevant for you:
https://github.com/tkurz/skosjs . It just needs a SPARQL 1.1 endpoint to
edit your thesauri. More or less the same workflow would need to be in
place if you want to use WebProtege.
Hope that helps.
Cheers,
--
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 660 2747 925
e: sergio.fernan...@redlink.co
w: http://redlink.co