Hi Sebastian, Marmotta Users,
Reviving this thread a little bit: is there a howto for trying out TitanDB? I'm interested in Titan/Cassandra doing some of the heavy lifting for the backend work. Titan especially has some **very** interesting features (TinkerPop apps and native Hadoop 2 support). I could be a guinea pig in this regard. :)

-J

Joshua Dunham
Exploratory App Development | Vertex
E: joshua_dun...@vrtx.com
P: 617-229-5157
W: http://www.vrtx.com
L: http://linkedin.com/in/joshuadunham
S: joshua.dunham

From: Sebastian Schaffert [sebastian.schaff...@gmail.com]
Sent: Tuesday, October 07, 2014 06:45
To: users@marmotta.apache.org
Subject: Re: Zookeeper, Giraph, and WebProtege

Hi Joshua,

2014-10-06 22:42 GMT+02:00 Joshua Dunham <joshua_dun...@vrtx.com>:

> Hi Sebastian,
>
> Thank you for chiming in. Some clarifying questions below.

Trying to answer :-)

>>> 2. Which is the preferable backend? I'm not familiar with the pros/cons of
>>> the options, but looking around at some docs I think Hazelcast is a 'safe'
>>> good bet?
>
>> We currently support Guava and Ehcache for local caches, and Hazelcast and
>> Infinispan for clusters. AFAIK Hazelcast is currently the most stable and
>> tested one, and it's used in production.
>
> Guava for single-instance setups, otherwise Hazelcast. The other backends
> are more experimental. Infinispan is powerful in large setups, because it
> also supports dedicated cluster servers (HotRod Server), but this has not
> been tested extensively and is significantly more complex and has more
> overhead.

EHCache has a bit more intelligent memory management (i.e. it expires cached objects based on the memory they occupy, while all other backends simply count objects, so with many large objects you might run into out-of-memory situations), but otherwise it introduces more overhead than Guava.

> So to get started I would make two instances of Tomcat and two instances of
> Marmotta (ideally separate hardware).
> I would configure the cluster settings on both as such:
>
>     clustering.address = 226.6.7.8
>     clusting.backend = hazlecast
>     clustering.enabled = on
>     clustering.mode = replicated
>     clustering.name = marmotta
>     clustering.port = 46655

There is a typo there ("hazelcast"). Also, you should change clustering.mode to "distributed", because Hazelcast does not support "replicated". Not a problem if you forget it, but you will get a warning in the log. ;-)

> ... while making sure that .name and .port are not in use. I would then
> connect them to the same database backend instead of the default H2. (I need
> to use MySQL; we predominantly use Oracle, not PostgreSQL.)

You would then connect to the same database server (not sure if you referred to this when saying "backend"). More elegant is to set database.url in the Zookeeper tree and let Marmotta retrieve it from there. I am not sure about MySQL (we are not really working much with it), but you are welcome to try (and maybe tell us your experiences).

> - Or would I connect the MySQL db first and then do the cluster config?

The order doesn't really matter, as long as no users are accessing the system in parallel.

> - If I installed and configured one instance to use clustering first,
>   configured it to use Zookeeper, configured the backend DB, then installed
>   a new instance of Marmotta and hooked that to Zookeeper, is magic going to
>   happen and it will sort out all the settings for each?

That's the plan (this is how we use it). ;-) Unfortunately, all configuration you store via the user interface will only be stored in the local configuration file, because Marmotta does not know whether your configuration setting should affect all Marmotta instances, those in a cluster, or only the single instance you are configuring. For this to work you need to make sure you understand the way configuration is stored in Zookeeper.
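For reference, here is the quoted configuration block with both of Sebastian's fixes applied (the spelling of "hazelcast" and the clustering mode); the address, name, and port are just the example values from the thread:

```properties
# Marmotta cluster settings with the two fixes applied
# (address/name/port are the example values from the thread)
clustering.enabled = on
# was misspelled "hazlecast" in the original
clustering.backend = hazelcast
# Hazelcast does not support "replicated"
clustering.mode = distributed
clustering.address = 226.6.7.8
clustering.port = 46655
clustering.name = marmotta
```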
Look at http://marmotta.apache.org/platform/cloud.html; the Zookeeper tree contains three levels of Marmotta configuration:

- global level: configuration applied to all Marmotta instances using this Zookeeper server
- cluster level: configuration applied to all Marmotta instances in a named cluster (Servlet context init parameter zookeeper.cluster), e.g. the database URL
- instance level: configuration applied to a single Marmotta instance (e.g. for turning on logging or such things)

If you are interested I can send you a sample dump of a Zookeeper tree we are using.

> If the 'master' is data.example.com and the 'replicant' is
> data-02.example.com, do I still make my RDF with an IRI of the master? Will
> the replicant have a different IRI, and thus would I need to reference a
> different resource at that endpoint? Or do I 'cheat' and put the same IRI in
> the system-config.properties?

You would put a load balancer (e.g. a standard Apache HTTPD) in front of your two Tomcats and configure the Marmottas to use the same IRI for resources (which has to match the IRI of your load balancer). You can do this by manually setting the configuration variables kiwi.context (sets the prefix for constructing URIs) and kiwi.host (sets the prefix for accessing the admin UI).

> Since I already have one instance up and containing data, would it work to
> bring the second instance online, configure the cluster settings of both and
> then the database, and have it synchronize the data?

Since they are both accessing the same database, there is no need to synchronize manually. Caches are then automatically synchronized by Hazelcast (as soon as it has finished discovering its peers).

> Clustering in Marmotta generally won't help you with big datasets. But it
> will help you with high concurrent loads.
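To make the load-balancer suggestion above concrete, here is a minimal Apache HTTPD sketch (mod_proxy / mod_proxy_balancer). The hostnames are the ones from the question; the ports, paths, and module list are assumptions to illustrate the idea, not details from this thread:

```apache
# Requires mod_proxy, mod_proxy_http, mod_proxy_balancer
# (and a load-balancing method module such as mod_lbmethod_byrequests)
<Proxy "balancer://marmotta-cluster">
    BalancerMember "http://data.example.com:8080"
    BalancerMember "http://data-02.example.com:8080"
</Proxy>

ProxyPass        "/marmotta" "balancer://marmotta-cluster/marmotta"
ProxyPassReverse "/marmotta" "balancer://marmotta-cluster/marmotta"
```

Both Marmotta instances would then set kiwi.context and kiwi.host to the balancer's public URL, so resource IRIs come out identical no matter which backend served the request.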
> The clustering functionality currently implemented essentially provides two
> features:
>
> - a cluster-wide cache, so that database lookups for frequently used nodes
>   and triples can be reduced; this won't help you if you are always
>   requesting different data or running SPARQL queries, but it will help if
>   you are repeatedly accessing the same nodes and triples
> - a cluster-wide synchronization and locking mechanism to make sure the
>   cluster members all share the same data and no inconsistencies are
>   created; this will actually SLOW DOWN your single-process operations and
>   is useful only in highly concurrent setups
>
> If you want to improve performance for single-dataset, single-user
> situations, don't use the clustering mechanism. Use and tune the PostgreSQL
> database backend instead.

Make sure you read http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

> I'm not super concerned with speed at the moment. My goal for phase 1 is to
> develop an ontology for the applications that will hook into the system.
> Phase 1.5 is making LDPath apps that will return selected entries via JSON
> for use in a webapp. The webapp collects data from these curated lists and
> submits the changes back to Marmotta. My issue is that even the UniProt
> dataset is 85GB of compressed RDF/XML, so it will be difficult to work with
> the amount of data I would need to properly connect LDPath apps to clients.
> I was hoping I could have at least different contexts, if not an even spread
> of data among the instances.

85 GB should not be terribly much, especially for read access. Did you try loading it with the Marmotta loader? It will take some time, but it should work. Make sure to tune your database first, though (I'm not sure about the optimal settings for MySQL).

>>> I also found the Apache Giraph project, which claims to offer native
>>> node/edge processing for graph databases. Has anyone used this? I would be
>>> *very* interested to play around if it could connect to Marmotta.
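On the "tune your database first" point above, the linked PostgreSQL wiki page mostly comes down to a handful of postgresql.conf knobs. The values below are only illustrative starting points for a dedicated machine with roughly 16 GB of RAM, not recommendations from this thread:

```ini
# postgresql.conf fragment -- illustrative values only, adjust to your hardware
shared_buffers = 4GB            # ~25% of RAM is a common starting point
effective_cache_size = 12GB     # what the OS page cache can realistically hold
work_mem = 64MB                 # per sort/hash operation, so keep it modest
maintenance_work_mem = 1GB      # speeds up index builds during bulk loads
checkpoint_completion_target = 0.9
```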
>> We have an experimental backend that uses Titan DB. It would be great if
>> someone could evolve Marmotta in that direction!
>
> Giraph serves a different purpose: it is a highly scalable graph processing
> framework, not a database. As such, it allows you to parallelize typical
> graph operations (like shortest-path computations) and run them on a Hadoop
> cluster.

This is totally different from the kind of operations needed by Marmotta (e.g. to support SPARQL querying). If you would like a clustered database backend, you could try the Titan backend with HBase or Cassandra, but I am not completely convinced it will be faster than PostgreSQL.

> Yes, I have been looking into Giraph, ideally to serve as a native
> processing engine for Marmotta. As I start to fill the database with
> connections I make (on top of established datasets like UniProt), I have
> been looking into apps that could start to find trends in my data. It would
> be ideal to have an app connect natively rather than trying to load some new
> permutation of the data into Hadoop and running it there. My use case is
> that I will have **many** triples with a known predicate (middle column) and
> a plain literal value (third column). I would like to work back through all
> known connections and start to find commonalities. It's easy in principle to
> think about and not super difficult to program. Think principal component
> analysis for each related node of a source node.

You could also work directly on the Marmotta database in SQL. The triple format is easy enough to understand. ;-)

>>> Lastly, what are people using to manage their ontologies? I found Protege
>>> a while back and installed WebProtege to manage ontologies. Is it possible
>>> to connect it to Marmotta to keep the ontology synchronized? Are there any
>>> cool things WebProtege (or any ontology manager) can do with Marmotta?
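Going back to Sebastian's aside about working on the Marmotta database directly in SQL: here is a sketch of a "find commonalities" query over the triple tables. The table and column names (triples, nodes, svalue, ntype) and the 'string' type tag are assumptions based on the KiWi backend schema and should be verified against your actual database before use:

```sql
-- Count how often each (predicate, plain-literal value) pair occurs,
-- as a crude starting point for spotting shared values across resources.
-- Schema names are assumed from the KiWi backend; verify before use.
SELECT p.svalue AS predicate,
       o.svalue AS literal_value,
       COUNT(*) AS occurrences
FROM triples t
JOIN nodes p ON t.predicate = p.id
JOIN nodes o ON t.object = o.id
WHERE o.ntype = 'string'        -- plain literals (assumed type tag)
  AND t.deleted = false
GROUP BY p.svalue, o.svalue
HAVING COUNT(*) > 1
ORDER BY occurrences DESC;
```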
> I am using emacs for managing ontologies ;-)
>
> So, using an 'offline' ontology manager means you would make your changes to
> terms, save out a versioned copy, diff it against the last export, and run
> it against your Marmotta to establish the new ontology in the DB?

Depends; probably nowadays I would use SPARQL Update queries to apply changes to the ontology. I am not aware of any nice tools for this, though.

Greetings,

Sebastian
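As a concrete sketch of the SPARQL Update approach Sebastian mentions, a small ontology change could look like the following, sent to Marmotta's SPARQL update endpoint. The ex: prefix and the hasTarget property are made up for illustration:

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <http://example.com/ontology#>

# Replace the English label of a (hypothetical) ontology property
DELETE { ex:hasTarget rdfs:label ?old }
INSERT { ex:hasTarget rdfs:label "binds target"@en }
WHERE  { ex:hasTarget rdfs:label ?old }
```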