Hi Joshua,

there is not exactly a howto, but the main steps would be:

1) set up the backend server that is going to be used by Titan (e.g. HBase)
2) use the MarmottaLoader (single jar) for the respective backend (e.g. marmotta-loader-hbase) to load your triple data into it in bulk mode
3) copy the launchers/marmotta-webapp folder to a separate project directory and update the pom.xml so that instead of marmotta-backend-kiwi it uses marmotta-backend-titan; remove all other dependencies on KiWi modules (versioning, reasoning)
4) run mvn clean tomcat7:run in that directory
5) go to the configuration and set properties according to the Titan documentation; if a property in Titan is called FOO, the configuration name in Marmotta is titan.FOO
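For step 3, the pom.xml change might look roughly like this; the two backend artifactIds are the ones named above, but the list of KiWi-only modules to remove is from memory, so double-check it against your checkout:

```xml
<!-- in your copy of launchers/marmotta-webapp/pom.xml -->

<!-- remove the KiWi backend and any KiWi-only modules
     (versioning, reasoning), e.g.: -->
<!--
<dependency>
    <groupId>org.apache.marmotta</groupId>
    <artifactId>marmotta-backend-kiwi</artifactId>
</dependency>
-->

<!-- ... and use the Titan backend instead -->
<dependency>
    <groupId>org.apache.marmotta</groupId>
    <artifactId>marmotta-backend-titan</artifactId>
    <version>${marmotta.version}</version>
</dependency>
```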
In step 5 you would e.g. switch from the BerkeleyDB Titan backend used by default to HBase. Follow the documentation on the Titan web page from there.

Greetings,

Sebastian

2014-11-06 22:44 GMT+01:00 Joshua Dunham <joshua_dun...@vrtx.com>:

> Hi Sebastian, Marmotta Users,
>
> Reviving this thread a little bit, is there a howto for trying out TitanDB? I'm interested in titan/cassandra doing some of the heavy lifting for the backend work. Titan especially has some **very** interesting features (TinkerPop apps and native hadoop2 support).
>
> I could be a guinea pig in this regard. :)
>
> -J
>
> Joshua Dunham
> Exploratory App Development | Vertex
> E: joshua_dun...@vrtx.com
> P: 617-229-5157
> W: http://www.vrtx.com
> L: http://linkedin.com/in/joshuadunham
> S: joshua.dunham
>
> From: Sebastian Schaffert [sebastian.schaff...@gmail.com]
> Sent: Tuesday, October 07, 2014 06:45
> To: users@marmotta.apache.org
> Subject: Re: Zookeeper, Giraph, and WebProtege
>
> Hi Joshua,
>
> 2014-10-06 22:42 GMT+02:00 Joshua Dunham <joshua_dun...@vrtx.com>:
>
> > Hi Sebastian,
> >
> > Thank you for chiming in. Some clarifying questions below.
>
> Trying to answer :-)
>
> >>> 2. Which is the preferable backend? I'm not familiar with the pros/cons of the options but I think looking around at some docs that Hazelcast is a 'safe' good bet?
>
> >> We currently support Guava and Ehcache for local caches, Hazelcast, and Infinispan for clusters. AFAIK currently Hazelcast is the most stable and tested one, and it's currently used in production.
>
> Guava for single-instance setups, otherwise Hazelcast. The other backends are more experimental. Infinispan is powerful in large setups, because it also supports dedicated cluster servers (HotRod Server), but this has not been tested extensively and is significantly more complex and has more overhead. EHCache has a bit more intelligent memory management (i.e.
> it expires cached objects based on the memory they occupy, while all other backends simply take object counts, so when you have many large objects you might run into out-of-memory situations), but otherwise introduces more overhead than Guava.
>
> > So to get started I would make two instances of tomcat and two instances of marmotta (ideally separate hardware). I would configure the cluster settings on both as such,
> >
> > clustering.address = 226.6.7.8
> > clusting.backend = hazlecast
> > clustering.enabled = on
> > clustering.mode = replicated
> > clustering.name = marmotta
> > clustering.port = 46655
>
> There is a typo there ("hazelcast"). Also you should change clustering.mode to "distributed", because Hazelcast does not support "replicated". Not a problem if you forget it, but you will get a warning in the log. ;-)
>
> > while making sure that .name and .port are not in use. I would then connect them to the same database backend from the default H2. (I need to use mySQL, we predominantly use Oracle, not pSQL).
>
> You would then connect to the same database server (not sure if you referred to this when saying "backend"). More elegant is to set the database.url in the Zookeeper tree and let Marmotta retrieve it from there.
>
> I am not sure about MySQL (we are not really working much with it), but you are welcome to try (and maybe tell us your experiences).
>
> > - Or would I connect mySQL db first and then do the cluster config?
>
> The order doesn't really matter as long as no users are really accessing the system in parallel.
>
> > - If I installed and configured one instance to use clustering first, configured it to use zookeeper, configured the backend DB; then installed a new instance of marmotta and hooked that to zookeeper, is magic going to happen and it will sort out all the settings for each?
>
> That's the plan (this is how we use it).
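For reference, the quoted cluster section with both corrections applied (backend spelling fixed and mode switched to distributed) would read:

```
clustering.enabled = on
clustering.backend = hazelcast
clustering.mode = distributed
clustering.name = marmotta
clustering.address = 226.6.7.8
clustering.port = 46655
```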
> ;-)
>
> Unfortunately, all configuration you store via the user interface will only be stored in the local configuration file, because Marmotta does not know whether your configuration setting should affect all Marmotta instances, those in a cluster, or only the single instance you are configuring.
>
> For this to work you need to make sure you understand the way configuration is stored in Zookeeper. Look at http://marmotta.apache.org/platform/cloud.html; the Zookeeper tree contains three levels of Marmotta configurations:
> - global level contains configuration applied to all Marmotta instances using this Zookeeper server
> - cluster level contains configuration applied to all Marmotta instances in a named cluster (Servlet context init parameter zookeeper.cluster), e.g. for the database URL
> - instance level contains configuration applied to a single Marmotta instance (e.g. for turning on logging or such things)
>
> If you are interested I can send you a sample dump of a Zookeeper tree we are using.
>
> > If the 'master' is data.example.com and the 'replicant' is data-02.example.com, I still make my rdf with an IRI of the master? Will the replicant have a different IRI and thus I would need to reference a different resource at that endpoint? Or do I 'cheat' and put the same IRI in the system-config.properties
>
> You would put a load balancer (e.g. standard Apache HTTPD) in front of your two tomcats and configure the Marmottas to use the same IRI for resources (which has to match the IRI of your load balancer). You can do this by manually setting the configuration variables kiwi.context (sets the prefix for constructing URIs) and kiwi.host (sets the prefix for accessing the admin UI).
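As a sketch of that load-balancer setup, assuming Apache HTTPD with mod_proxy_balancer in front of Tomcats on the two hosts from the question (context path and ports are illustrative):

```
# httpd.conf fragment (illustrative)
<Proxy balancer://marmotta-cluster>
    BalancerMember http://data.example.com:8080/marmotta
    BalancerMember http://data-02.example.com:8080/marmotta
</Proxy>
ProxyPass        /marmotta balancer://marmotta-cluster
ProxyPassReverse /marmotta balancer://marmotta-cluster
```

with both Marmotta instances building IRIs against the balancer's public name:

```
kiwi.host    = http://data.example.com/marmotta/
kiwi.context = http://data.example.com/marmotta/
```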
> > Since I already have one instance up and containing data, would it work to bring the second instance online, configure the cluster settings of both and then the database, and have it synchronize the data?
>
> Since they are both accessing the same database, there is no need to synchronize manually. Caches are then automatically synchronized by Hazelcast (as soon as it has finished discovering its peers).
>
> Clustering in Marmotta generally won't help you with big datasets. But it will help you with high concurrent loads. The clustering functionality currently implemented essentially provides two features:
> - a cluster-wide cache so that database lookups for frequently used nodes and triples can be reduced; this won't help you if you are always requesting different data or run SPARQL queries; it will help you if you are repeatedly accessing the same nodes and triples
> - a cluster-wide synchronization and locking mechanism to make sure the cluster members all share the same data and no inconsistencies are created; this will actually SLOW DOWN your single-process operations and is useful only in highly concurrent setups
>
> If you want to improve performance for single-dataset single-user situations, don't use the clustering mechanism. Use and tune the PostgreSQL database backend instead. Make sure you read http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
>
> > I'm not super concerned with speed at the moment. My goal for phase 1 is to develop an ontology for the applications that will hook into the system. Phase 1.5 is making LDPath apps that will return select entries via json to use in a webapp. The webapp collects data from these curated lists and submits the changes back to marmotta.
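The tuning page above mostly comes down to a handful of postgresql.conf settings. The values below are only illustrative starting points for a dedicated machine with around 8 GB of RAM in the PostgreSQL 9.x era, not recommendations:

```
# postgresql.conf (illustrative starting points)
shared_buffers = 2GB             # roughly 25% of RAM
effective_cache_size = 6GB       # what the OS is expected to cache
work_mem = 64MB                  # per sort/hash operation; beware concurrency
maintenance_work_mem = 512MB     # speeds up index builds during bulk loads
checkpoint_segments = 32         # fewer, larger checkpoints during heavy writes
```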
> > My issue is that even the uniprot dataset is 85GB compressed rdf-xml so it will be difficult to work with the amount of data I would need to properly connect LDPath apps to clients. I was hoping I could have at least different contexts if not an even spread of data among the instances.
>
> 85 GB should not be terribly much, especially for read access. Did you try loading it with the marmotta loader? It will take some time, but should work. Make sure to tune your database first, though (not sure about the optimal settings for MySQL).
>
> >>> I also found the Apache Giraph project which claims to offer native node/edge processing for graph databases. Has anyone used this? I would be *very* interested to play around if it could connect to Marmotta.
>
> >> We have an experimental backend that uses Titan DB. It'd be great if someone could evolve Marmotta in that direction!
>
> Giraph serves a different purpose: it is a highly scalable graph processing framework, not a database. As such, it allows you to parallelize typical graph operations (like shortest path computations) and run them on a Hadoop cluster. This is totally different from the kind of operations needed by Marmotta (e.g. to support SPARQL querying). If you would like to have a clustered database backend, you could try the Titan backend with HBase or Cassandra, but I am not completely convinced it will be faster than PostgreSQL.
>
> > Yes, I have been looking into Giraph to ideally serve as a native processing engine to Marmotta. As I start to fill the database with connections I make (on top of established datasets like uniprot) I was looking into apps that could start to find trends in my data. It would be ideal to have an app connect in natively rather than trying to load some new permutation of data into hadoop and running it there.
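A bulk load with the standalone loader jar might look like the command below; the jar name/version and the flag names here are assumptions from memory of the loader's usage output, so verify them with -h against your build before relying on them:

```
# Bulk-load the gzipped RDF/XML dump into the KiWi backend
# (jar name/version and flags are assumptions -- verify with -h)
java -Xmx4g -jar marmotta-loader-kiwi-3.3.0.jar \
    -z -f application/rdf+xml -i uniprot.rdf.gz
```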
> > My use case is that I will have **many** triples with a known predicate (middle column) and a plain literal value for value (third column). I would like it to work back to all known connections and start to find commonalities. It's easy in principle to think about and not super difficult to program. Think principal component analysis for each related node of a source node.
>
> You could also work directly on the Marmotta database in SQL. The triple format is easy enough to understand. ;-)
>
> >>> Lastly, what are people using to manage their ontologies? I found Protege a while back and installed WebProtege to manage ontologies. Is it possible that it connects to marmotta to keep the ontology synchronized? Are there any cool things WebProtege (or any ontology manager) can do with Marmotta?
>
> >> I am using emacs for managing ontologies ;-)
>
> > So, using an 'offline' ontology manager means you would make your changes to terms, save out a versioned copy, diff it against the last export and run it against your marmotta to establish the new ontology in DB?
>
> Depends, probably nowadays I would use SPARQL Update queries to apply changes to the ontology. I am not aware of any nice tools for this, though.
>
> Greetings,
>
> Sebastian
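Following up on the "work directly on the Marmotta database in SQL" remark: in the KiWi schema, triples and their nodes live in two tables, so the known-predicate/literal-object use case can be sketched roughly as below. Table and column names are from the KiWi DDL as I remember it, and the predicate URI is hypothetical, so verify both against your Marmotta version:

```sql
-- Count the distinct literal values used with one known predicate
-- (table/column names per the KiWi schema; verify for your version)
SELECT o.svalue AS literal_value, COUNT(*) AS uses
FROM triples t
JOIN nodes p ON t.predicate = p.id
JOIN nodes o ON t.object    = o.id
WHERE p.svalue = 'http://example.org/ns#myPredicate'  -- hypothetical predicate
  AND o.ntype  = 'string'
  AND t.deleted = false
GROUP BY o.svalue
ORDER BY uses DESC;
```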