Re: Modelling complex data structures (graphs and trees for example)

Colin Yates Sat, 09 Jul 2011 06:15:47 -0700

I did think about moving this logic to the database, but I am toying around
with a different model - having the entire data set in memory (possibly
across multiple nodes using messaging infrastructure to communicate).  The
reason for this is:

 - writes are very small, but the read load is very high
 - each read typically requires complex processing
 - most operations cover a large part of the entire dataset

Paying the cost of having the entire data set *efficiently* available for
the application (Clojure in this case) means:

 - less dependence on (probably hard to test) yet-another-bit-of-tech.
 Integration testing DAOs or Repositories always seems like a lot of work.
 Reducing the technical pieces just makes things much easier
 - I am hoping clever use of persistent structures will help here, as there
is a lot of commonality in the data itself (i.e. 5 projects might actually
share 80% of the same state).  Clever use in constructing these might pay
dividends...
 - I don't think I can offload *all* processing onto a third party
technology, so I need the ability to deal with large data sets in memory in
real time (whatever that means) anyway - if I need it for one, I may as well
use it for all.
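
A minimal sketch of the structural sharing mentioned above, assuming a
project is a plain Clojure map.  The key names and values are hypothetical
and only illustrate that a derived project reuses the unchanged parts of the
original rather than copying them:

```clojure
;; Hypothetical data: the keys and values are illustrative only.
(def base-project
  {:name  "base"
   :nodes {:a {:label "A"} :b {:label "B"} :c {:label "C"}}
   :edges #{[:a :b] [:b :c]}})

;; Deriving a variant builds a new map; base-project itself is unchanged.
(def variant
  (-> base-project
      (assoc :name "variant")
      (assoc-in [:nodes :c :label] "C2")))

;; identical? tests reference equality, so it shows what is shared.
(identical? (:edges base-project) (:edges variant))   ; => true
(identical? (get-in base-project [:nodes :a])
            (get-in variant [:nodes :a]))             ; => true
(identical? (:nodes base-project) (:nodes variant))   ; => false
```

Only the path from the root down to the changed node is rebuilt; everything
else is the same object in both projects.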

Ambitious, and full of hairy concerns!  But the idea of moving away from
single-threaded web-based applications with big powerful data engines to a
single chunk of logic that occasionally throws state to a fairly dumb
persistent store is certainly not new ground, and seems to offer a much more
powerful architecture.

For example, dealing with historical data is always a pain point.  What I
want is the ability to snapshot the entire system whenever anything changes,
to allow us to see how the system (or client rather) has improved.  In a
relational database, this would be ridiculous, so I captured a "snapshot of
interesting data".  Tomorrow they realise that something else was
interesting....  We also played with document stores (MongoDB), which make
the job much much smaller - just cloning a single document (and related
data), but then it has to be hydrated, so for ease of use a snapshot is
taken every X period, even if the data hasn't changed.  Yuck.

Now Clojure appears, with its extremely efficient (in terms of memory) way
of storing data, and suddenly it feels like storing a representation every
time the structure changes (which is only once or twice a week) and then
realising the entire history in memory is now do-able.  This means if a
Project only changed 5 times over a 3 month period there would only be 5
instances of that project in storage.  Calculating how each project
contributes to a historical chart broken down by day (or hour whatever) is
much much easier to do in Java/Clojure/whatever than in the third party store of
choice.  I am asserting that providing a sequence for a project for every
day over the last year when there are only 5 snapshots will certainly not
consume sizeOfProject * daysInYear memory.
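
A rough sketch of that assertion, under the assumption that snapshots are
kept in a sorted map keyed by a day number.  The names `snapshots`,
`state-on` and `daily-states` and the project fields are hypothetical:

```clojure
;; Hypothetical history: day on which a snapshot was stored -> project state.
;; Five changes over roughly three months.
(def snapshots
  (sorted-map
    1  {:name "p" :open-tasks 40}
    20 {:name "p" :open-tasks 35}
    45 {:name "p" :open-tasks 31}
    60 {:name "p" :open-tasks 22}
    80 {:name "p" :open-tasks 10}))

(defn state-on
  "Latest snapshot taken on or before `day`, or nil if none exists yet."
  [history day]
  (some-> (rsubseq history <= day) first val))

(defn daily-states
  "Lazy seq of [day state] pairs for days `from`..`to` inclusive."
  [history from to]
  (for [day (range from (inc to))]
    [day (state-on history day)]))

(def days (daily-states snapshots 1 90))

(count days)                              ; => 90
(count (distinct (map second days)))      ; => 5
;; Day 31 and day 41 both point at the very same snapshot object.
(identical? (second (nth days 30))
            (second (nth days 40)))       ; => true
```

The per-day sequence holds 90 references but only 5 project values, so the
extra memory is proportional to the number of days, not to
sizeOfProject * days.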

(Not sure that was the best example of the pain points I am trying to solve
actually :), but anyway).

I guess, after 15 years of using the "web, app-logic, database"
template-cutter I am giving myself a clean piece of paper and asking "what
do you want to do and what is the simplest way to do it", and keeping
everything in the application layer (rather than the persistence layer)
seems appealing.

We aren't dealing with billions of rows - I still need to experiment, but it
feels like having our entire data set in memory is possible on a fairly
beefy server.  I appreciate the JVM isn't the best wrt huge heaps, but I can
work around that (with multiple virtual machines each running their own JVM
and using ActiveMQ for example).  Clojure's STM seems to be the final step
on the ladder to reach this goal.
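
As a small illustration of where the STM could fit, here is a sketch that
assumes the current state and its history live in two refs that must change
together.  The names `current`, `history` and `update-project!` are
hypothetical:

```clojure
;; Hypothetical refs: the live project and the list of its earlier states.
(def current (ref {:name "p" :open-tasks 40}))
(def history (ref []))

(defn update-project!
  "Applies f (plus args) to the current state and records the previous
  state in history, both inside one transaction."
  [f & args]
  (dosync
    (alter history conj @current)
    (apply alter current f args)))

(update-project! assoc :open-tasks 31)

@current   ; => {:name "p", :open-tasks 31}
@history   ; => [{:name "p", :open-tasks 40}]
```

Because both `alter` calls run inside one `dosync`, a reader never sees the
new state without the matching history entry.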

I have previously considered CouchDB (for its views), Hadoop (for its highly
scalable and parallelisable map/reduce execution), Cassandra for its ability
to store huge amounts of highly nested structure, Neo4j to store large
numbers of small nodes that are heavily inter-related.  And of course,
MongoDB, which I am currently using in production.  I also considered Erlang
and Scala for their distributed VM actor models, but I am really really sold
on the power of LISP macros.

I dunno - might be a fool's errand, but spreading the complexity over that
much technology just seems like hard work.  *If* the working set can be
stored in current memory then I think a much simpler, and much more powerful
solution will emerge.  Sure, I am putting all my eggs in
Clojure+my-own-ability, and at the risk of re-inventing the wheel, but maybe
that is the right thing to do  - building the simplest and most elegant
solution with new tools.

I probably ate something that disagreed with me, but I just want to break
free from the shackles of these heavy-weight tools and fly!  OK - that's
enough.

Or, it might all be a catastrophic failure and I will be signing up to
Careers 2.0 :)

Col

P.S.  Usual disclaimer - I have still only written three lines of Clojure :)

On 8 July 2011 20:57, James Keats <james.w.ke...@gmail.com> wrote:

>
>
> On Jun 16, 3:08 pm, Colin Yates <colin.ya...@gmail.com> wrote:
> > (newbie warning)
> >
> > Our current solution is an OO implementation in Groovy and Java.  We
> > have a (mutable) Project which has a DAG (directed acyclic graph).
> > This is stored as a set of nodes and edges.  There are multiple
> > implementations of nodes (which may themselves be Projects).  There
> > are also multiple implementations of edges.
> >
> > My question isn't how to do this in a functional paradigm, my first
> > question is *how do I learn* to do this in a functional paradigm.  I
> > want to be able to get the answer myself ;).  To that end, are there
> > any "domain driven design with functional programming" type resources?
> >
> > A more specific question is how do I model a graph?  These graphs can
> > be quite extensive, with mutations on the individual nodes as well as
> > the structure (i.e. adding or removing branches).  Does this mean that
> > every node would be a ref?  I think the general answer is that
> > the aggregate roots are refs, meaning they are an atomic block, but is
> > there any more guidance?
>
> May I humbly suggest that this ought to be a database-ish concern
> rather than a middleware one? have you looked at neo4j for example? A
> quick google found this:
>
> http://wiki.neo4j.org/content/Roles
>
> "This is an implementation of an example found in the article A Model
> to Represent Directed Acyclic Graphs (DAG) on SQL Databases by Kemal
> Erdogan. ... In Neo4j storing the roles is trivial, as working with
> graphs is what Neo4j was designed for"
>
> I would humbly suggest that you use as much of the database
> functionality as possible for your data needs and avoid replicating it
> in your middleware. I hope this works. :-)
>
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with
> your first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
