On Thu, Dec 18, 2008 at 11:06 PM, Chouser <chou...@gmail.com> wrote:
>
> I've pondered a couple approaches, though only enough to find
> problems.
>
> One approach would act like a Clojure collection, with structural
> sharing on-disk.  This would be great because it would have
> multi-versioning and transaction features built right in.  It would
> also have the potential to cache some data in memory while managing
> reads and writes to disk.
>
> But Clojure's persistent collections rely on garbage collection --
> when old versions of the collection are no longer referenced, the JVM
> cleans them up automatically.  How would this work on disk?  How would
> you define "no longer referenced"?

[1. just to confirm I understand you correctly: is this a
"transparent" case where each newly constructed data structure is
immediately written to disk?]

I can see another problem here: assuming we have multiple "active"
versions of such a structure when the application closes, which of
them should we restore when the application restarts? Do we need
explicit "save" and "restore" commands to mark the data we want
anyway? Or is it better to automatically keep track of "heads" per
(named) thread? Or use a (transparent) Ref?

Another issue is the efficiency of the GC itself. Whatever scheme we
use, Clojure is going to rely on GC heavily. This may cause heavy
fragmentation of data on disk (not to mention the performance penalty).
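
To make the "heads" idea concrete, here is a rough sketch (in Python,
only for brevity) of one way "no longer referenced" could be defined:
a node is garbage when no named head can reach it. NodeStore, put,
set_head and sweep are made-up names, and the "disk" is just a dict,
so this is an illustration under those assumptions and not a design.

```python
# Hypothetical sketch: an append-only node store with named heads.
# A node counts as "no longer referenced" when no head can reach it.

class NodeStore:
    def __init__(self):
        self.nodes = {}    # node id -> (value, tuple of child ids)
        self.heads = {}    # head name -> root node id
        self.next_id = 0

    def put(self, value, children=()):
        """Append a new immutable node and return its id."""
        node_id = self.next_id
        self.next_id += 1
        self.nodes[node_id] = (value, tuple(children))
        return node_id

    def set_head(self, name, node_id):
        """Point a named head at a root; the old root may become garbage."""
        self.heads[name] = node_id

    def reachable(self):
        """Mark phase: collect every node reachable from any head."""
        seen = set()
        stack = list(self.heads.values())
        while stack:
            node_id = stack.pop()
            if node_id in seen:
                continue
            seen.add(node_id)
            stack.extend(self.nodes[node_id][1])
        return seen

    def sweep(self):
        """Sweep phase: delete unreachable nodes and return their ids."""
        live = self.reachable()
        dead = sorted(n for n in self.nodes if n not in live)
        for n in dead:
            del self.nodes[n]
        return dead


store = NodeStore()
shared = store.put("shared tail")
v1 = store.put("version 1", [shared])
v2 = store.put("version 2", [shared])   # shares structure with v1
store.set_head("main", v1)
store.set_head("main", v2)              # v1 is now unreferenced
removed = store.sweep()
```

After the sweep only v1 is removed; the shared tail survives because
v2 still points at it. On restart, the named heads would also answer
the question of which versions to restore.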

> Another approach would be at the Ref or Agent level, where watchers
> could be hooked in.  (Watchers are currently only for agents, but are
> planned for refs as well.)  Watchers are functions that are called
> when their underlying mutable object has a change committed, so they'd
> be able to sync the disk up with new in-memory value.  But this means
> the whole collection would have to be in-memory.  Also the watcher
> gets no hint as to *what* in the collection changed.
>
> So for now it seems we'll have to make do with "normal" mechanisms
> like SQL libraries.

[2. is this an "opaque" case where the data sits in memory and the
on-disk representation is updated only when we switch the Ref?]

Hiding the whole data structure behind the Ref and syncing data only
when it changes would solve both problems mentioned above. The cost is
that the data would no longer be updated incrementally, so the whole
structure would have to be flushed to disk on every change. This
wouldn't be very efficient but would work with existing database
back-ends.
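
A minimal sketch of this "flush everything on change" variant, again
in Python for brevity. The Ref class below is a hypothetical stand-in
for a Clojure Ref with watchers, not Clojure's actual API, and JSON in
a temporary file stands in for whatever back-end would really be used.

```python
import json
import os
import tempfile


class Ref:
    """Hypothetical stand-in for a Ref that notifies watchers on change."""

    def __init__(self, value):
        self.value = value
        self.watchers = []

    def add_watch(self, fn):
        self.watchers.append(fn)

    def set(self, new_value):
        old_value = self.value
        self.value = new_value
        for fn in self.watchers:
            fn(old_value, new_value)


def make_flusher(path):
    """Return a watcher that rewrites the whole value to `path`."""
    def flush(old_value, new_value):
        # The watcher gets no hint about what changed, so it writes
        # everything: first to a temp file, then an atomic rename.
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(new_value, f)
        os.replace(tmp, path)
    return flush


path = os.path.join(tempfile.mkdtemp(), "state.json")
ref = Ref({})
ref.add_watch(make_flusher(path))
ref.set({"a": 1})
ref.set({"a": 1, "b": 2})   # the full map is written again

with open(path) as f:
    restored = json.load(f)
```

There is only ever one version on disk, so the restart question goes
away, at the price of rewriting the full structure each time.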

Perhaps it would be better to use a combination of 1. and 2., i.e. not
only to hide data structures behind a Ref and commit changes to disk
only when the Ref changes (as in 2.) but also to use a journal for
tracking "modifications" to data so that only incremental changes have
to be written. Such a journal could be translated (and optimized) into
a batch of INSERT and DELETE commands.
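
Roughly what I have in mind, sketched in Python with the standard
sqlite3 module. Note one simplification: the journal here is derived
by diffing the old and new map when the Ref changes, not recorded
while the modifications happen. The kv table and the function names
are assumptions made up for the example.

```python
import sqlite3


def diff_to_journal(old, new):
    """Return ("delete", key) and ("insert", key, value) entries."""
    journal = []
    for key in old:
        if key not in new or old[key] != new[key]:
            journal.append(("delete", key))
    for key, value in new.items():
        if key not in old or old[key] != value:
            journal.append(("insert", key, value))
    return journal


def apply_journal(conn, journal):
    """Replay the journal as DELETE/INSERT statements in one transaction."""
    with conn:
        for entry in journal:
            if entry[0] == "delete":
                conn.execute("DELETE FROM kv WHERE k = ?", (entry[1],))
            else:
                conn.execute("INSERT INTO kv (k, v) VALUES (?, ?)", entry[1:])


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")

# First change of the Ref: everything is new.
old, new = {}, {"a": "1", "b": "2"}
apply_journal(conn, diff_to_journal(old, new))

# Second change: "a" is untouched, so it is not written again.
old, new = new, {"a": "1", "b": "3", "c": "4"}
journal = diff_to_journal(old, new)
apply_journal(conn, journal)

rows = dict(conn.execute("SELECT k, v FROM kv"))
```

A changed key shows up as a DELETE followed by an INSERT; an optimizer
could later collapse such pairs into an UPDATE.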

Another question: which kinds of data structures should be supported?
List, vector and hash, or maybe a new specialized type (a table)?
Performance characteristics are going to be very different from those
of in-memory data, so maybe it makes sense for the whole mechanism to
be opaque.

-r.
