Re: [elephant-devel] Querying Advice

Ian Eslick Wed, 28 Feb 2007 14:33:24 -0800

Daniel,

Not sure if you're still interested, but I got motivated to startthinking about a query strategy for Elephant. I have a much moresophisticated version I'll check into contrib/eslick/queries in a dayor two, but my first experimental hack can be found in the mainelephant directory in query.lisp and query-example.lisp.

They're not loaded by default in the .asd file so if you want to playwith them load them manually and check out the queries in query-example.lisp. I'm changing the constraint syntax in the new version,but I'd be interested in any feedback that query.lisp might generate.

This version does not support join constraints (the returnedinstances being filtered based on a constrained subset of instancesof another class).

The plan is to provide a little syntax language and a map interfacethat calls a function on each instance of the instances that satisfythe query. The goal of this approach is to avoid having to load allobjects into memory by default. I'm not sure yet if this map willallow for side effects (remove all objects satisfying the query)unless we have a special header that says 'delete these' so we canuse the underlying cursor-del function.

The constraint language should be extensible, ultimately, but I'vebeen fighting with syntax and semantics for awhile so am going to putthat off. So the query process is:


- Parse syntax into a constraint graph
- Transform constraint graph into a query planning graph
- Find an efficient (optimal?) query plan
- Execute plan (interpreted)

Once the syntax, graphs and planner is cleanly fleshed out, it shouldbe easy to perform the first three steps inside a macro and generatecompiled code from the query plan instead of interpreting it at runtime.

If anyone is interested in having a deeper discussion on this, we canput it up on the new Trac Wiki (see link from new web page).


Cheers,
Ian

PS - This is a distraction from chasing the platform bugs related togetting 0.6.1 wrapped up. If anyone wants to move the 0.6.1 releasealong, you could work on OpenMCL integration, adding a few featuresto migration or adding a couple of simple tests (see TODO file inelephant root directory).



On Nov 15, 2006, at 12:16 PM, Daniel Salama wrote:

Ian,

Thank you for the detailed explanation.
I think, at least, I was under the impression that Elephant was anODB. At least, that's how the first sentence of the web sitepresents it: "Elephant is an object database for Common Lisp" :)Maybe this could confuse other newcomers so you may want to revisitthe web site.
Now, under the understanding that it is a framework to"transparently" persist data, it makes it more apparent that anyobject model needs to be done in Lisp and in memory and Elephantwill simply aid in persist the data in Lisp's memory (at least thatwhich you want to persist).
Going back to all this email thread, makes me wonder, whichdirection do you/we all want Elephant to go to. In your email youmention that it would be nice to incorporation some ODB-likefeatures into Elephant. Well, I think the approach to query datamay be affected as to the direction we want to take. If we justhave a framework to query data, it will serve one purpose. If weincorporate a "layer or two" to make it more ODB-like, that couldchange the panorama of where Elephant goes, potentially involvingchanges to DCM and also whatever querying framework is "developed".After all, it could end up looking something like CL-SQL with theability to _also_ use BDB as the backend :)
As for the indexing and deletion, it makes more sense. I still needto understand better how class and slot indices work. From yourexplanation, it seems as if the Elephant machinery is automaticallycreating these BTree indices on the class/slots. I will look intothe code for the get-instance* methods to learn more how it worksbehind the scenes (I suppose that's more like the Lispy mentalityof learning by looking at the library's source code - somethingwhich I still need to get used to)
Will probably bother more later as questions/concerns come up.However, as I mentioned before, I'd like to contribute to theproject, specially in the querying layer/framework and possibly, ifthat's where we go, the ODB layer(s). I just need to know what willbe the direction we want to go.
Thanks again,
Daniel

On Nov 15, 2006, at 6:58 AM, Ian Eslick wrote:
Daniel,

There are a couple of things about the Elephant model that should be
made clear. Elephant is not a very high level DB system. It'sfocus is
persistence of data.  In effect, it is a collection of indexing
mechanisms and a metaobject protocol that helps us to store/retrieve
data.   There are three ways to persist data.
1) Put it in the root, or another BTree that you have manuallycreated
and put in the rot
2) Create a persistent object. A persistent object has twomanifestations:a) A stub containing an OID and a reference to the controllerthe
object is stored in
     b) A set of slot values stored in a master slot-value BTree

When you do a slot access, the OID and slot name are used to go into
this (hidden) master slot-value BTree to find the actual slot value.
This is done as a tradeoff so if you are writing an integer slotin an
object with a large string, you don't have to load that string into
memory to get ACID properties on that integer slot.

When you load a persistent object from the root or another index, all
that is loaded into memory is the stub.  All slot accesses go
independently back to the on-disk store to get their values. Thismeansthat the on-disk value and locking together guarantee that anynumber of
processes or threads within a process can use the same Elephant store
and guarantee ACID semantics (BDB & SQL handle locking for us).  For
most lisp applications with a single image, bookkeeping like Robert's
system make more sense, but you have to worry about locking yourself.
DCM pre-loads all the slot values but updates the on-disk version on
writes for CID properties - atomicity has to be explicitly managed.
We're hoping to make slot caching a natural feature of elephanteventually.
3) Use slot and class indices.  Underneath, this is just #1 with
functions to make the BTrees simple to use. When you created anindexedobject, the object is added to a BTree and a secondary index iscreatedfor slot values on top of the master slot-value index. It's theclass
index that guarantees you can reach the object.
Now if you are hacking on elephant, this is the mental model youneed.Eventually we're hoping that it becomes sophisticated enough yourarely
need this level of detail (query language, flexible persistence
semantics, etc).  However the persistence of lisp values and the
persistent vs. normal class objects means it will never be perfect.
Manual deletions are actually quite hard to do in such a system,becauseif you delete an object and forget to delete the references thenyou endup with a corrupted data representation that will fail silently(returngarbage or null slot values) if you reload a reference. Alsodeletingan object means removing it from all referring data structures andalso
deleting (separately) each slot value.  Thus the only safe method of
deletion is GC which is not yet implemented except by manuallymigratingyour DB which copies everything reachable from root & class-root.I'm alittle worried for large DBs that GC can't happen when the DBitself issome factor larger than the available working memory. I know howto fix
that, but it's a little involved.

Some more comments below.

PS - All BDB log files start out at 10MB.  They store bookkeeping
information so take up more room than you might expect from thestored
data itself.

Daniel Salama wrote:
Ian et al,

Based on my comment to Robert and that of Pierre, could you, or
anyone, please clarify this for me (and maybe others):
If making a class persistent means that there is no need to addit to
root or to any persistent collection, when I look at Robert's sample
code, I see (excerpts):

(defclass User ()
  ((username :type 'string :initform "" :initarg :uname :accessor
username-of)
   (password :type 'string :initform "" :initarg :pword :accessor
password-of)
(email :type 'string :initform "" :initarg :email :accessoremail-of)
   (fullname :type 'string :initform "" :initarg :fullname :accessor
fullname-of)
   (balance :type 'integer :initform 0 :initarg :balance :accessor
balance-of)))

(defun random-users (n)
  (dotimes (x n)
    (let ((u (make-instance
              'User
              :uname (format nil "user~A" x)
              :pword (random-password)
              :email (format nil "[EMAIL PROTECTED]" x)
              :fullname (format nil "~A~A ~A~A" (random-password) x
(random-password) x)
              :balance (random 100))))
      (add-to-root x u))))
There is an explicit add-to-root in random-users. I suppose thereason
for this is because User class does not inherit from
persistent-metaclass and in order to be able to "search for" or
retrieve that object (could this also be the reason for theadditional
storage overhead, as you pointed out yesterday?). Right? So, if my
understanding is correct, defining User with defpclass instead would
mean that you don't have to add-to-root because it will be
automatically persisted. However, after the function exits, therewill
be no reference to that persistent object and will therefore be
eventually garbage collected (whether or not the persistent spacewillbe reclaimed is a different story, as you mentioned in youremail). Is
that right? If so, how could I avoid for the User objects to be
garbage collected in this case, since there really is no other
reference to these objects after creating them? Or, if theobjects are
NOT garbage collected, how could I manually "delete" any of them?

Now, if I (or Robert) had defined the User class as:

(defpclass User ()
  ((username :type 'string :initform "" :initarg :uname :accessor
username-of :index t)
   (password :type 'string :initform "" :initarg :pword :accessor
password-of)
(email :type 'string :initform "" :initarg :email :accessoremail-of)
   (fullname :type 'string :initform "" :initarg :fullname :accessor
fullname-of)
   (balance :type 'integer :initform 0 :initarg :balance :accessor
balance-of)))

where (notice how it's defined with defpclass) the username slot is
indexed, the system would automatically store a reference to the
object in the slot index, and there would be no need to use the
add-to-root in random-users. Correct? If that's the case, howwould Ithen go about removing this user object from persistence? Wouldit be
by setting the indexed slot value to NIL?
See above.
On another note, if I want to create a collection of users, I don't
have to store these users in a collection. Simply making theminheritfrom persistent-metaclass and indexing them would do soautomatically
(just like the example above). Right? How about this, then:

(asdf:operate 'asdf:load-op :elephant-tests)

(in-package :elephant-tests)

(setf *default-spec* *testbdb-spec*)

(open-store *default-spec*)

(defpclass state ()
  ((abbr :type 'string :initform "" :initarg :abbr :accessor abbr-of
:index t)
(name :type 'string :initform "" :initarg :name :accessor name-of)))
(defpclass zip-code ()
  ((zip :type 'string :initform "" :initarg :zip :accessor zip-of
:index t)
(city :type 'string :initform "" :initarg :city :accessor city-of)
   (county :type 'string :initform "" :initarg :county :accessor
county-of)
   (state :initform nil :initarg :state :accessor state-of)))

(defmethod print-object ((obj state) stream)
  (format stream "State (abbr, name) = (~A, ~A)" (abbr-of obj)
(name-of obj)))

(defmethod print-object ((obj zip-code) stream)
  (format stream "Zip (zip, city, county, state) = (~A, ~A, ~A, ~A)"
(zip-of obj) (city-of obj) (county-of obj) (state-ofobj)))
(let* ((s1 (make-instance 'state :abbr "FL" :name "Florida"))
      (s2 (make-instance 'state :abbr "NY" :name "New York"))
(z1 (make-instance 'zip-code :zip "33015" :city"Miami" :county
"Dade" :state s1))
(z2 (make-instance 'zip-code :zip "13605" :city"Adams" :county
"Jefferson" :state s2))
      (z2 (make-instance 'zip-code :zip "33160" :city "Sunny Isles
Beach" :county "Dade" :state s1)))
  (print s1)
  (print s2)
  (print z1)
  (print z2)
  (print z3))
Here, I'm creating a couple of state objects and a couple of zip-code
objects. Since a zip-code can only belong to one state, they have a
reference back to the state in order to "quickly" determine thestate
they belong to.

A couple of questions/comments here:

1) If I wanted to ask a state to give me all the zip-code(s) within
it, would I create a slot in the state class to hold a collection of
zip-code references? Or would I simply create a state-classmethod like:
(defmethod get-zip-codes ((obj state))
  (get-instances-by-value 'zip-code 'state obj))
This does not work because the state slot in zip-code is notindexed.Also, from the code above, I don't know how to get a reference tothebtree in order to create a cursor so that I can linearly traversethezip-code(s) to return only those zip-code(s) which belong to thatstate.
Yes, you have to decide what queries you want to do and make trade-offsbetween space and time. Of course it's easy to make the decision"lots
of zip codes, small ranges in queries, large number of zip codes per
state = index OR moderate # zip codes, large ranges per state, small
number of states = add a slot and use a list or array" Then youadd the:index marker to the zip code or add a new slot to state to keeptrackof the association. There is no general notion of association asyou'llsee below. However that might be a nice thing to add - you startto get
relational notion into the simple object persistence model.
I guess that's the confusion. Elephant is NOT an object database- it
is a persistent object and collection facility.  The DB functionality
you (today) have to write yourself. It might be nice to add alayer ortwo that makes it more ODB like. But as Robert has done with DCM- you
can craft an object/storage model that works for your application.
Keeping references to objects in slots is an implicit DB but it does
require that you think in a way that RDB's do not.
Of course, I would probably want to index the state slot of zip-code
because there are tens of thousands of zip codes and I wouldn't want
to linearly traverse them, but I just wanted to illustrate theproblem
to get some additional feedback (the overhead of maintaining a
secondary index on state wouldn't matter too much to me because
changes to this class/objects are very rare but access is much more
frequent)

2) Maybe this is a totally different issue and mainly caused by my
lisp ignorance:

If I have:

(defparameter *s1* (get-instances-by-value 'state 'abbr "FL"))
(defparameter *s2* (get-instances-by-value 'state 'abbr "NY"))
(defparameter *z1* (get-instances-by-value 'zip-code 'zip "33015"))
(defparameter *z2* (get-instances-by-value 'zip-code 'zip "13605"))
(defparameter *z3* (get-instances-by-value 'zip-code 'zip "33160"))

and I want to remove the association of zip code 33160 from the "FL"
state object, I would think all I have to do is:

(setf (state-of *z3*) nil)
First problem here is that get-instances (note the plural) returns a
list.  However that will work.   You orphan the zip code by removing
it's reference to a state.  However there is a singular function that
returns the first instance or nil.

(defparameter *z3* (get-instance-by-value 'zip-code 'zip "13605)))

(setf (state-of *z3*) nil)


_______________________________________________
elephant-devel site list
elephant-devel@common-lisp.net
http://common-lisp.net/mailman/listinfo/elephant-devel

Re: [elephant-devel] Querying Advice

Reply via email to